HACKER Q&A
📣 wreath

Do you find working on large distributed systems exhausting?


I've been working on large distributed systems for the last 4-5 years, with teams owning a few services each or having different responsibilities for keeping the system up and running. We run into very interesting problems due to scale (billions of requests per month for our main public APIs) and the large amount of data we deal with.

I think it has advanced my career and expanded my skills, but I feel it's pretty damn exhausting to manage all this, even when following a lot of best practices and working with other highly skilled engineers.

I've been wondering recently if others feel this kind of burnout (for lack of a better word). Is the expectation that your average engineer should now be able to handle all this?


  👤 fxtentacle Accepted Answer ✓
Yes, I used to,

but No, I fixed it :)

Among other things, I am team lead for a private search engine whose partner-accessible API handles roughly 500 million requests per month.

I used to feel powerless and stressed out by the complexity and the scale, because whenever stuff broke (and it always does at this scale), I had to start playing politics, asking for favors, or threatening people on the phone to get it fixed. Higher management would hold me accountable for the downtime even when the whole S3 AZ was offline and there was clearly nothing I could do except for hoping that we'll somehow reach one of their support engineers.

But over time, management's "stand on the shoulders of giants" brainwashing wore off so that they actually started to read all the "AWS outage XY" information that we forwarded to them. They started to actually believe us when we said "Nothing we can do, call Amazon!". And then, I found a struggling hosting company with almost compatible tooling and we purchased them. And I moved all of our systems off the public cloud and onto our private cloud hosting service.

Nowadays, people still hold me (at least emotionally) accountable for any issue or downtime, but I feel much better about it :) Because now it actually is within my circle of power. I have root on all relevant servers, so if shit hits the fan, I can fix things or delegate to my team.

Your situation sounds like you will constantly take the blame for other people's faults. I would imagine that to be disheartening and extremely exhausting.


👤 solatic
What do you find exhausting?

One anti-pattern I've found is that most orgs ask a single team to handle on-call around the clock for their service. This rarely scales well, from a human standpoint. If you're getting paged at 2:00 in the morning on a regular basis you will start to resent it. There's not much you can do about that so long as only one team is responsible for uptime 24/7.

The solution is to hire operations teams globally and then set up follow-the-sun operations, whereby the people being paged are always naturally awake at that hour and can work normal eight-hour shifts. But this requires companies to, gasp, have specialized developers and specialized operators collaborate before allowing new feature work into production, to ensure that the operations teams understand what the services are supposed to do and can keep it all online. It requires (oh, the horror!) actually maintaining production standards, runbooks, and other documentation.

So naturally, many orgs would prefer to burn out their engineers instead.


👤 thebackup
My experience is that the expectations of what your average engineer should be able to handle have grown enormously during the last 10 years or so. Working with both large distributed systems and medium-sized monolithic systems, I have seen the expectations become a lot higher in both.

When I started my career the engineers at our company were assigned a very specific part of the product that they were experts on. Usually there were 1 or 2 engineers assigned to a specific area and they knew it really well. Then we went Agile(tm) and the engineers were grouped into 6 to 9 person teams that were assigned features that spanned several areas of the product. The teams also got involved in customer interaction, planning, testing and documentation. The days when you could focus on a single part of the system and become really good at it were gone.

The next big change came when the teams moved from being feature teams to DevOps teams. None of the previous responsibilities were removed, but we now also became responsible for setting up and running the (cloud) infrastructure and deploying our own software.

In some ways I agree that these changes have empowered us. But it is also, as you say, exhausting. Once I was simply a programmer; now I'm a domain expert, project manager, programmer, tester, technical writer, database admin, operations engineer, and so on.


👤 heisenbit
In these large-scale systems the boundaries are usually not well defined (there are APIs, but the data flowing through the APIs is another matter, as are operational and non-functional requirements).

Stress is often caused by a mismatch between what you feel responsible and accountable for and what you really control. The more you know, the more you feel responsible for, but you are rarely able to expand control as much or as fast as your knowledge. It helps to be very clear about where you have ultimate say (accountability), where you have control within some framework (responsibility), and where you simply know and contribute. Clear in your own mind, to others, and to your boss. Look at areas outside your responsibility with curiosity and a willingness to offer support, but know that you are not responsible and others need to worry.


👤 hliyan
For the first ten years of my career, I worked with distributed systems built on this stack: C++, Oracle, Unix (and to some extent, MFC and Qt). There were hundreds of instances of dozens of different types of processes (we would now call these microservices) connected via TCP links, running on hundreds of servers. I seldom found this exhausting.

For the second ten years of my career, I have worked with (and continue to work on) much simpler systems, but the stack looks like this: React/Angular/Vue.js, Node.js/Spring Boot, MongoDB/MySQL/PostgreSQL, Elasticsearch, Redis, AWS (about a dozen services right there), Docker, Kubernetes. _This_ is exhausting.

When you spend so much time wrangling a zoo of commercial products, each with its own API and often its own vocabulary for what should be industry standards (think TCP/IP, ANSI, ECMA, SQL), each constantly being obsoleted by competing "latest" products, that you don't have enough time to focus on code, then yes, it can be exhausting.


👤 qxmat
I've found that external tech requirements are horrible to work with, especially when the underlying stack simply doesn't support it. Normally these are pushed by certified cloud consultants or by an intrepid architect who found another "best practice blog."

It begins with small requirements, such as coming up with a disaster recovery plan, only for it to be rejected because your stack must "automatically heal" and devs can't be trusted to restore a backup during an emergency.

Blink and you're implementing redundant networking (cross-AZ route tables, DNS failover, SDN via gateways/load balancers), a ZooKeeper ensemble with >= 3 nodes in 3 AZs, per-service health checks, EFS/FSx network mounts for persistent data that an expensive enterprise app insists on storing on disk, and some kind of HA database/multi-master SQL cluster.

... months and months of work because a 2 hour manual restore window is unacceptable. And when the dev work is finally complete after 20 zero-downtime releases over 6 months (bye weekend!) how does it perform? Abysmally - DNS caching left half the stack unreachable (partial data loss) and the mission critical Jira Server fail-over node has the wrong next-sequence id because Jira uses an actual fucking sequence table (fuck you Atlassian - fuck you!).

If only the requirement was for a DR run-book + regular fire drills.


👤 nikhilsimha
I used to lead teams that owned a message bus, a stream processing framework, and a distributed scheduler (like k8s) at Facebook.

The on-call was brutal. At some point I thought I should work on something else, perhaps even switch careers entirely. However, this also forced us to separate user issues from system issues accurately, which was only possible because we were a platform team. Since then I have regained my love for distributed systems.

Another thing is, we had to cut down on the complexity: reduce the number of services that talked to each other to a bare minimum, weigh features for their impact vs. their complexity, and regularly rewrite stuff to reduce complexity.

Now, Facebook being Facebook, it valued speed and complexity over stability and simplicity, especially when it comes to career growth discussions. So it's hard to build good infra in the company.


👤 wilde
Without more info it’s hard to say. When I felt like this, a manager recommended I start journaling my energy. I kept a Google doc with sections for each week. In each section, there’s a bulleted list of things I did that gave me energy and a list of things I did that took energy.

Once you have a few lists some trends become clear and you can work with your manager to shift where you spend time.


👤 m_herrlich
I love building and developing software, and despite the fun and interesting challenges presented at my last job I quit because of the operations component. We adopted DevOps and it felt like "building" got replaced with "configuring" and managing complex configurations does not tickle my brain at all. Week-long on-call shifts are like being under house arrest 24/7.

I understand the value that developers bring to operational roles, and to some extent making developers feel the pain of their screwups is appropriate. But when DevOps is 80% Ops, you need a fundamentally different kind of developer.


👤 jmyeet
It's hard to answer this because you don't specify what exactly you find exhausting. Is it oncall? Deployment? Performance issues? Dealing with different teams? Failures and recovery? The right hand not knowing what the left hand is doing? Too many services? Something else?

It's not even clear how big your service is. You mention billions of requests per month. Every 1B requests/month translates to ~400 QPS, which isn't even that large. Like, that's single server territory. Obviously spikiness matters. I'd also be curious what you mean by "large amount of data".
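The conversion behind that figure is worth making concrete; a quick back-of-the-envelope sketch (assuming a 30-day month, so averages only, with spikiness ignored):

```python
# Average QPS implied by a monthly request count (30-day month assumed).
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000

def avg_qps(requests_per_month: float) -> float:
    return requests_per_month / SECONDS_PER_MONTH

print(round(avg_qps(1e9)))   # 1B/month  -> ~386 QPS on average
print(round(avg_qps(10e9)))  # 10B/month -> ~3858 QPS on average
```

Peak traffic is usually several times the average, but even so, "billions per month" says less about scale than it sounds.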


👤 jedberg
I find it exhilarating, but you have to have a well architected distributed system. Some key points:

- Your microservice should be able to run independently. No shared data storage, no direct access into other microservices' storage.

- Your service should protect itself from other services, rejecting requests before it becomes overloaded.

- Your service should be lenient on the data it accepts from other services, but strict about what it sends.

- Your service should be a good citizen, employing good backoffs when other services it is calling appear overloaded.

- The API should be the contract and fully describe your service's relationship to the other services. You should absolutely collaborate with the engineers who make other services, but at the end of the day anything you agree on should be built into the API.

Generally if you follow these best practices, you shouldn't have to maintain a huge working knowledge of the system, only detailed knowledge of your part, which should be small enough to fit into your mental model.

There will be a small team of people responsible for the entire system and how it fits together, but ideally if everyone is following these practices, they won't need to know details of any system, only how to read the APIs and the call graph and how the pieces fit together.
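As one concrete illustration of the "good citizen" point above, a minimal retry helper with capped exponential backoff and full jitter might look like this (a sketch only; the function names and defaults are mine, not any particular library's API):

```python
import random
import time

def call_with_backoff(call, max_attempts=5, base=0.1, cap=10.0):
    """Retry `call` with capped exponential backoff plus full jitter,
    so callers of a struggling service don't retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # Full jitter: sleep a random amount up to the capped exponential.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

In practice you would retry only on errors the callee marks as retryable, and combine this with load shedding on the server side.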


👤 sillysaurusx
Jobs aren’t exhausting. Teams are. If you find yourself feeling this way, consider that the higher ups may be mismanaging.

There’s often not a lot of organizational pressure to change anything. So the status quo stays static. But the services change over time, so the status quo needs to change with them.


👤 angarg12
I don't find it exhausting, I find it *exhilarating*.

After years of proving myself, earning trust, and strategic positioning, I am finally leading a system that will support millions of requests per second. I love my job, and this is the most intellectually stimulating activity I have done in a long while.

I think this is far from the expectation of the average engineer. You can find many random companies with very menial and low-stakes work. However, if you work at certain companies, you sign up for this.

BTW, I don't think this is unreasonable. This is precisely why programmers get paid big bucks, certainly in the US. We have a set of skills that require a lot of talent and effort, and we are rewarded for it.

Bottom line this isn't for everyone, so if you feel you are done with it that's fair. Shop around for jobs and be deliberate about where you choose to work, and you will be fine.


👤 artiscode
Your story hits close to home. I was part of a team that integrated our newly-acquired startup with a massive, complex, and needlessly distributed enterprise system, and it burned me out.

Being forced to do things that absolutely did not make sense (CS-wise) was what I found most exhausting. Having no option other than writing shitty code or copying functionality into our app led me to eventual burnout. My whole career felt pointless, as I was unable to apply any of the skills and expertise I had learned over all these years, because everything was designed in such a complex way. Getting a single property into an internal API is not a trivial task and requires coordination across different teams, as there are a plethora of processes in place. However, I helped to build a monstrous integration layer, and everything wrong with it is partly my doing. Hindsight is 20/20, and I now see there really was no other, better way to do it, which feels nice in a schadenfreude kind of way.

I sympathise with your point about not understanding what is expected of an average engineer nowadays. Whether you should take initiative and help manage things, whether you are allowed to simply write code, and what you should expect from others were among my pain points. I certainly did not feel rewarded for going the extra mile, but somehow felt obliged because of my "senior" title.

I went to therapy, worked on side projects, and I'm now trying out a manager role. My responsibilities are pretty much the same, but I don't have to write code anymore. It feels empowering to close my laptop after my last Zoom meeting and not think about bugs, code, CI, or merging tomorrow morning because tomorrow is release day.

But hey, the grass is always greener on the other side! I think going to therapy was one of my life's best decisions after being put through the wringer. Perhaps it will help you as well!


👤 throwaway984393
It's exhausting when the business does not give you the support you need and leans on you to do too much work. Find another place to work where they do things without stress (ask them in the interview about their stress levels and workload). Make sure leadership are actively prioritizing work that shores up fundamental reliability and continuously improves response to failure.

When things aren't a tire fire, people will still ask you to do too much work. The only way to deal with it without stress is to create a funnel.

Require all new requests come as a ticket. Keep a meticulously refined backlog of requests, weighted by priorities, deadlines and blockers. Plan out work to remove tech debt and reduce toil. Dedicate time every quarter to automation that reduces toil and enables development teams to do their own operations. Get used to saying "no" intelligently; your backlog is explanation enough for anyone who gets huffy that you won't do something out of the blue immediately.


👤 bob1029
> We run into very interesting problems due to scale (billions of requests per month for our main public apis) and the large amount of data we deal with.

So, if you are handling 10 billion requests per month, that would average out to about 4k per second.

Are these API calls data/compute intensive, or is this more pedestrian data like logging or telemetry?

Any time I see someone having a rough time with a distributed system, I ask myself if that system had to be distributed in the first place. There is usually a valuable lesson to be learned by probing this question.


👤 jacquesm
That question probably needs more information.

But your 'average engineer' is probably better served by asking whether the system really needed to be that large and distributed, rather than whether working on it is exhausting. The vast bulk of the websites out there don't need that kind of overkill architecture; typically the non-scalable parts of the business preclude needing such a thing to begin with. If the work is exhausting, that sounds like a mismatch between the architecture choice and the size of the workforce responsible for it.

If you're an average (or even sub-average) engineer in a mid-sized company, stick to what you know best and how to make that work to your advantage: KISS. A well-tuned non-distributed system with sane platform choices will outperform a distributed system put together by average engineers any day of the week, and it will be easier to maintain and operate.


👤 softwarebeware
I find it "exhilarating," not "exhausting." But I also don't think that "...your average engineer should now be able to handle all this." That is where we went completely wrong as an industry. It used to be said that what we work on is complex, and you can either improve your tools or you can improve your people. I've always held that you will have to improve your people. But clever marketing of "the cloud" has held out the false promise that anyone can do it.

Lies, lies, and damn lies, I say!

Unless you have bright and experienced people at the top of a large distributed systems company, who have actually studied and built distributed systems at scale, your experience of working in such a company is going to suck, plain and simple. The only cure is a strong continuous learning culture, with experienced people around to guide and improve the others.


👤 ublaze
Yeah, large-scale systems are often boring in my experience, because the scale limits what features you can add to make things better. Each and every decision has to take scale into account, and it's tricky to try experimenting.

I think it has to do with the kind of engineer you are. Some engineers love iterating and improving such systems to be more efficient, more scalable, etc. But it can be limiting due to the slower release cycles, hyper focus on availability, and other necessary constraints.


👤 guilhas
Recently I was asked to work on an older project for enterprise customers. And we are always wary of working on old, unmaintained code.

But it just felt like a breath of fresh air

All code in the same repository: UI, back-end, SQL, MVC style. Fast from feature request to delivery in production. Changes, test, fix bugs, deploy. We were happy and the customers were too.

No cloud apps, no buckets, no secrets, no OAuth, little configuration, no Docker, no microservices, no proxies, no CI/CD. It does look like somewhere along the way we overcomplicated things.


👤 benlivengood
Google's SRE books cover a lot of the things that large teams managing large distributed systems encounter and how to tackle it in a way that doesn't burn out engineers. Depending on organization size/spread, follow-the-sun oncall schedules drastically reduce burnout and apprehension about outages. Incident management procedures give confidence when outages do happen. Blameless postmortems provide a pathway to understanding and fixing the root causes of troublesome outages. Automation reduces manual toil. Google SRE has been keeping a lot of things running for a decade or more and has learned a lot of lessons. I did that from 2014 to 2018 and it seemed like a pretty mature organizational approach, and the books document essentially that era.

👤 chubot
My take is that it's exhausting because everything is so damn SLOW.

"Back to the 70's with Serverless" is a good read:

https://news.ycombinator.com/item?id=25482410

The cloud basically has the productivity of a mainframe, not a workstation or PC. It's big and clunky.

----

I quote it in my own blog post on distributed systems

http://www.oilshell.org/blog/2021/07/blog-backlog-2.html

https://news.ycombinator.com/item?id=27903720 - Kubernetes is Our Generation's Multics

Basically I want basic shell-like productivity -- not even an IDE, just reasonable iteration times.

At Google I saw the issue where teams would build more and more abstraction and concepts without GUARANTEES. So basically you still have to debug the system with shell. It's a big tower of leaky abstractions. (One example is that I had to turn up a service in every data center at Google, and I did it with shell invoking low level tools, not the abstractions provided)

Compare that with the abstraction of a C compiler or Python, where you rarely have to dip under the hood.

IMO Borg is not a great abstraction, and Kubernetes is even worse. And that doesn't mean I think something better exists right now! We don't have many design data points, and we're still learning from our mistakes.

----

Maybe a bigger issue is incoherent software architectures. In particular, disagreements on where authoritative state is, and a lot of incorrect caches that paper over issues. If everything works 99.9% of the time, well, multiply those probabilities together, and you end up with a system that requires A LOT of manual work to keep running.
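Multiplying those probabilities out makes the point vivid; a toy calculation (assuming independent failures, which real systems rarely have, so reality is usually messier):

```python
# Chance that n components, each available 99.9% of the time, all work at once.
for n in (10, 50, 100):
    print(n, round(0.999 ** n, 3))
```

Three nines per component quietly becomes roughly one nine for a hundred-component system (0.999 ** 100 is about 0.905).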

So I think the cloud has to be more principled about state and correctness in order not to be so exhausting.

If you ask engineers working on a big distributed system where the authoritative state in their system is stored, then I think you will get a lot of different answers...


👤 seanwilson
It's okay to prefer working on small single server systems with small teams for example. I do this while contracting quite often and enjoy how much control you get to make big changes with minimal bureaucracy.

Sometimes it feels like everyone is focused on eventually working with Google scale systems and following best practices that are more relevant towards that scale but you can pick your own path.


👤 rhacker
Humans GET simplicity from extreme hyper complexity.

Take a gas generator. Easy, add oil and gas and get electricity and these days they even come in a smoothed over plastic shell that makes it look like a toy. Inside, very complex, spark plugs, engine, coils, inverter. A hundred years of inventions packed into a 1.5' x 1.5' box.

It's the same thing for complicated systems. Front end to back. No matter how ugly or how much you wish it was refactored - some exec knows it as a box where you put something in and magical inference comes out. Maybe that box actually causes real change in the physical world - like billions of packages being sent out all over the world.

In the days of castles you would have similar systems managed by people. People that drag wooden carts of shit out of a castle. Carrying water around. Manually husking corn and wheat and what have you.

No matter how far into the future we go, we will continue to get simple out of monstrous complexity.

That's not the answer to your question - but it's just that the world will always lean towards going that way.


👤 lumost
Handling scale is a technically challenging problem; if you enjoy it, then take advantage! However, sometimes taking a break to work on something else can be more satisfying.

Typically on a "high-scale" service spanning hundreds or thousands of servers, you'll have to deal with problems like: "How much memory does this object consume?", "How many ms will adding this regex/class into the critical path cost?", "We need to add new integ/load/unit tests for X to prevent outage Y from recurring", and "I wish I could try new technique Y, but 90% of my time is occupied by upkeep".

It can be immensely satisfying to flip to a low-scale, low-ops problem space and find that you can actually bang out 10x the features/impact when you're not held back by scale.

Source: Worked on stateful services handling 10 million TPS, took a break to work on internal analytics tools and production ML modeling, and am transitioning back to high-scale services shortly.


👤 karmakaze
I'm trying to relate this to my experiences. The best I can make of it is that burnout comes from dealing either with the same types of problems over and over, or with new problems arriving at a rate higher than old problems get resolved.

I've been in those situations. My solution was to ensure that there was enough effort into systematically resolving long-known issues in a way that not only solves them but also reduces the number of new similar issues. If the strategy is instead to perform predominantly firefighting with 'no capacity' available for working on longer term solutions there is no end in sight unless/until you lose users or requests.

I am curious what the split is of problems being related to:

1. error rates, how many 9s per end-user-action, and per service endpoint

2. performance, request (and per-user-action) latency

3. incorrect responses, bugs/bad-data

4. incorrect responses, stale-data

5. any other categories

Another strategy that worked well was not to fix the problems reported but instead fix the problems known. This is like the physicist looking for keys under the streetlamp instead of where they were dropped. Tracing a bug report to a root cause and then fixing it is very time-consuming. That of course needs to continue, but if sufficient effort is put into resolving known issues, such as latency or error rates of key endpoints, it can have an overall lifting effect, reducing problems in general.

A specific example was how effort into performance was toward average latency for the most frequently used endpoints. I changed the effort instead to reduce the p99 latency of the worst offenders. This made the system more reliable in general and paid off in a trend to fewer problem reports, though it's not easy/possible to directly relate one to the other.
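The shift from mean to p99 is easy to see with numbers; a toy illustration (nearest-rank percentile, made-up latencies):

```python
import statistics

latencies_ms = [20] * 980 + [2000] * 20  # 2% of requests hit a slow path

def percentile(values, p):
    # Nearest-rank: smallest value with at least p% of samples at or below it.
    ordered = sorted(values)
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

print(statistics.mean(latencies_ms))  # 59.6 -- the average looks healthy
print(percentile(latencies_ms, 99))   # 2000 -- the tail is where users hurt
```

Optimizing the mean here would polish the 20 ms path; attacking p99 goes straight at the slow 2% that generates the problem reports.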


👤 smoyer
Using micro-services instead of monoliths is a great way for software engineers to reduce the complexities of their code. Unfortunately, it moves the complexity to operations. In an organization with a DevOps culture, the software engineers still share responsibility for resolving issues that occur between their micro-service and others.

In other organizations, individual teams have ICDs and SLAs for one or more micro-services and can therefore state they're meeting their interface requirements as well as capacity/uptime requirements. In these organizations, when a system problem occurs, someone who's less familiar with the internals of these services will have to debug complex interactions. In my experience, once the root-cause is identified, there will be one or more teams who get updated requirements - why not make them stakeholders at the system-level and expedite the process?


👤 Too
One problem I frequently see with distributed systems is not the number of services or the distributed nature per se.

Rather, it's that they allow, and tempt, you to use the perfect tool for each job, leading to a lot of variation in your stack.

Suddenly you have 5 different databases, 3 RPC protocols, 4 programming languages, and 2 operating systems spinning around in your cluster. Only half of them are connected to your single sign-on. And don't forget about all the cloud dependencies.

If any one of them starts misbehaving, you have to read up on "how did I attach a debugger to a Java process again?" How do I even log in to a mongodb shell? I installed pgadmin last week.

Standardize your stack and accept that sometimes it might mean using something slightly inefficient in the small scheme of things. In the big scheme, it will make things more homogeneous, unified, and simpler for operators.


👤 glintik
The most undervalued thing, forgotten even by highly skilled engineers: the KISS principle. That's why you are burning out supporting such systems.

👤 avensec
Yes, but in a different way. I work in Quality Engineering, and the scope of what it takes to mature testing of distributed systems has been exhausting.

Reading other comments in the thread, I see similar frustrations in teams I partner with. How to employ patterns like contract testing, hypothesis testing, test doubles, or shape/data systems (etc.) typically gets conflated with system testing. Teams often disagree on the boundaries of the system, start leaning towards system testing, and end up adding additional complexity in tests that could be avoided.

My thought is that I see the desire to control more scope presenting itself in testing. I typically find myself doing some bounded-context exercises to try to home in on scope early.


👤 asim
Yup. Spent more than a decade doing it. Got so frustrated that I started a company to try to abstract it all away for everyone else. It's called M3O https://m3o.com. Everyone ends up building the same thing over and over: a platform with APIs, either built in-house or integrations to external public APIs. If we reuse code, why not APIs?

I should say, I've been a sysadmin, SRE, software engineer, open source creator, maintainer, founder, and CEO. Worked at Google, bootstrapped startups, VC-funded companies, etc. My general feeling: the cloud is too complex, and I'm tired of waiting for others to fix it.


👤 macksd
Mental / emotional burnout is certainly not uncommon in tech (and probably in most other careers, I'd bet). Most people in Silicon Valley change jobs more often than every 4-5 years. I don't like constantly being the new guy, but there is a refreshing feeling to starting on something new and not carrying years of technical debt on your emotions. Maybe it's time to try something new, take a bigger vacation than usual, or talk to someone about new approaches you can try in your professional or personal life. But certainly don't let the fact that you feel like this add to the load: you're not alone, and it's not permanent.

👤 eez0
I find it actually the other way around.

As you said, a benefit of large distributed systems is that responsibility is usually shared, with different teams owning different services.

The exhaustion comes into place when those services are not really independent, or when the responsibility is not really shared, which in turn is just a worse version of a typical system maintained by sysadmins.

One thing that helps is bringing DevOps culture into the company, but the right way. It's not just about "oh cool, we are now agile and deploy a few times a day"; it all comes down to shared responsibility.


👤 kortex
It definitely can be. I'm constantly trying to push our stack away from anti-patterns and towards patterns that work well, are robust, and reduce cognitive load.

It starts by watching Simple Made Easy by Rich Hickey. And then making every member of your team watch it. Seriously, it is the most important talk in software engineering.

https://www.infoq.com/presentations/Simple-Made-Easy/

Exhausting patterns:

- Mutable shared state

- distributed state

- distributed, mutable, shared state ;)

- opaque state

- nebulosity, soft boundaries

- dynamism

- deep inheritance, big objects, wide interfaces

- objects/functions which mix IO/state with complex logic

- code that needs creds/secrets/config/state/AWS just to run tests

- CI/CD deploy systems that don't actually tell you if they successfully deployed or not. I've had AWS task deploys that time out but actually worked, and ones that seemingly take, but destabilize the system.

---

Things that help me stay sane(r):

- pure functions

- declarative APIs/datatypes

- "hexagonal architecture" - stateful shell, functional core

- type systems, linting, autoformatting, autocomplete, a good IDE

- code does primarily either IO, state management, or logic, with minimal mixing of the others

- push for unit tests over integration/system tests wherever possible

- dependency injection

- ability to run as much of the stack locally (in docker-compose) as possible

- infrastructure-as-code (terraform as much as possible)

- observability, telemetry, tracing, metrics, structured logs

- immutable event streams and reducers (vs mutable tables)

- make sure your team takes time periodically to refactor, design deliberately, and pay down tech debt.
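A couple of the items above (pure functions, immutable event streams and reducers, logic separated from IO) can be shown in a minimal Python sketch. The names and the toy "bank account" domain are hypothetical, just to illustrate the shape:

```python
# Minimal sketch of "functional core, stateful shell":
# all business logic lives in pure functions that are trivial to
# unit-test; IO stays in a thin outer layer.

def apply_event(state: dict, event: dict) -> dict:
    """Pure reducer: returns a new state, never mutates the input."""
    if event["type"] == "deposit":
        return {**state, "balance": state["balance"] + event["amount"]}
    if event["type"] == "withdraw":
        return {**state, "balance": state["balance"] - event["amount"]}
    return state

def replay(events: list) -> dict:
    """Rebuild current state from an immutable event stream."""
    state = {"balance": 0}
    for event in events:
        state = apply_event(state, event)
    return state
```

The stateful shell (reading events from a queue, writing snapshots) would call `replay()`; the tests for the core never need credentials, AWS, or a network.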


👤 unnouinceput
I wrote such a system. 6+ years, between the end of '07 and the beginning of '14. It grew organically, with more and more end points as time went by, and when I exited the project it had over 250 end points, each handling hundreds of thousands of user requests per day. By your measurement, the system I wrote would've handled 250 (end points) x 30 (days) x ~400k (requests per day) == 3B user requests in a month.

To my knowledge the system is still used to this day and I think it grew 10x meanwhile, so I think it's serving over 30B requests each month.

That being said, to answer your question - Yes! I got tired of it, started to plateau, and felt I was lagging behind in terms of keeping up with the technology around me. So I exited, but at the same time I also started to get involved in other projects. In the end I was overworked, and I ditched the biggest project of my entire career as a freelancer because the payment was not worth it anymore. I wanted to feel excited, and the additional projects eventually made up for it in terms of money, but boy oh boy! The variation is what kept me from feeling burnout. Nowadays, if I feel another project is going that route, I discuss with the client replacing me with a team once I deliver the project in a stable state, ready for horizontal scaling.


👤 bofaGuy
Worked on a team at BofA; our application would handle 800 million events per day. The logic we had for retry and failure was solid. We also had redundancy across multiple DCs. I think we processed like 99.9999999% of all events successfully. (Basically all of them; last year we lost about 2,000 events total.) I didn't find it very stressful at all. We built in JMX utilities so our production support teams would be able to handle practically anything they'd need to.

👤 ChrisMarshallNY
TLDR; Yes, it is exhausting, but I have found ways to mitigate it.

I don't develop stuff that runs billions of queries. More like thousands.

It is, however, important infrastructure, on which thousands of people around the world, rely, and, in some cases, it's not hyperbole to say that lives depend on its integrity and uptime.

One fairly unique feature of my work, is that it's almost all "hand-crafted." I generally avoid relying on dependencies out of my direct control. I tend to be the dependency, on which other people rely. This has earned me quite a few sneers.

I have issues...

These days, I like to confine myself to frontend work, and avoid working on my server code, as monkeying with it is always stressful.

My general posture is to do the highest Quality work possible; way beyond "good enough," so that I don't have to go back and clean up my mess. That seems to have worked fairly well for me, in at least the last fifteen years, or so. Also, I document the living bejeezus[0] out of my work, so, when I inevitably have to go back and tweak or fix, in six months, I can find my way around.

[0] https://littlegreenviper.com/miscellany/leaving-a-legacy/


👤 phuff
I think there are a lot of strategies for dealing with the kinds of issues you're working with, but a lot of them involve building a good engineering culture and building a disciplined engineering practice that can adapt and find best scalability practices at that level.

We do billions of requests a day on one of the teams that I manage at work, and that team alone has sole operational and development responsibility for a large number of subsystems to be able to manage the complexity that a sustained QPS of that level requires. But those subsystems are in turn dependent on a whole suite of other subsystems which other teams own and maintain.

It requires a lot of coordination with a spirit of good will and trust among the parties in order to be able to develop the organizational discipline and rigor needed to be able to handle those kinds of loads without things falling over terribly all the time and everybody pointing fingers at each other.

But! There are lots of great people out there who have spent a lot of time figuring out how to do these things properly and who have come up with general principles that can be applied in your specific circumstances (whatever they may be). And when executed properly, I would argue that these principles can be used to mitigate the burnout you're talking about. It's possible to make it through those rough spots in an organization (that frequently, though not always, come from quick business scaling -- i.e. we grew from 1,000 customers to 10,000 last year), etc.

If you're feeling this kind of feeling and the organization isn't taking steps to work on it, then there are things you can do as an IC to help, too. But this is all a much longer conversation :)


👤 hughrr
Yes it’s horrible. I actually miss the early 00’s when I did infra and code for small web design agencies. I actually could complete work back then.

👤 nixgeek
Quite the opposite, interestingly. I’m usually in “Platform”-ish roles which touch or influence all aspects of the business, including building and operating services which do a couple orders of magnitude more than OP’s referenced scale (in the $job[current] case, O(100B - 1T) requests per day). While I agree with the “Upside” (career progression, intellectual interest, caliber of people you work with), I haven’t experienced the burnout, and in 2022 I’m actually the most energized I’ve been in a few years.

I expect you can hit burnout building services and systems at any scale; that’s more reflective of the local environment: the job and the day to day, the people you work with, formalized progression and career development conversations, the attitude to taking time off and decompressing, attitudes to oncall, compensation, and other facets.

That said, mental health and well-being are real and IMO need to be taken very seriously. If you’re feeling burnout, figuring out why and fixing that is critical. There have been too many tragedies, both during COVID and before :-(


👤 gorgoiler
My number one requirement for a distributed system is that the code all be one place.

There are good reasons for wanting multiple services talking through APIs. Perhaps you have a Linux scheduler that is marshalling test suites running on Android, Windows, macOS and iOS?

If all these systems originate from a single repository, preferably with the top level written in a dynamic language that runs from its own source code, then life can be much easier. Being able to change multiple parts of the infrastructure in a single commit is a powerful proposition.

You also stand a chance of being able to model your distributed system locally, maybe even in a single Python process, which can help when you want to test new infrastructure ideas without needing the whole distributed environment.
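To make this concrete, here's a hypothetical sketch of that idea: a scheduler and two "workers" as plain objects in one Python process, with in-memory queues standing in for the network, so you can try infrastructure ideas without a real cluster. All names here are made up for illustration:

```python
import queue

class Worker:
    """Stands in for a remote test-runner; its inbox replaces an RPC endpoint."""
    def __init__(self, name: str):
        self.name = name
        self.inbox: queue.Queue = queue.Queue()
        self.results: list = []

    def run_pending(self) -> None:
        # Drain the inbox, "executing" each job locally.
        while not self.inbox.empty():
            job = self.inbox.get()
            self.results.append((self.name, job))

class Scheduler:
    """Round-robin dispatch stands in for real placement logic."""
    def __init__(self, workers):
        self.workers = workers
        self._next = 0

    def submit(self, job) -> None:
        self.workers[self._next % len(self.workers)].inbox.put(job)
        self._next += 1
```

Because everything runs in one process, a single commit can change the scheduler and the workers together, and the whole "system" is exercisable from a unit test.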

Your development velocity will be faster and less painful. Slow, painful changes are what burn people out and grind progress to a halt.


👤 daneel_w
I find it very draining and vexing to work on systems that have all of their components distributed left and right without clear boundaries, instead of being more coalesced. Distribution in the typical sense - identical spares working in parallel for the sake of redundancy - doesn't faze me very much.

👤 ebbp
It’d be interesting to know - what are the expectations made of you? In this environment, I’d expect there to be dedicated support for teams operating their services - i.e. SRE/DevOps/Platform teams who should be looking to abstract away some of the raw edges of operating at scale.

That said, I do think there’s a psychological overhead when working on something that serves high levels of production traffic. The stakes are higher (or at least, they feel that way), which can affect different people in different ways. I definitely recognise your feeling of exhaustion, but I wonder if it maybe comes from a lack of feeling “safe” when you deploy - either from insufficient automated testing or something else.

(For context - I’m an SRE who has worked in quite a few places exactly like this)


👤 dudul
Let's set aside the "distributed" aspect. To effectively scale a team and a code base you need some concept of "modularization" and "ownership". It is unrealistic to expect engineers to know everything about the entire system.

The problem is that this division of the code base is really hard. It is really hard to find the time and the energy to properly section your code base in proper domains and APIs. Especially with the constantly moving target of what needs to be delivered next. Even in a monorepo it is exhausting.

Now, put on top of that the added burden brought by a distributed system (deployment, protocol, network issues, etc) and you have something that becomes even more taxing on your energy.


👤 axegon_
Depends. Not the systems themselves, but more the scope of the work and how it is being done. If the field is boring or the design itself is bad (with no ability to make it better, whether it's simply by design, code quality, or whatever), my motivation, will, and desire to work teleport to a different dimension; it's a fine line between exhaustion and frustration, I guess. If it is something interesting, I can work on it for days straight without sleeping. Lately I've been working on a personal project, and every time I have to do anything else I feel depressed for having to set it aside.

👤 alecbz
Can you say more? What specifically is exhausting?

Exhaustion/burnout isn't uncommon but without more context it's hard to say if it's a product of the type of work or your specific work environment.


👤 faangiq
Yes the complexity and scale of these systems is far beyond what companies understand. The salaries of engineers on these systems need to double asap or they risk collapse.

👤 yodsanklai
This post resonates with me. I recently joined a big organisation and a team owning such a system. The oncalls are very stressful to me. Our systems aren't that robust and we don't have control over all the dependencies, so things fail all the time. At the same time, management is consistently pushing for new features. As a consequence, work-life balance is bad and turnover is high.

My hope is that I'll learn to manage the stress and gain more expertise.


👤 anarazel
Is it really the distributed aspect? Or "just" working on an above-average complicated project for many years?

The consequences of bugs in many distributed systems (and several other types of systems) are IME often harder to bear than e.g. UI or frontend workflow bugs. It's hard to have caused data loss. And at some point you probably will, even if you're quite careful.

Maybe I'm just projecting...


👤 jeffrallen
Yes, it's part of why I'm a stay-at-home dad who does a little bash-scripting sysadmin work as a side job.

Everything has gotten too complicated and slow.


👤 kodah
If you're working on distributed systems scheduling and orchestration, then yeah it's exhausting. I did it for six years as a SRE-SE and am now back to being a SWE on a product team. If you like infrastructure stuff without having responsibility for the whole system the way that scheduling and orchestration makes you, then look at working on an infrastructure product.

👤 Arrezz
I think our field is so broad that it is somewhat nebulous to talk about the average engineer. But from my experience, taking care of such a large system with a large amount of requests and complexity is outside of what is expected of an average engineer. I think there is an eventual limit to how much complexity a single engineer can handle for several years.

👤 asdfman123
Relevant comedy video:

https://www.youtube.com/watch?v=y8OnoxKotPQ

This recent video they put out is pretty good, too:

https://www.youtube.com/watch?v=kHW58D-_O64


👤 systematical
I have 15 years of experience in dev, but all of that was in smaller projects and a small team. I recently took a gig in a bigger org with a distributed system, on call, etc. It's exhausting and information overload. I'll give myself more time to acclimate, but if I still feel like this after a year, I'm out.

👤 primeletter
I can see how it'd be exhausting to have to deal with the responsibility for the entirety of a few services.

A key part of scaling at an org-level is continuously simplifying systems.

At a certain level of maturity, it's common for companies to introduce a horizontal infra team (that may or may not be embedded in each vertical team).


👤 Simon_O_Rourke
It's not so much the systems, but the organizations which create systems in their own image so to speak. If making changes is hard, either in the organization or within teams, you better believe any changes to a distributed system will be equally tough to implement.

👤 bravetraveler
I did at first, but then learning config management and taking smaller bites helped.

I started out as a systems administrator and it's evolved into doing that more and faster. The tooling helps me get there, but I did have to learn how to give better estimates.


👤 SNosTrAnDbLe
I actually love it, and the more complex the system the better. I have been doing it for more than 10 years now, and every day I learn something new from both the legacy system and the replacement that we work on.

👤 tonto
I don't really work on distributed systems but I do often worry about performance and reliability and even if I get some wins sometimes the anxiety of not performing right is stressful....

👤 lr4444lr
Yes. But remember, with tools and automation getting better, this is a major source of value add that you bring as a software engineer which is likely to have long term career viability.

👤 tristor
I think I understand what you mean, but it’s hard for me to contextualize, because I’m still working through some of my own past to identify where some of my burn out began.

For my part, I love working at global scale on highly distributed systems, and find deep enjoyment in diving into the complexity that brings with it. What I didn’t enjoy was dealing with unrealistic expectations from management, mostly management outside my chain, for what the operations team I led should be responsible for. This culminated in an incident I won’t detail, but suffice to say I hadn’t left the office in more than 72 hours continuous, and the aftermath was I stopped giving a shit about what anyone other than my direct supervisor and my team thought about my work.

It’s not limited to operations or large systems, but every /job/ dissatisfaction I’ve had has been in retrospect caused by a disconnect between what I’m being held accountable for vs what I have control over. As long as I have control over what I’m responsible for, the complexity of the technology is a cakewalk in comparison to dealing with the people in the organization.

Now I’ve since switched careers to PM and I’ve literally taken on the role of doing things and being held responsible for things I have no control over and getting them done through influencing people rather than via direct effort. Pretty much the exact thing that made my life hell as an engineer is now my primary job.

Making that change made me realize a few things that helped actually ease my burn out and excite me again. Firstly, the system mostly reflects the organization rather than the organization reflecting the system. Secondly, the entire cultural balance in an organization is different for engineers vs managers, which has far-reaching consequences for WLB, QoL, and generally the quality of work. Finally, I realized that if you express yourself well you can set boundaries in any healthy organization which allows you to exert a sliding scale of control vs responsibility which is reasonable.

My #1 recommendation for you OP is to take all of your PTO yearly, and if you find work intruding into your time off realize you’re not part of a healthy organization and leave for greener pastures. Along the way, start taking therapy because it’s important to talk through this stuff and it’s really hard to find people who can understand your emotional context who aren’t mired in the same situation. Most engineers working on large scale systems I know are borderline alcoholics (myself too back then), and that’s not a healthy or sustainable coping strategy. Therapy can be massively helpful, including in empowering you to quit your job and go elsewhere.


👤 z3t4
Often when I hear stories of billions of requests, it's self-inflicted by an over-complicated architecture where all those requests are generated by only a few thousand customers. So it's usually a question of how the company operates: do you constantly fight fires, or do you spend your time implementing stuff that has high value for the company and its customers? Fighting fires can get you burned out (no pun intended), while feeling that you deliver a lot of value will make you feel great.


👤 ok123456
Yes. That's why you avoid building them unless you absolutely need to, and build libraries instead.

👤 revskill
Yes, a bit. But it's fun. And that motivation of fun is hard to find in a big monolithic system.

👤 dekhn
it's exhausting but can be fun if you have a competent team to support you. I like nothing more than being told "one TPU chip in this data center is bad. Find it efficiently at priority 0."

👤 shetill
I find any work exhausting

👤 helsinki
I find working on single services / components more exhausting.

👤 a_code
You are right, I work for a FAANG on one such system and it’s hard.

👤 the_gipsy
If you're burnt out, you're most likely being suckered.

👤 arielweisberg
Not at all. Stuff is usually fixable.

Org and people are not.


👤 qaq
Let's say for argument's sake it's 50 billion a month; that's ~20k/sec. There is zero need for a fancy setup at this scale.
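The back-of-the-envelope math behind that figure (averaged over a 30-day month, so ignoring traffic peaks):

```python
# 50 billion requests per month, averaged over a 30-day month.
requests_per_month = 50_000_000_000
seconds_per_month = 30 * 24 * 3600  # 2,592,000 seconds

avg_rps = requests_per_month / seconds_per_month
# avg_rps comes out to roughly 19,290, i.e. about 20k requests/second.
```

Real peak load would of course be some multiple of that average, but even a few multiples of 20k/sec is well within reach of conventional setups.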

👤 jsiaajdsdaa
It's only exhausting when you know deep in your heart that this could all run on one t2.large box.

👤 timka
I think it's more likely the Zeitgeist. You see, someone else finds working in data science frustrating, another person nearing his 40s says he's anxious about his career, another guy says he's worried that it's too late to do something about big tech messing up the field, etc.

I've had similar issues recently, working at a demanding position I didn't really like, even though my achievements may look impressive on my resume. I tried working at a shop somewhere in between aerospace and academia but just didn't fit at all. I ended up joining a small team that I enjoy working with so far, and I feel much better now.

At a higher level, we're hitting the limits of the current paradigm in many ways, including the monetary system (debt), the environment (pollution) and natural resources, ideology (creativity and innovation), and technology (complexity).

The good news is that this year the current monetary system will cease to exist. This will eventually restructure the economy to a healthier balance. Unfortunately, this will have severe social consequences, as the standard of living will change dramatically (somewhere around the 1960s level). This will basically destroy the middle class and thus change the structure of consumption. Obviously, this will mostly affect services and other non-essential stuff we got used to. On the other hand, it will blow away all the bloat, like the insane market cap of big tech. That is, working in IT may become fun again, like 20 years back :)