HACKER Q&A
📣 jeffe

What kinds of engineers handle global cloud outages?


Reading through incident reports of major cloud outages is very informative. Both the scale and the speed at which engineers find the root cause, implement fixes, and safely deploy them are really impressive. I thought about it and realized that even the senior engineers I know don't handle that kind of scale. I'm curious whether anyone knows any of these people/beings (or is one of them) and could speak to what they are like in terms of professional experience and interactions, maybe with anecdotes of wisdom?


  👤 EricBurnett Accepted Answer ✓
Googler here. These people are smart and usually experienced, but don't put them on a pedestal. A lot goes into it - whole teams building the infrastructure with reliability as a core feature, and many hours spent running down all possible single points of failure (so it usually takes a couple of issues combining to produce a big incident). Good monitoring platforms, release systems, etc. Training and practice at incident handling, with many more folk only a page away.

https://sre.google/sre-book/table-of-contents/ is a good source to start with.

And ultimately, perhaps the most critical distinction - opportunity. You only work on a huge distributed system when there's enough customer demand for it to need to exist, and every large system started as a smaller system. That scaling of demand scales importance, which scales the effort invested in its reliability/scalability/efficiency/etc. You can read great stories about the many failures at Google, or Twitter (remember the Fail Whale?), or any other large company. The maturity you see now was developed over time, and any newly hired engineers will be trained into the culture of maintaining and improving it further. With few exceptions, the folk that incepted the big systems back when they were small aren't the ones scaling them out today anyways - it's a very teachable skill, given the need and opportunity.


👤 viraptor
There's no magic here. It's just people experienced with that specific product / tech. Once you get to a large enough scale you get more specialisation, so you do get a person who can deal with debugging X on the fly better than others.

Senior engineers are ok for that.