HACKER Q&A
📣 rozenmd

What does your team do during a cloud outage?


If you're using a cloud provider (AWS/GCP/Azure etc), and there's an outage (such as AWS's us-east-1 incident back in December), how does your team respond?


  👤 AdamJacobMuller Accepted Answer ✓
Assuming the cloud is 100% hard down and there's nothing you can do to mitigate the primary issue (which is not THAT common):

Unless you're 100% in a single cloud in a single region, there's always going to be something that broke because it had a dependency on something in the cloud, and you can work to mitigate that.

Events like this create increased workload for external and internal customer service teams, so people who would normally be doing engineering can pitch in there. Internally especially, hearing from senior dev/ops people helps.

Shit-post memes about how there's nothing you can do because the cloud is down.

If the outage is on AWS, make up plans for how you should move the infrastructure off AWS onto GCP because AWS has so many outages. If the outage is on GCP, make up plans for how you should move the infrastructure off GCP onto AWS because GCP has so many outages.

Make up plans about how you can scale to multi-cloud, multi-region deployments to mitigate against the next cloud outage. Once you finish doing the math, realize that you're actually making things more brittle, thus causing more outages, and that nobody wants to pay to avoid the hour or three of downtime per year. Realize that the developer who pushed the bad ALTER TABLE statement into production three weeks ago caused a bigger outage than the cloud ever will, and begin to question your life choices.
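The math this paragraph alludes to can be sketched as a back-of-envelope comparison. Every figure below is a made-up assumption for illustration; plug in your own revenue and infrastructure numbers.

```python
# Back-of-envelope: does a multi-cloud build-out pay for itself?
# All numbers are illustrative assumptions, not real figures.

HOURS_DOWN_PER_YEAR = 3         # assumed annual single-cloud downtime
REVENUE_LOST_PER_HOUR = 10_000  # assumed revenue lost per outage hour

# Expected annual cost of just eating the outage:
outage_cost = HOURS_DOWN_PER_YEAR * REVENUE_LOST_PER_HOUR

# Assumed annual premium for active-active across two clouds:
# duplicated infrastructure, cross-cloud egress, and engineering time.
multicloud_premium = 250_000

print(f"cost of downtime:    ${outage_cost:,}/yr")
print(f"cost of multi-cloud: ${multicloud_premium:,}/yr")
print("worth it?", multicloud_premium < outage_cost)
```

Under these (hypothetical) numbers the premium is roughly 8x the downtime cost, before even counting the extra outages the added complexity can itself cause.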

Play some games with the team where you try to guess what sites/services are impacted. Loser has to stay late to close all the tickets generated when cloud comes back up.

Shit-post more memes.


👤 bradknowles
Disclaimer: I work at AWS, but not in any of the groups that build and maintain any of the infrastructure components. My team is on the B2B applications side of the house.

We can be just as affected by an AWS infrastructure outage as anyone. However, we also get to see some of what's going on internally to fix the problems. So we can at least feel better about not being able to do our own work for the day.

I can tell you this: being multi-region is a lot harder to get right than anyone gives it credit for. Even plain multi-AZ is harder than most people give it credit for. There's a lot of stuff going on under the hood, and all sorts of unintentional dependencies between systems that you didn't realize existed until things broke in just the right way.

Trying to do multi-cloud? I wouldn’t wish that on my worst enemy.

The way cloud-scale works, there’s always something weird going on somewhere. Always.

A well-architected system will try to be rugged and resilient to those failures, but there’s always going to be a limit to the kinds of failure modes you can predict and prepare for.

We saw this when I was the Sr. Internet Mail Administrator at AOL from 1995 to 1997, we see it today in our group at AWS, and my friends at Google report the same.


👤 yuppie_scum
Can only provide communication updates to stakeholders and prepare explanations for why we can't just magically go multi-cloud or lift and shift everything to a different region.

👤 lightlyused
Enter tickets into the backlog, wait for the higher-ups to forget about them, and then do nothing, because we aren't given the people power to actually do all the work.