HACKER Q&A
📣 herodoturtle

Where is the AWS outage post-mortem?


I've been checking in on this page [0] frequently since the recent us-east-1 outage.

The outage was significant and had many unexpected knock-on effects for us - despite most of our instances being in a totally separate region - so just wondering what happened.

Any idea how long these post-mortems typically take to be published?

Thanks.

[0] https://aws.amazon.com/premiumsupport/technology/pes/


  👤 ptcrash Accepted Answer ✓
Let's try to remember the guidelines, be kind, and have curious conversation. Hacker News isn't the place for baseless speculation and generalization.

AWS publishes postmortems for major outages here: [1]

This outage was significant enough to expect a public PES, but it typically takes at least a few weeks for that page to be updated. At least, that's been the trend for previous publications. If the PES is anything like previous ones, it will give a detailed explanation of the root cause and of what will change to prevent the issue from happening again, but it will still be technologically abstract because cloud providers are very secretive about how they orchestrate resources behind the scenes. Enterprise-tier support customers can ask for RCAs, but as I understand it, the account managers don't have any official wording internally yet either. Then again, it's only been 48 hours, and something of this significance will likely have a long chain of sign-offs to go through before official wording is announced.

This type of quietness from AWS before their official wording gets published is standard practice for them. They have a large legal team, PR team, and executive team that will all be interested in controlling the narrative but that's not uncommon for other large companies either.

If I were to take a stab in the dark, I'd say we'll see a PES in ~2 weeks. Maybe sooner, maybe later. If they don't announce a PES, I'll be really shocked because last year's Kinesis outage had arguably smaller impact but ended up getting a publication. [2]

[1] https://aws.amazon.com/premiumsupport/technology/pes/

[2] https://aws.amazon.com/message/11201/

Edit: Typos


👤 belter
Also waiting to know more details. In the meantime, you might want to check these:

"AWS uncertain about true cause of outage" https://news.ycombinator.com/item?id=29492120

"Thousand Eyes AWS Outage Analysis: December 7, 2021" https://news.ycombinator.com/item?id=29483487

Edit: Using [0] linked above, I came up with the following statistics...

[0] https://aws.amazon.com/premiumsupport/technology/pes/

15 major events in the last 10 years, including the one from 7 Dec.

Percentage distribution across months of the year:

----------------------------------------------

Jan:  0%   Apr:  7%   Jul:  7%   Oct:  7%

Feb:  7%   May:  0%   Aug: 13%   Nov: 13%

Mar:  0%   Jun: 13%   Sep: 13%   Dec: 20%

----------------------------------------------

Book your vacation for Jan, Mar, and May... :-)
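
For anyone who wants to check the arithmetic, here is a rough Python sketch of how percentages like the ones above could be produced. The per-month counts are an assumption: they are back-calculated from the rounded percentages and the total of 15 events, not copied from the PES page, and the names `counts` and `month_percentages` are made up for illustration.

```python
# Assumed event counts per month, inferred from the rounded percentages
# in the comment above (1 of 15 is about 7%, 2 of 15 about 13%, 3 of 15 is 20%).
counts = {
    "Jan": 0, "Feb": 1, "Mar": 0, "Apr": 1, "May": 0, "Jun": 2,
    "Jul": 1, "Aug": 2, "Sep": 2, "Oct": 1, "Nov": 2, "Dec": 3,
}


def month_percentages(counts):
    """Return each month's share of all events, rounded to a whole percent."""
    total = sum(counts.values())
    return {month: round(100 * n / total) for month, n in counts.items()}


percentages = month_percentages(counts)

# Months with no recorded events in the assumed counts.
quiet_months = [month for month, n in counts.items() if n == 0]

for month, pct in percentages.items():
    print(f"{month}: {pct}%")
print("Quiet months:", ", ".join(quiet_months))
```

With only 15 events, a single outage moves a month by about 7 percentage points, so the distribution is more of a curiosity than a forecast.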


👤 exabrial
Pro tip: Never mark your services down on the status page and you don't have to write an RCA!

👤 thecrumb
Takes a while to get the creative fictional juices flowing.

👤 stunt
It's not that simple. The incident responders and the teams that own the affected pieces have to write the initial draft and review it, perhaps in two versions: a more detailed one for internal use and another for the public. There are multiple rounds of review, and then it goes to a separate team that proofreads the public material. Meanwhile, they also have to measure customer impact and sometimes even inform high-tier customers about the incident and how it affected them.

👤 cyounkins
Conveniently most of AWS' published postmortems do not have publish dates. Some are for the same day, some say "earlier this week".

Realistically, it takes time to fully dig into what happened. In a public postmortem you also want to describe your mitigation efforts, which you also need to think through fully.

I expect something in the next few days.


👤 brycewray
In this case, probably around 4:55 PM Eastern Time on Friday, December 24.

👤 MrWiffles
Just to chime in here and maybe help explain some of the more snarky responses...

If I were in your shoes, I wouldn't expect one, and if one ever does get published, be suspicious of it. VERY suspicious. Because it's likely missing critical information and/or some portion is likely fictitious to some degree. I mean, this is a company that can't even update its own system status dashboard truthfully, so don't expect any degree of honesty or accountability here.


👤 dandigangi
Probably within a month. Maybe a small post about it while they get things together more seriously.

👤 1cvmask
Cat got stuck in the server room.

👤 Bluecobra
My guess is that the person that broke Facebook for a day got hired at Amazon...

👤 imwillofficial
The COE is in progress.

👤 nix23
Nothing happened. You can go away now.

👤 gukov
Public post mortems are usually done by growing companies that have something to lose. AWS is too big to “lose” anything.

👤 la6471
It takes 5 business days.