The outage was significant and had many unexpected knock-on effects for us - despite most of our instances being in a totally separate region - so just wondering what happened.
Any idea how long these post-mortems typically take to be published?
Thanks.
[0] https://aws.amazon.com/premiumsupport/technology/pes/
AWS publishes postmortems for major outages here: [1]
This outage was significant enough to expect a public Post-Event Summary (PES), but it typically takes at least a few weeks for that page to be updated; at least, that has been the trend for previous publications. If this PES is anything like earlier ones, it will include a detailed explanation of the root cause and of what will change to prevent a recurrence, but it will still be technologically abstract, because cloud providers are very secretive about how they orchestrate resources behind the scenes. Enterprise-tier support customers can ask for RCAs, but as I understand it, the account managers don't have any official wording internally yet either. Then again, it's only been 48 hours, and something of this significance will likely have to go through a long chain of sign-offs before official wording is announced.
This kind of silence from AWS before the official wording is published is standard practice for them. They have a large legal team, PR team, and executive team that will all be interested in controlling the narrative, but that's not unusual for large companies in general.
If I were to take a stab in the dark, I'd say we'll see a PES in ~2 weeks. Maybe sooner, maybe later. If they don't publish a PES, I'll be really shocked, because last year's Kinesis outage arguably had a smaller impact but still ended up getting a publication. [2]
[1] https://aws.amazon.com/premiumsupport/technology/pes/
[2] https://aws.amazon.com/message/11201/
Edit: Typos
"AWS uncertain about true cause of outage" https://news.ycombinator.com/item?id=29492120
"Thousand Eyes AWS Outage Analysis: December 7, 2021" https://news.ycombinator.com/item?id=29483487
Edit: Using [0] linked above, I came up with the following statistics...
[0] https://aws.amazon.com/premiumsupport/technology/pes/
15 major events in the last 10 years, including the one from 7 Dec.
Percentage distribution across months of the year:
-----------------------------------------------
Jan:  0%    Apr:  7%    Jul:  7%    Oct:  7%
Feb:  7%    May:  0%    Aug: 13%    Nov: 13%
Mar:  0%    Jun: 13%    Sep: 13%    Dec: 20%
-----------------------------------------------
Book your vacation for Jan, Mar, and May... :-)
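For anyone who wants to check the arithmetic, here is a rough sketch in Python. The per-month counts are not taken from the PES page itself; they are an assumption inferred from the rounded percentages above (with 15 events, 7% is about 1 event, 13% about 2, and 20% is 3). The names `counts` and `month_percentages` are just illustrative.

```python
# Event counts per month, inferred from the rounded percentages above
# (assumption: 7% = 1 event, 13% = 2 events, 20% = 3 events, out of 15).
counts = {
    "Jan": 0, "Feb": 1, "Mar": 0, "Apr": 1, "May": 0, "Jun": 2,
    "Jul": 1, "Aug": 2, "Sep": 2, "Oct": 1, "Nov": 2, "Dec": 3,
}


def month_percentages(counts):
    """Return each month's share of all events, rounded to a whole percent."""
    total = sum(counts.values())
    return {month: round(100 * n / total) for month, n in counts.items()}


percentages = month_percentages(counts)

# Months with no recorded events at all.
quiet_months = [month for month, n in counts.items() if n == 0]

for month, pct in percentages.items():
    print(f"{month}: {pct}%")
print("Quiet months:", ", ".join(quiet_months))
```

With only 15 events, each one moves a month by roughly 7 points, so the distribution is more of a curiosity than a forecast.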
Realistically, it takes time to fully dig into what happened. In a public postmortem you also want to describe your mitigation efforts, and those need to be thought through fully as well.
I expect something in the next few days.
If I were in your shoes, I wouldn't expect one, and if one ever does get published, be suspicious of it. VERY suspicious. Because it's likely missing critical information and/or some portion is likely fictitious to some degree. I mean, this is a company that can't even update its own system status dashboard truthfully, so don't expect any degree of honesty or accountability here.