If you follow the link, you'll see the page appears to be a wall of text, not a simple slide or two. As you read deeper, you'll understand that's an intentional aspect of the report. (I'll also note this is the Columbia disaster, not the better-known Challenger O-ring post-mortem discussed by Richard Feynman in his autobiography[1], though that's a great post-mortem as well.)
[0]https://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=...
[1]https://www.amazon.com/What-Care-Other-People-Think/dp/03933...
Come D-day, my colleague runs the script on a limited group of users (100k or so) to validate it. I forget the details, but something in the script was incorrect and ended up breaking some features for all of those users.
Once reports started coming in, they were super worried and semi-freaked out. A war room was set up that day and all the people involved jumped in.
One of the first things that happened, after we determined what had gone wrong, was to calm them down and reassure them that they weren't in trouble. After that, the group worked on a solution for a few hours and established a plan to fix everything.
I was actually surprised at how well the response was handled. There was no finger-pointing, just a group effort to fix the problem. To me, that's how every problem should be handled, not by instilling fear of losing your job if something goes wrong.
To anyone who wants to leave replies about staging databases, bad dev practices, etc., please don't bother. This was years ago and it was how things were done at the company. Our team was not part of the backend or infra teams and worked with lots of areas of engineering on different issues.
Here's a YouTube video of it: https://www.youtube.com/watch?v=30jNsCVLpAE
Here's the slides: https://gotochgo.com/2017/sessions/86/keynote-debugging-unde...
It goes into detail about a pretty bad outage (an entire data center was brought down), the human aspects, automation, how they handled it, the various risks, architectures, how things fail, and software development in general.
* Honesty about the reality of the situation; no sugarcoating, no spin
* Blameless, factual tone that avoids the passive voice
* Describes technical details at a level helpful for practitioners
* Makes use of other resources as needed (e.g. references to the corporate wiki, external ideas, blog posts)
* Good writing that's easy to read and is free of grammatical ambiguities and spelling errors
That was the best because it was the only postmortem I've seen.
I immediately thought of the old Gamasutra (now GameDeveloper.com) postmortems: interviews with members of the teams behind many classic video games, great late-night reading. https://www.gamedeveloper.com/audio/10-seminal-game-postmort...
I still have a backup somewhere, and the domain names. I could maybe put it back up one day if I could spare the time and find a very cheap hosting solution. It was a WordPress blog.
I realise this doesn't help the OP. I just wanted to vent :-)
"what is the impact of the error: there is an angry mob with torches demanding to get paid outside"
It's a long read, but it gives an insight into how the NotPetya ransomware crippled Maersk and how they recovered.
Looking at how the earlier WannaCry ransomware crippled the crown jewels of many countries highlights a weakness in non-diverse systems.
I even know who is behind them, but I can't prove it, so why even mention it? Because I'm getting closer to proving it, which makes this game all the more interesting; even they have weaknesses they have failed to identify!
The WannaCry weekend was when I met Dame Stella Rimington and Baron Jonathon Evans hill walking at Scafell Pike, and you can call me the world-famous Walter Mitty because people are so obedient to authority.