If you follow the link, you'll see the page appears to be a wall of text, not a simple slide or two. As you read deeper, you'll understand that's an intentional aspect of the report. (I'll also note this is the Columbia disaster, not the better-known Challenger O-ring post-mortem discussed by Richard Feynman in his autobiography[1], though that's a great post-mortem as well.)
[0]https://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=...
[1]https://www.amazon.com/What-Care-Other-People-Think/dp/03933...
Come D-day, my colleague runs the script on a limited group of users (100k or so) to validate it. I forget the details, but something in the script was incorrect and ended up breaking some features for all of those users.
Once reports started coming in, they were super worried and semi-freaked out. A war room was set up that day and all the people involved jumped in.
One of the first things that happened, after we determined what had gone wrong, was to calm them down and reassure them that they weren't in trouble. After that, the group worked on a solution for a few hours and established a plan to fix everything.
I was actually surprised at how well the response was handled. There was no finger-pointing, just a group effort to fix the problem. To me, that's how every problem should be handled, not by instilling fear of losing your job if something goes wrong.
To anyone who wants to leave replies about staging databases, bad dev practices, etc., please don't bother. This was years ago and it was how things were done at the company. Our team was not part of the backend or infra teams and worked with lots of areas of engineering on different issues.
Here's a YouTube video of it: https://www.youtube.com/watch?v=30jNsCVLpAE
Here's the slides: https://gotochgo.com/2017/sessions/86/keynote-debugging-unde...
It goes into detail about a pretty bad outage (an entire data center was brought down), the human aspects, automation, how they handled it, the various risks, architectures, how things fail, and software development in general.
* Honesty about the reality of the situation; no sugarcoating, no spin
* Blameless, factual tone that avoids the passive voice
* Describes technical details at a level helpful for practitioners
* Makes use of other resources as needed (e.g. references to the corporate wiki, external ideas, blog posts)
* Good writing that's easy to read and is free of grammatical ambiguities and spelling errors
That was the best because it was the only postmortem I've seen.
I immediately thought of the old Gamasutra (now GameDeveloper.com) postmortems: interviews with members of the teams behind many classic video games, great late-night reading. https://www.gamedeveloper.com/audio/10-seminal-game-postmort...
I still have a backup somewhere, and the domain names. I could maybe put it back up one day if I could spare the time and find a very cheap hosting solution. It was a WordPress blog.
I realise this doesn't help the OP. I just wanted to vent :-)
"what is the impact of the error: there is an angry mob with torches demanding to get paid outside"
It's a long read, but it gives an insight into how the NotPetya ransomware crippled Maersk and how they recovered.
Looking at how the earlier WannaCry ransomware crippled the crown jewels of many countries highlights a weakness in non-diverse systems.
I even know who is behind them, but I can't prove it, so why even mention it? Because I'm getting closer to proving it, which makes this game all the more interesting; even they have weaknesses they have failed to identify!
The WannaCry weekend was when I met Dame Stella Rimington and Baron Jonathon Evans hill walking at Scafell Pike, and you can call me the world-famous Walter Mitty because people are so obedient to authority.