Recently, there've been two discussions on HN [1,2] that have gotten me thinking about this topic again. And now I'm wondering: are there any good books on the topic that you can recommend? I'm not restricting myself to any domain: business, politics, engineering, and natural disasters could all be interesting.
For tech, Dan Luu maintains a list of public post-mortems of tech company incidents: https://github.com/danluu/post-mortems
The authors worked with multiple organizations and celebrities during times of crisis and helped them with the public relations side of it. For instance, they managed Bill Clinton's PR during the Monica Lewinsky events.
The book is not perfect (I think it could be shortened a bit and retain the same information) but it is still very interesting and I think about it every time I witness someone getting themselves into a big, public crisis and making things worse by not managing their PR properly.
There's a podcast they do related to it called The Sharp End and it's worth a listen too: https://www.thesharpendpodcast.com/
Thinking in an Emergency (2012) by Elaine Scarry -- She argues for the importance of planning and procedure. Examples include the Swiss shelter system (civil defense), CPR training, and compacts in rural Canada to deal with grain silo fires. She suggests that careful thought before the emergency is vital for "civilization", or at least for democratic governance. As a converse, the country is destabilized when an opportunist leader comes along, cowboy-style, and says, "This is a crisis, and I'm going to shoot from the hip and fix it". Hence it's both a pragmatic study and a lucid work of political philosophy.
Command and Control (2014) by Eric Schlosser -- This is more of an anti-study: not what to do in an emergency, but the inevitable flaws in complex systems, the limited efficacy of administration, and the inherent failings of human effort. Sort of the mirror-book to Scarry's, but also utterly fascinating. It's a history of disasters of the U.S. nuclear weapons program, and in an effort as huge as that, the disasters certainly exist at scale. (Interestingly, Scarry's book centers on the importance of governance in a nuclear state, or perhaps that democratic governance is not compatible with being a nuclear power.)
I'm reminded of 'The Medical Detectives', Roueche, but only by reputation (I own a copy I haven't read.) "In each true story, local health authorities and epidemiologists race against time to find the clue to an unknown and possibly fatal disease."
If you interpret 'The enemy might get the bomb before we do' as a crisis, 'The Making of the Atomic Bomb', Rhodes, is a detailed (and Pulitzer Prize-winning) examination of how we got from discovering the atom's nucleus to the consequences of deploying city-destroying weapons in a generation or so.
You might find general systems theory interesting, maybe 'Thinking In Systems', Meadows, and/or 'An Introduction to General Systems Thinking', Weinberg.
[1] https://emergency.cdc.gov/cerc/ppt/cerc_2014edition_Copy.pdf https://emergency.cdc.gov/cerc/manual/index.asp
[0] https://www.amazon.com/Checklist-Manifesto-How-Things-Right/...
The term you want for our field is "Incident Response", and the practice of 1) preventing incidents, 2) handling them, and 3) learning from them is Resilience Engineering. It's about investigating airplane crashes, nuclear meltdowns, errors during surgery, etc., and learning how humans keep complex systems running.
I recommend "Behind Human Error" by David Woods as a great starter there. A key insight of this field is that incidents aren't just "some idiot didn't follow the safety checklist", but often the safety checklist itself will cause the issue; at some level the errors happen because of complicated interactions between the system and even the safety mechanisms.
An interesting tech industry related document is the STELLA report [1] from a few tech companies comparing notes on incidents.
It's about leadership during crises and it's based on real events. It tells the story of a small battalion stopping the German army en route to Moscow in 1941. At some point it was required reading in some military schools (in Israel, for example, and maybe it still is).
And as others have suggested, reading flight accident reports, or watching the videos based on them, tends to be valuable.
Also, I think NASA published a bunch of research on human factors at one point, but it's been a long time since I've looked it up.
And last, specific to our industry, the SRE Books have a couple chapters on incident response: https://sre.google/books/
ICS courses are free to anyone who wants them. You can get started with ICS-700 here: https://training.fema.gov/is/courseoverview.aspx?code=IS-700...
There are a few basic principles of ICS that should be useful to company incident response:
1. ICS defines specific roles and their responsibilities. In ICS, there is Planning, Logistics, Operations, Management and/or Coordinator, and Finance, among others. Each of these roles is defined ahead of time, and disaster response teams practice these roles regularly. Each role has defined ways of handing the role off to another person during a shift change and often includes specific forms that need to be filled out. This data collection is integral to being able to review the incident while it's happening, as well as after the fact for improving training.
2. ICS is scalable. For a very small incident, one person may be responsible for all roles. For a very large incident, response may be further subdivided into branches and divisions. This flexibility is an extremely important part of ICS, and it only works because everyone understands the different roles involved.
3. Under ICS, everyone has exactly one boss, supervisor, etc. that they report to. Any of you who have had to go spelunking through logs while multiple suits keep contacting you for updates already understand how important this is. This structure also helps to minimize miscommunication during an incident.
4. In the planning section specifically, there's a process called the "Planning P" that describes a lifecycle of information gathering, decision-making, and communication. It's pretty straightforward and it resolves a lot of common issues in incident response. This is covered in ICS-201: https://training.fema.gov/is/courseoverview.aspx?code=is-201
Companies developing their own incident response strategies will want to customize forms, data collection, roles, etc., but the basic principles of ICS are an effective framework that should be adaptable to a wide variety of situations. Most companies on their worst day aren't dealing with an actual or potential loss of life; experienced ICS people can sleep-walk through a company's worst incident.
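For anyone adapting this to software incidents, the first three principles above (predefined roles, scaling from one person to many, and exactly one supervisor per person) could be modeled roughly like this. It's only a sketch: `Role`, `Incident`, `assign`, and `set_supervisor` are names I made up for illustration, not anything from ICS or FEMA material, and the role list just mirrors the ones named above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    # Role names taken from the list above; real ICS org charts have more.
    COMMAND = "Management/Coordinator"
    OPERATIONS = "Operations"
    PLANNING = "Planning"
    LOGISTICS = "Logistics"
    FINANCE = "Finance"


@dataclass
class Incident:
    name: str
    # role -> person currently holding it
    assignments: dict = field(default_factory=dict)
    # person -> the single supervisor they report to
    reports_to: dict = field(default_factory=dict)
    # handoff log, standing in for the forms filled out at shift change
    log: list = field(default_factory=list)

    def assign(self, role, person):
        """Give a role to a person and record who held it before."""
        previous = self.assignments.get(role)
        self.assignments[role] = person
        self.log.append((role, previous, person))

    def set_supervisor(self, person, supervisor):
        """Enforce the one-boss rule: refuse a second, different supervisor."""
        current = self.reports_to.get(person)
        if current is not None and current != supervisor:
            raise ValueError(f"{person} already reports to {current}")
        self.reports_to[person] = supervisor


# Small incident: one person holds every role.
small = Incident("disk full on one host")
for role in Role:
    small.assign(role, "alice")

# Larger incident: roles are split up, and each person has one supervisor.
big = Incident("region-wide outage")
big.assign(Role.COMMAND, "alice")
big.assign(Role.OPERATIONS, "bob")
big.set_supervisor("bob", "alice")
```

The point of keeping the handoff log inside the incident record is the same as the ICS forms: the data you need for the review is collected as a side effect of running the response.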
I wrote about pandemic bonds last year [0].
[0]: https://as1ndu.xyz/2020/02/fighting-of-disease-pandemics-wit...
[0] https://www.amazon.com/Total-Loss-Collection-First-hand-Acco...
Even if tangentially related, it's an interesting read if only for the historical content.
- Into Thin Air - About a mountaineering expedition that turned into a disaster on Mount Everest.
- Black Hawk Down - The story of hundreds of US special forces trapped in Mogadishu overnight after a mission went completely sideways
- Leadership in Turbulent Times - About different US presidents leading through crisis and how there is no one singular type of leadership
- The Hard Thing about Hard Things - Leading a startup on the verge of failure to an eventual massive acquisition
- The Sledge Patrol - How a small group, outgunned and outmanned, fought back Nazi invaders in Greenland