HACKER Q&A
📣 rozenmd

Does your team have runbooks for alerts?


Particularly interested in how big your company is, and what you use runbooks for.

Does your team use them? If not, why not?

(By runbook I mean a detailed guide on how to perform a process, normally in response to an incident/alert)


  👤 sethammons Accepted Answer ✓
Last company, yes. Stored in the wiki but eventually some teams left a link in the wiki and moved the runbook to the repo (multiple services, each with their own repo).

It started once someone who didn't write a thing could be paged for that thing. We also strove to have all alerts related to our team's services routed to our team, and we rotated on-call weekly.

Standard process was for the on-call engineer to update runbooks as needed. Each runbook had key debugging info, a list of alerts with workarounds or suggestions for digging deeper, links to dashboards, stuff like that. Super useful. A runbook missing info would come up in stand-up or retro. We really relied on good runbooks for things we had yet to automate into self-healing systems.
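
For illustration, here is a rough sketch of that runbook layout as structured data. All field names, the service name, and the URL are made up for the example and are not taken from any real setup; the `missing_info` helper is just one possible way to surface gaps before they come up in retro.

```python
from dataclasses import dataclass, field


@dataclass
class AlertEntry:
    """One alert the service can fire, with guidance for whoever gets paged."""
    name: str
    workaround: str = ""  # quick mitigation, if one is known
    dig_deeper: list[str] = field(default_factory=list)  # next things to check


@dataclass
class Runbook:
    """Hypothetical runbook shape: debugging info, dashboards, and alerts."""
    service: str
    debugging_info: str
    dashboards: list[str] = field(default_factory=list)
    alerts: list[AlertEntry] = field(default_factory=list)

    def missing_info(self) -> list[str]:
        """Names of alerts with neither a workaround nor suggestions."""
        return [
            a.name for a in self.alerts
            if not a.workaround and not a.dig_deeper
        ]


# Example data, entirely invented for this sketch.
runbook = Runbook(
    service="example-service",
    debugging_info="Check recent deploys first, then the error logs.",
    dashboards=["https://dashboards.example.invalid/example-service"],
    alerts=[
        AlertEntry("HighErrorRate", workaround="Roll back the last deploy"),
        AlertEntry("QueueBacklog"),  # nothing written down yet
    ],
)

print(runbook.missing_info())
```

Keeping something like this next to the service code would fit the "runbook lives in the repo" approach, since a gap in the runbook then shows up in review like any other change.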

New place, we have some basic runbooks in the wiki but we have room to grow in this area. Coincidentally, the new company is about the same size as the old one was when we started to invest in runbooks.


👤 catsarebetter
50-100 engs. Yes, we do have a runbook, but it's not frequently updated. From personal experience, most of us search Slack before the runbook, b/c most issues we just debug as a team on Slack, so the records there are usually better.

👤 bckr
Company of about 25.

We use runbooks written in Notion to give step-by-step guides for when things go wrong (which they do, a lot). We're getting better about making sure every alert has a runbook linked, and improving them over time.
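
One way to keep "every alert has a runbook linked" honest is a small check over the alert definitions. This is only a sketch: it assumes alerts are available as dictionaries with a `name` and an `annotations` mapping holding a `runbook_url` key, which is a hypothetical shape chosen for the example, and the alert names and URL are invented.

```python
def alerts_without_runbook(alerts):
    """Return the names of alerts that lack a non-empty runbook link.

    Assumes each alert is a dict with a "name" key and an optional
    "annotations" dict that may contain "runbook_url" (hypothetical shape).
    """
    missing = []
    for alert in alerts:
        url = alert.get("annotations", {}).get("runbook_url", "")
        if not url.strip():
            missing.append(alert["name"])
    return missing


# Example alert definitions, invented for this sketch.
example_alerts = [
    {
        "name": "HighErrorRate",
        "annotations": {"runbook_url": "https://runbooks.example.invalid/high-error-rate"},
    },
    {"name": "DiskAlmostFull", "annotations": {"runbook_url": "  "}},
    {"name": "QueueBacklog"},
]

print(alerts_without_runbook(example_alerts))
```

Run in CI or on a schedule, a check like this would flag new alerts that ship without a runbook.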

What I really want is to use something like DeepNote to have runbooks that automatically gather context from databases and logs.


👤 conradludgate
150 engineers, we do use runbooks.

For any incident, we can open up the runbook to get quick access to Kibana queries, Grafana dashboards, etc. that make our process easier while under pressure.

My team doesn't have any services in use yet, though, so I haven't had to use them. That's changing soon.


👤 rufius
2000+ engineers. Yes - we use runbooks broadly.

Most are in a wiki with a few corner cases using GDocs or Dropbox Paper.


👤 sidcool
Yes. They are immensely helpful. Right now they are static markdown files. But soon I hope to make them more dynamic and possibly executable.
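
As a rough idea of what "executable" could mean, here is a minimal sketch in which each runbook step is either a manual instruction or an automated check. The `Step` type, the `run` function, and the example steps are all hypothetical; the lambdas stand in for real checks.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    """One runbook step: a manual instruction, or one with an automated check."""
    instruction: str
    check: Optional[Callable[[], bool]] = None  # None means a human does it


def run(steps):
    """Walk the steps in order and return a log of what happened.

    Manual steps are only recorded. Automated steps are executed, and the
    run stops at the first check that fails so a human can take over.
    """
    log = []
    for number, step in enumerate(steps, start=1):
        if step.check is None:
            log.append(f"{number}. MANUAL: {step.instruction}")
            continue
        ok = step.check()
        status = "OK" if ok else "FAILED"
        log.append(f"{number}. {status}: {step.instruction}")
        if not ok:
            break
    return log


# Example steps; the lambdas are placeholders for real checks.
example_steps = [
    Step("Confirm the alert is still firing"),
    Step("Disk usage is below the threshold", check=lambda: True),
    Step("Service answers its health check", check=lambda: False),
    Step("Escalate to the service owner"),
]

for line in run(example_steps):
    print(line)
```

Stopping at the first failed check keeps the static instructions as the fallback: the log shows how far automation got and where a person needs to pick up.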

👤 Raed667
Yes, mainly using Notion. Some alerts have links attached directly to them if the root cause is common/known

(reboot that DB, or unstick that Kafka, etc.)