I'm wondering at what point it becomes essential for a company to give a bit of thought to incident management and start investing in a process or a solution for this.
All of the major SaaS providers have good blog posts and documents on this process. You don't need to automate and build in rostering and shit; just have a Slack channel like #on-call with a message that says "@dave is watching outages, shift finishes at 9AM and hands over to @jess". Then at 9AM @jess acks and posts the same message, naming the person who plans to take over next. It's the current on-call person's responsibility to find a replacement if the next person doesn't show up*.
The main issue you'll have is rostering. If you're super early stage, everyone will be happy to do this. Once you have 10-20 employees, though, you'll need to distribute it fairly so you don't burn anyone out - and agree on an exceptions process up front (e.g. @tu worked until 3AM last night on a feature for customer X, so @sam will take over for them).
* only do this with people who are senior, are comfortable having candid conversations, and won't martyr themselves.
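If you later want a script to nudge this along, it doesn't need to be much. Here's a minimal sketch: a cron job that rotates through a hard-coded roster and posts the 9AM handoff message via a Slack incoming webhook. The SLACK_WEBHOOK_URL variable, the roster names, and the one-slot-per-day rotation are placeholder assumptions, not a prescription.

```python
# Minimal handoff-reminder sketch: round-robin through a roster and post
# today's on-call handoff to an #on-call channel via a Slack incoming webhook.
# SLACK_WEBHOOK_URL and the roster below are placeholders, not a real setup.
import os
from datetime import date

import requests

ROSTER = ["@dave", "@jess", "@tu", "@sam"]  # keep in source control so swaps are visible

def todays_pair(roster, today=None):
    """Pick today's on-call and tomorrow's, rotating one slot per day."""
    today = today or date.today()
    i = today.toordinal() % len(roster)
    return roster[i], roster[(i + 1) % len(roster)]

def post_handoff():
    current, next_up = todays_pair(ROSTER)
    text = (
        f"{current} is watching outages. Shift finishes at 9AM "
        f"and hands over to {next_up}. {next_up}, please ack here."
    )
    requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": text}, timeout=10)

if __name__ == "__main__":
    post_handoff()  # run from cron at 9AM, or by hand
```

Keeping the roster in source control also means swaps (the exceptions process above) show up as reviewable diffs rather than tribal knowledge.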
edit: document your processes and keep them updated; nothing worse than waking up at 3 AM and having no clue who to contact or what to do ))
edit: update previous incidents/post-mortems
When an incident is declared, a Slack room is spun up that auto-includes links to a Zoom call and to the incident response process, and important people get paged. We have someone other than the responding engineer be the "incident commander." Their job is to make sure we dot the i's, cross the t's, follow up on action items, and generally keep the ball moving forward: "Amanda, you were going to pull the db records, how's that going? Does anyone have insights that could help John?" They work with support to get a user-facing message published, and they send out periodic updates so higher-ups know what is going on, with a focus on customer impact. The initial goal is to mitigate impact, then figure out a fix.

During the blameless postmortem, we focus on the processes that failed and aim to remove the human component from failure. We also use these to share how people found information, fixed things, etc. From there, we decide what is a system improvement in need of priority and design work vs. things we can address at an individual or team level. We then have an internal SLO that says we will address the fixes within N sprints, and higher-ups pay attention.
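Once the manual version of this works, the channel spin-up is the easiest part to script. Here's a rough sketch of the "declare incident" step, assuming the slack_sdk package, a Slack bot token, a PagerDuty Events API v2 routing key, and placeholder Zoom/runbook URLs; none of these specifics are from the process above, they're just one way to wire it up.

```python
# Rough "declare incident" sketch: open a Slack channel with the key links
# pinned, and page the on-call via PagerDuty's Events API v2.
# The token env vars, channel naming, and doc URLs are assumptions for
# illustration, not anyone's actual setup.
import os
from datetime import datetime, timezone

import requests
from slack_sdk import WebClient

ZOOM_URL = "https://example.zoom.us/j/000000000"           # placeholder
RUNBOOK_URL = "https://wiki.example.com/incident-process"  # placeholder

def declare_incident(title: str) -> str:
    slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

    # 1. Spin up a dedicated channel, e.g. #inc-20240101-0230-db-outage
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M")
    name = f"inc-{stamp}-{title.lower().replace(' ', '-')[:40]}"
    channel = slack.conversations_create(name=name)["channel"]["id"]

    # 2. Pin the links responders always need
    msg = slack.chat_postMessage(
        channel=channel,
        text=f"*{title}*\nZoom: {ZOOM_URL}\nIncident process: {RUNBOOK_URL}",
    )
    slack.pins_add(channel=channel, timestamp=msg["ts"])

    # 3. Page the important people
    requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": os.environ["PAGERDUTY_ROUTING_KEY"],
            "event_action": "trigger",
            "payload": {"summary": title, "source": "incident-bot", "severity": "critical"},
        },
        timeout=10,
    )
    return channel
```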
This can all be manual to start, and then you can begin integrating with tooling. I enjoyed working with Splunk for log analysis, graphs, and alerts; Prometheus for alerting; PagerDuty for paging; and Jira for tracking betterments attached to incidents. Right before I left my last gig, we started with some SaaS incident/post-mortem tool and it was pretty decent but nothing to write home about. I've forgotten its name and can't find anything similar via Google.
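For the "betterments attached to incidents" piece, a small sketch of what that tracking could look like: each postmortem action item becomes a Jira issue, labelled with the incident ID and given a due date matching the "fix within N sprints" SLO. The JIRA_* environment variables, the OPS project key, and the two-sprint window are invented for illustration.

```python
# Sketch of the "track betterments" step: file each postmortem action item
# as a Jira issue, labelled with the incident ID and given a due date that
# reflects a "fix within N sprints" SLO. The JIRA_* env vars, the OPS project
# key, and the two-sprint window are made-up examples.
import os
from datetime import date, timedelta

import requests

SPRINT_DAYS = 14
SLO_SPRINTS = 2

def file_action_item(incident_id: str, summary: str) -> str:
    due = date.today() + timedelta(days=SPRINT_DAYS * SLO_SPRINTS)
    resp = requests.post(
        f"{os.environ['JIRA_BASE_URL']}/rest/api/2/issue",
        auth=(os.environ["JIRA_EMAIL"], os.environ["JIRA_API_TOKEN"]),
        json={
            "fields": {
                "project": {"key": "OPS"},
                "issuetype": {"name": "Task"},
                "summary": f"[{incident_id}] {summary}",
                "labels": ["postmortem", incident_id],
                "duedate": due.isoformat(),
            }
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["key"]  # e.g. "OPS-123"
```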
As for when you should do this? As soon as your team is big enough that you need to write things down, share information with others, and have something worth improving.