HACKER Q&A
📣 ymnska

Any luck negotiating better terms for on-call?


(Throwaway account.) I work for a large software company on an on-call rotation that’s been getting more toilsome, and wondering if anyone has been in a similar place.

Like many SV companies, on-call isn’t compensated with the rationale that it’s part of your engineering duties. I buy this to some degree—someone does have to be keeping an eye on things—but it's complicated by sizable inequities across the org. _Most_ people have no on-call rotation, many others have a token rotation that’s ~never used, and only a handful of teams have rotations that are quite bad. Management has extricated themselves completely.

Things have been angling slowly worse. In a gambit to prioritize uptime over engineer time, we have more alarms, tighter tolerances, and a larger operation that generates more tail problems. Good for users, but not so good for us. Being able to sleep fully through the night is increasingly rare. There are some false positives, but most are not, and not easily fixed by more engineering.

Expected response time has dropped to the low single digits; theoretically, you should not be exercising or driving if you’re on. The scheme works because many engineers are in their 20s and willing to soak up pain like a sponge. Rotations tend to get smaller over time as single people make backroom deals to get out, and new blood is added too slowly.

I’m not trying to get myself out, but want to effect some kind of change. IMO compensation or extra time off would be ideal: not only is it a nod to the cost of on-call, but it also makes exchanging shifts easier by adding incentive beyond simple goodwill. The company could easily afford it, but probably doesn’t want to pay for what it can get for free.

I have frequent conversations with my manager and get token “yeah, we’re looking into it”s, but it’s obviously not a priority for anyone up the chain. Has anyone else been in a similar position? Are you paid? What did you do? Suck it up? Leave?


  👤 encoderer Accepted Answer ✓
Former SV engineering manager and engineering director here:

Have you had a conversation with your skip-level manager? If so, then you are probably right that it's not valued up the chain and you should leave because that is a total shit show that is not the norm.

If you haven't, reach out for time on their calendar, and write down your data points on on-call wake-up rates and total alarms over time, and let the data make the point that this is not sustainable.
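To make those data points concrete, here is a minimal sketch of how wake-up rates could be tallied from a page log. Everything in it is an assumption for illustration: the `PAGES` timestamps are made up, the 23:00 to 07:00 sleep window is arbitrary, and `summarize` is a hypothetical helper, not part of any paging tool.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical page log: one local-time ISO timestamp per page.
PAGES = [
    "2021-03-01T02:14", "2021-03-01T14:30", "2021-03-02T03:50",
    "2021-03-04T23:41", "2021-03-09T01:05", "2021-03-09T04:20",
    "2021-03-10T11:00",
]

NIGHT_START, NIGHT_END = 23, 7  # assumed sleep window: 23:00 to 07:00


def is_night(ts):
    """True if the page landed inside the assumed sleep window."""
    return ts.hour >= NIGHT_START or ts.hour < NIGHT_END


def night_of(ts):
    """Date the night began on, so two pages in one night count once."""
    return (ts - timedelta(days=1)).date() if ts.hour < NIGHT_END else ts.date()


def summarize(pages):
    stamps = [datetime.fromisoformat(p) for p in pages]
    night = [ts for ts in stamps if is_night(ts)]
    return {
        "total_pages": len(stamps),
        "night_pages": len(night),
        "interrupted_nights": len({night_of(ts) for ts in night}),
        # Keyed by (ISO year, ISO week number).
        "pages_per_week": dict(Counter(tuple(ts.isocalendar())[:2] for ts in stamps)),
    }


summary = summarize(PAGES)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Counting interrupted nights separately from raw pages matters here, since one bad night with five pages and five nights with one page each are different arguments to bring to a skip-level.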

The Director should have some options. How big is the rotation? Is the manager in the rotation themselves? When you're on call are you also expected to contribute story points to the sprint? Why are you not able to solve underlying engineering issues that are causing the SLO violations?

If you came to me, I would be shocked, and immediately make a plan with the engineering manager. Any time a person is woken up by an alarm, it's an incident. There needs to be a response to every incident. There needs to be some serious bar-raising and you can't do it yourself. You need an ally in your management chain and if you don't have one, you're better off transferring teams or companies.


👤 jrockway
At big companies, it should be possible to have enough engineering offices to staff oncall with normal 8 hour days. (8 hours in Tokyo, 8 hours in London, 8 hours in San Francisco, or similar.)

At startups, it's harder; you simply can't have 3 dev teams on 3 continents, so someone is going to have to be around at night. The balance we found at the company I work at is that you are oncall Thursday-Thursday and get Friday off. (Not "not oncall", but "don't show up".) This seems fair to me; the free vacation day is really nice! (We're hiring! https://pachyderm.io/careers/)

When I was at Google, I worked on Fiber and we didn't have a dev presence around the world, so we had to be oncall after hours. We had a dedicated operations team with people paid to be at work during strange hours, as you'd expect from an ISP, but some issues were escalated to the dev team, and we had to be around for those. I was also the TL for a monitoring system that informed operations of outages, so my team would need to be around to handle monitoring monitoring ;) We just got paid extra for every hour we were oncall, I remember it being something like $1600 per week, but I forget the exact number. I was happy with this arrangement. Other people weren't, and weren't asked to be oncall, and it didn't count against them in any way. It all seemed fair to me.


👤 annoyingnoob
I've had on-call duties in most of my positions for over 25 years. In my experience, if the experience for the on-call person is terrible then you probably need code/infrastructure improvements to make it more stable and/or more hands to do the work.

If the company is not allocating the proper resources to the issue and it's affecting you personally, then you need to leave. You have a business relationship with work, don't let it become personal.


👤 toomuchtodo
Leave.

Speaking only of your situation, your company isn’t going to appropriately comp you for the on call burden, and they’re going to string you along (“we’re working on it”) as long as they can. If you stay, you will continue to suffer, and unless your comp is exceptional, it doesn’t appear to be worth it.

They might change after enough folks burn out and/or leave, but that’s not within your control. Your quality of life is within your control.


👤 smileysteve
> on-call isn’t compensated with the rationale that it’s part of your engineering duties

This cuts both ways: your team and management should be aware that you aren't available for normal work capacity when you're on call.

> theoretically, you should not be exercising or driving if you’re on

This can't possibly be sustainable. Your company needs to have someone else available: a backup in case one person misses an alert, people for at least the other 2 shifts, and someone who can cover while you're driving, eating, exercising, or using the facilities.

Your company is just lying to itself if it believes it has any coverage.

> In a gambit to prioritize uptime over engineer time, we have more alarms, tighter tolerances, and a larger operation that generates more tail problems. Good for users, but not so good for us.

This sounds like the crux of the problem. Your company has prioritized rapid fixes over sustainable engineering. The bandaid may be repeatable, but that doesn't make it sustainable with growth. The simplest solution is that for every hour spent on call, twice as much time should be spent resolving the tech debt that leads to such a situation.

> IMO compensation or extra time off would be ideal

I think that you should negotiate this based solely on the fact that you can no longer sleep. In other words, you should take days off for every night you work.


👤 ublaze
We have a pretty tight oncall (5 min response time).

I think the steps you can take are:

1. Make it clear to your manager this is unacceptable, and you will end up looking for alternate teams/jobs if this goes on

2. Make the same thing clear to your skip level

3. Quit / change teams, citing oncall as the issue

There's no point in doing anything else, in my experience. It's someone else's job to make sure that your oncall experience is prioritized. It sucks to leave an otherwise good job.

For extra credit, try to propose some solutions. Why are some issues not solvable by engineering? Would simply resetting expectations mitigate the largest issues/waking up at night?


👤 serial_dev
You need to pressure them. They have no incentives to change this system, so give them incentive.

There are so many red flags in your post, if I were you, I'd start looking for my way out immediately (I don't know your personal situation so ymmv). Company culture changes slowly, and based on the number of red flags, I'd say it's probably easier to leave and find something better. If you are staying because you like working with some people, ask them to consider joining the same company where you get hired. If you try to change the organization, be prepared that people will stop liking you because you are not the obedient little code monkey anymore and they won't like that you don't let them exploit you anymore.

> on-call isn’t compensated with the rationale that it’s part of your engineering duties

That's just wrong, do not let them convince you this is normal. If they want you to be available at all times, then they should pay. If they don't want to pay, tell them you won't do on-call anymore.

> In a gambit to prioritize uptime over engineer time

I see they are very generous with your time.

> Being able to sleep fully through the night is increasingly rare.

Again, it's a sign that your system is unstable. You need to ask them to prioritize fixes to these issues, even if they can't be solved easily. Take a good look at how development is organized. Do you have automated tests, code reviews, knowledge sharing? Are you always working on features and ignoring bugs? Running systems should not be this hard.

> yeah, we’re looking into it

This is an acceptable answer exactly once.

> The company could easily afford it, but probably doesn’t want to pay for what it can get for free.

I didn't want to be philosophical but: Power concedes nothing without a demand.


👤 dGhyb3dhd2F5
My suggestion would be to take care of your health. There are companies that can afford to pay you for each and every extra hour of on-call, but they won't necessarily look out for your well-being. Having extra time off sounds nice, but on the other hand it will affect the other stuff that you usually have to do at your job.

I was hit by on-call duty pretty hard at some point in my career. I was sleep deprived and was not able to execute on my regular tasks. This also led me to depression and increased my anxiety. Even though I had started to work on my issues with a therapist, I was not able to recover and was let go.

Remember about taking care of yourself.


👤 lmilcin
I gave up this idea for a couple of reasons and instead am negotiating for the whole package, which I just assume includes some amount of after hours / on-call work.

After a lot of thought, asking for on-call pay is a loss for me. First, there are many people willing to not ask for it. Second, I don't want to be associated with the ones that do. Third, even in the best case the on-call pay doesn't seem to compensate me enough for the time spent. Fourth, it makes it hard to negotiate my base salary, which is where the money is. Fifth, it creates an unhealthy motivation to spend even more time at work rather than be more efficient with it (for example, working to create an environment where I don't have to be on-call or have to spend less time after hours in general).

So, instead, I am showing I take ownership of the area I am working in, I am willing to sometimes decide that the project requires me to spend some extra time, that I am happy to do what is needed to get the job done, and I try to sell it to my company as a complete package.


👤 hkrgl
The fact that it's not a priority up the management chain is a red flag. Try speaking with the skip-level manager or higher to see if you get some attention. At the very least, I would ask for days off to compensate for a bad rotation. Is fixing root cause prioritized? If you are just adding more alerts and not prioritizing fixing underlying issues, that's another red flag. Try to see if you and your team members can get your management to prioritize this. Having a group of people asking for this can be more effective than just one person. If you have enough leverage, push back on new feature work until fixing oncall problems becomes a priority. If all of this fails, leave - there are any number of places where oncall is not as much of a burden and/or you are compensated for your time. Good luck!

👤 uberman
I worked as a developer and consultant at a big name SV company and was not compensated for "on call" time and in the end, not knowing any better just "sucked it up".

Now I would definitely ask some questions if I felt that this responsibility was falling disproportionately on my shoulders.

Do all employees at my current level have the same on call responsibilities and schedules? If not, how are custom schedules arrived at?

Do more senior employees work on call rotations and if not, at what job level are they excused?

Are a couple that seem very reasonable to me.


👤 M2Ys4U
If I were you I'd go have a talk with your union rep, especially as this is impacting multiple teams. They can then raise this higher up the management chain.

👤 paktek123
This should not be taken lightly: your life is affected by this, since you mentioned you cannot sleep at night, exercise, or drive. As the company advances, your duties will increase and your personal time will disappear.

Best thing would be to raise this with your manager, if no real action is taken then leaving or changing team is an option.

Being oncall and paid for it is much better. Here your personal time being lost with no compensation is simply not worth it. In fact if you don't respond in time it may reflect badly on you.

There is an approach that can be taken to focus sprints on only improving oncall, but it requires management buy-in. How bad is it? Is it something out of your team's control, or is it something you could fix for good by spending an extra hour on it?


👤 icedchai
At a previous company, on-call was optional and we were compensated for that week when we were on call (an extra $500, IIRC.) About 50% of the team of 12 participated, meaning your week came up basically every month and a half. This seemed fair.
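As a rough check on that arithmetic, here is a small sketch. The function name is made up for illustration, one-week shifts are assumed, and the yearly total is only what the $500 figure would imply if the rotation ran evenly all year.

```python
def weeks_between_shifts(team_size, participation, shift_weeks=1):
    """Weeks from the start of one of your shifts to the start of your next."""
    participants = round(team_size * participation)
    return participants * shift_weeks


interval = weeks_between_shifts(team_size=12, participation=0.5)
shifts_per_year = 52 / interval
yearly_extra = round(shifts_per_year * 500)  # $500 per on-call week, as recalled above

print(f"on call every {interval} weeks, about ${yearly_extra} extra per year")
```

Six weeks between shifts matches the "every month and a half" above; halving participation again would put each volunteer on call every three weeks.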

If your company doesn't want to pay for the aggravation, put your phone on silent, make sure alarms escalate to your manager, and start looking for another job.


👤 uhtred
Companies should have permanent support teams that work shifts, and all they do is support work. That way people who are happy to do shift work (i.e. work nights sometimes) and get paid more for the trouble can do that, and regular engineers can not have the stress of having to be on-call. I think it's absolutely ridiculous that I have to be willing to get woken up at 3am and work on something, and then do a day at work the next day.

👤 bartvk
Sometimes, it's difficult to get paid extra due to organizational barriers. Then a reasonable option is to get time off. For each hour of sleep you lost, perhaps you stop working 2 hours earlier that day. This need not be discussed, you can simply tell your manager.

Otherwise, a little extra involvement might be necessary. Ask for your manager's private phone number. When there's a problem, share it with them. In the middle of the night. They may gain some new insight into the difficulties.


👤 soperj
Where I work, we're paid 1/3 of our regular rate for every hour on call. People are much less likely to want you to be on call for no reason when that's the case.
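To give a feel for what a scheme like that costs the company, here is a back-of-the-envelope sketch. The $60 hourly rate is purely hypothetical, and so is the assumption that on-call hours are every hour of the week outside a 40-hour work week; the actual policy may count hours differently.

```python
from fractions import Fraction

HOURS_PER_WEEK = 7 * 24


def weekly_on_call_pay(hourly_rate, regular_hours=40, on_call_fraction=Fraction(1, 3)):
    """Extra pay for one week on call, at a fraction of the regular hourly rate."""
    on_call_hours = HOURS_PER_WEEK - regular_hours  # assumed: all non-working hours
    return float(on_call_hours * hourly_rate * on_call_fraction)


print(weekly_on_call_pay(60))  # hypothetical $60/hour engineer
```

Under those assumptions a single on-call week costs a four-figure sum per engineer, which is the kind of number that makes people ask whether a rotation is really needed.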

👤 lacker
Extra compensation or time off is usually a bad idea for on-call responsibilities, because it puts the wrong incentives in place. Teams should be working to improve their infrastructure so that on-call is less painful, not lobbying for additional pay because someone built a hard-to-maintain system.

Sometimes this is an "up the chain" type problem, but if the other engineers on the team don't agree with you that the on-call rotation is too painful, it's going to be hard to convince management that your judgment is correct.

If you don't want to simply switch teams, my suggestion is to think of what engineering work you can do in order to improve the on-call experience. Then propose that you work on these projects, to your manager. Quantify the amount of engineering time and increased reliability your projects will save. In my experience it is far easier to get management to agree to a specific plan to improve the situation than to get management to find someone else to solve a problem for you.

Another idea - since you work at a large company, there are probably teams who handle this very well at your company. Infrastructure teams who have scaled components that in the past have been overloaded and now are widely used within the company, that sort of thing. Try asking for advice in a "horizontal" way, finding experts on other teams and asking how they have solved these issues in their teams. These "horizontal" experts will be able to give advice that's specific to your company. This is especially true if your team is working on a product area and your coworkers are not specialists in making reliable systems, but your company has infrastructure specialists on other teams.


👤 plasma
Fundamentally I’d suggest you share this pain with your team, including any product managers and management/business development teams.

Change your thinking and approach.

You can do this by culturally re-prioritising the development team's workload to fundamentally treat the root causes for any outage and regular alerts as urgent to be resolved.

The work needed to fix the root cause gets to kick something out of the current sprint to be attended to immediately.

The dev/product team should fundamentally agree the alerts should be rare, not regular.

Instead of just tweaking alarms, and feeling beaten down at the regular issues, change your thinking to tackle the root causes and fix them, just like any bug or new feature.

You’ll become excited that you’re solving the issues.

By having this shared understanding in the dev team to always be resolving root cause of outages, including architecture restructures and rebuilds of components that take weeks or months, you’ll reduce these incidents dramatically.

Finally, by doing this, you share the pain with everyone else - product managers and business leads don’t get their features or other improvements as fast, they now see what you deal with, they’ll ask why things appear to have slowed down, and you can now say you need more resources.


👤 grogenaut
I've seen this on many teams. There are several other options you are not looking at. One is changing the systems so that on-call becomes way less burdensome and much more automated. Not sure if this is an option or not. This isn't easy to implement (eg I've seen many engineers misunderstand the problem and focus on cool tech often, this isn't a blank check) but it's a great option and one I've seen get people promoted in the long term.

I've also seen teams where this festered and no one fixed it. I usually got called in in the end to fix it. Often the engineers weren't even talking to the managers about the issue and that's all a fix took, a solution that wasn't just more money or more people. It also helps if you can come up with a basic cost benefit analysis in terms of wasted dev time that could be used for something else. This is a language managers speak.

You should really consider and discuss with your manager several of the options in the comments: pay, sleep replacement time, more people on the loop, better automation, tech debt work that is focused on burning down the most common pages, etc. It's never a great idea to show up with only one possible fix, especially when that's "pay me more". They may not be able to, or may not think you are worth it, and then your option is leave or deal. If you have quite a few more options maybe a compromise can be reached.

Engineers just suffering in silence and then quitting in anger is really the worst option tho. So open a dialog if you have not about other options.

As some others said, if you're not getting traction, also talk with your skip... you are meeting with your skip right? But don't come to them with problems and gripes. Come to them with possible solutions and get their advice on those solutions, and be open to their suggestions as well.


👤 kenrose
I'm probably piling on, but wanted to echo a lot of what's been said.

On-call responsibilities are supposed to be a two way street between an employee and an employer.

Employers expect employees to be on-call and handle production incidents quickly. That's good for the product.

The two way side of it is that employees must have the autonomy and time to fix the root causes of what's paging them to reduce toil.

This is the root of "you build it, you own it". "Own" means having autonomy.

That kind of engineering work does come at the expense of feature delivery. However, it's also good for the product.

Regarding getting paid more for going on-call, from your description, the issue doesn't sound like it's a financial one. If you received $X00 per week more, would that be an acceptable tradeoff for the constant anxiety of your phone paging you at any time or waking up at least once per night?

(source: am ex-PagerDuty and founded a company to help drive software ownership, so I've thought a lot about this)


👤 pipingdog
First of all, on-call without the latitude to prioritize operational improvements (making the cause of the pages go away) over feature work is a non-starter.

To address the question in the post title, a team I was on was able to re-negotiate the on-call terms. Our team didn't have any operations to speak of (we just wrote software, and didn't build services) so we were lumped into a rotation for the org we were in. When the pager went off, not only did we not have any familiarity with the system, we didn't have permissions to do anything anyway. We just ended up having to page someone else for every little thing.

We ganged up on management, told them that we simply were not empowered to take any actions during shift to address issues or off shift to improve things, and got taken off that rotation.

Where I'm at now, if someone has a rough night or a couple of rough days, we'll trade part of the shift to give the person a break.


👤 jedberg
> There are some false positives, but most are not, and not easily fixed by more engineering.

This seems like the crux of the issue. It sounds like there is a long tail of issues that are hard to fix but have large customer impact. Or do they?

If these long tail issues didn't get fixed, how much revenue would it cost? Figuring that out seems key. If it's a lot of revenue, then it would make sense to spend the time to do the hard engineering fixes. If it's not a lot, then it makes sense to let you sleep.
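A sketch of that comparison, with loud caveats: the issue names and dollar figures are invented, and treating "a lot of revenue" as "more than the fix would cost" is my own simplification, not a rule anyone has to accept.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    name: str
    incidents_per_year: int
    revenue_lost_per_incident: float  # estimated loss if nobody responds until morning
    fix_cost: float                   # estimated engineering cost of a real fix


def triage(issue):
    """Fix it if the revenue at risk justifies the work; otherwise let people sleep."""
    at_risk = issue.incidents_per_year * issue.revenue_lost_per_incident
    return "fix" if at_risk >= issue.fix_cost else "stop paging at night"


# Hypothetical long-tail issues.
issues = [
    Issue("queue backlog", incidents_per_year=40, revenue_lost_per_incident=5_000, fix_cost=120_000),
    Issue("stale cache", incidents_per_year=25, revenue_lost_per_incident=200, fix_cost=60_000),
]

for issue in issues:
    print(f"{issue.name}: {triage(issue)}")
```

Either answer takes the page off someone's nightstand, which is the point: the only outcome this framing rules out is "expensive to ignore, not worth fixing, keep waking people up".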

> Management has extricated themselves completely.

This is a big issue too. If the problems warrant waking you up, they should be serious enough to involve management. If they aren't, then it sounds like they're waking you up for no reason.


👤 dustinmoris
Sounds like your company should hire people from different time zones and have the on-call follow the daylight so that teams can fulfill that responsibility during normal work hours, especially when expected response time is in single digit territory and on-call isn’t paid extra.

But overall it sounds like the company for which you work is a complete joke that doesn’t care about employee health and you should leave them asap.

Good luck! Don’t forget, engineers are in high demand across the globe!


👤 parentheses
On call being too eventful is a bug (arch, infra, code). The solution is to propose that every wake is responded to as something that must be prevented going forward.

The usual incident review and postmortem process can be applied. If wakes happen too often for that, you can start by applying the process to some subset of them.

Firefighting is a waste of talented technical resources and results in good people leaving.


👤 rfreiberger
I'm not in development or engineering directly but work in an operations role (mostly Puppet and Terraform work) where I've been oncall for the majority of my career. One thing that is common when things get really bad is the mindset that "oncall doesn't count towards my role at the company". Many people see it as the painful work of cleaning up while the others are out building the next big thing. So it's easy to see people jump into the shift, deal with the mass alerts, then leave without making any improvements for the next guy or gal.

One way we have been trying to improve this is working with PagerDuty reporting and looking at the total amount of interruptions (not just pages but anytime PagerDuty reminds you for an alert/expired snooze/escalation) with the team. It's very easy to forget the oncall as you leave, but having more eyes on the shifts starts to bring awareness and lots of "why is that still broken" questions that are better answered at 10am vs 3am on a Sunday. I came from a large Operation Center so I know the pain of bad alerts, mostly cya stuff where it was put in place just to make sure the last guy can't get blamed. Sort of like adding 100's of random smoke detectors in a building without any fire suppression. The intention is good but the results are poor.

Outside of the meeting with the team, we also have proper handoff meetings with off call and on call, so they can share what's going on verbally instead of tagging the next person with the alerts. Makes it easier to share what's going on, any weird problems, notes. Also we're not using a 24/7 oncall coverage but 12/5 and 48/2 for the weekends, it's a small change but helps so much. The worst I ran was a 7/24 at a major email company and was paged every three hours, for the entire week. After that I knew the team didn't want to change and I needed to do something about it.


👤 jokethrowaway
I'd quit, there are plenty of companies that don't do that. Look into contracting maybe, I've never heard of a contractor being on call.

Sleep matters.

I wonder if maybe it's a FAANG thing that cheap startups try to copy?

We were on-call (a week every 3 months) with my last employer but it wasn't too bad and it was spread equally across people. It wasn't compensated but during the on-call you didn't do any product work, just improving monitoring and alerts so that being on-call didn't suck & recovering if something happened at a bad time.

Still, being on-call sucked because we had too many stupid monitors checking on trivial things that weren't important and that people were too afraid to touch.

A few other companies I've been at just have a dedicated infra team on-call, which gets paid more.

This sounds like a fair solution. I would've liked the extra money as a youngster and now I would gladly avoid messing up my limited sleep.


👤 rednerrus
If the company you work for doesn't take your time seriously, it's time to look elsewhere. I did what you're doing for a couple of years at a company that just didn't have the leadership it took to get things fixed. I was miserable every day. Partly because on-call sucked but more importantly because I was working on a project that felt like I was shoveling quicksand. Your time, your energy, and your self-esteem are worth too much to be shoveling quicksand for people who don't understand how to do things the right way.

I'm now at a company that takes these things seriously and I'm learning a ton of good habits and I feel really proud of the work that I'm doing.


👤 maybenotafart
personally, I just outright refuse to be on call. That cheap bastard needs to hire an SRE. I explicitly ask the question in interviews, I let the manager know I have zero interest in doing such a thing. and when they force my hand, I go somewhere else.

👤 RA_Fisher
Incidents are unplanned investments. They’re a decision by leadership and sap the company’s prospects. Certainly negotiate. Maybe it’s your only job? Maybe it’s a 50% pay boost? Ideally you can show the company a better way: keep records of incidents, make statistical measurements to set expectations and then change from the reactive to proactive stance. The idea is to make things sufficiently reliable _before_ the customer experiences the incident (not after!). However, on the margin, if you’re not being sufficiently compensated, you might need to find wiser leadership.

👤 wieghant
Left and wished I had done it earlier. Once I gave my notice, that's when they gave a counter offer but I had made up my mind.

Cost the company upcoming tenders and my entire team soon followed. Not something I'm proud of. But priority 1. should be personal wellbeing.

One thing I learned from this is that I fail to convey the severity of issues to certain types of managers. Framing the issue in dollars helps in those cases.

And if it is understood and still ignored. Welp, time to move on.


👤 giantg2
The teams I've been on have a rotation that is pretty fair, but unpaid. Most technical members are required to be on the list, with a few reasonable exceptions. If there's someone who isn't in the list, people will joke about them not being on the list and that seems to put pressure on adding them (this won't work in all company cultures). It's usually 1 week every 6-10 weeks depending on team size.

👤 mcv
I'd make sure you discuss this with other engineers who are on-call. If everybody agrees, you could declare together that from now on, on-call is your only responsibility and you won't be taking up other engineering duties.

Of course management could fire you for that, but then they don't have anyone on-call anymore, and they're unlikely to find new engineers willing to add it as an unpaid responsibility.


👤 seanwilson
One option for some is to go freelance and make it clear when you take on contracts that you're not available on call (the client's team or another freelancer you work with could step in for example), or at the least you can negotiate fair compensation for it.

Not the solution for everyone but people overlook the option of working for yourself.


👤 blaser-waffle
I had to take a pay cut and switch to a purely service delivery / provisioning role before I got out of On-Call; same org, different role.

~7 years and a couple of jobs later -- totally worth it.

Just make sure you're using that newly found freedom wisely; don't fight for freedom and then do 3 hours of uninterrupted netflix every night.


👤 nailer
Was in a team of around 12 engineers for a XXXM AUD dollar telco project a decade ago.

Customer (Telstra, Australia's national telecom) wanted us to sleep on site.

We were contractors, so we all agreed we would say yes, provided we were paid for 24 hours.

The customer decided they didn't need us to sleep on site.


👤 sys_64738
I used to do on call where each time I got called out was a three hour charge. It's great when you're young and single but not any more. I would never do on-call for 'free'. My time is more valuable and I work to live.

👤 imposterr
You didn't specifically list where you're based, but worth checking local laws. Some places have requirements for on-call work and companies simply get around it because most employees don't know.

👤 2rsf
The union was called for help, but this is Sweden, where a union is not something to be afraid of and employers actually talk to them with respect.

👤 3minus1
> Being able to sleep fully through the night is increasingly rare.

Are there actual user-facing issues occurring every night? That sounds extremely bad and unusual.


👤 eecks
What is the 'punishment' for not getting to a call on time?

👤 pertymcpert
Why doesn't the company hire someone in another time zone?

👤 aflag
What's an SV company? I googled for SV companies and Software V companies and found nothing. Many companies seem to have SV in their names, though.

👤 saargrin
how is this not even obvious?

i never worked in a company that expected uncompensated on call

even those that had no official policy typically had a lenient look-the-other-way approach so you can, say, take half a day off if you got called up at night

people gotta realize engineer/support churn is more expensive long term than giving people a fair deal


👤 temikus
If you can - leave. There’s plenty of companies with paid on call or more lenient policies.

👤 renewiltord
You can change things if you have power and demonstrate that it'll work. Otherwise, leave.

👤 rajacombinator
Sounds like a sweatshop. Just leave.

👤 t-writescode
I have worked for 3 separate companies in my decade of software development where I have had to be on-call. One of these companies was an organization at a very large and prominent software company.

The different on-call rotations worked out thusly:

1. on-call was 1 month long. Response times had to be very short. During business hours, there was a large queue of long-tail work that needed to be resolved that was outside my normal work. Most of the employees here were in their 20s and 30s, probably.

2. Small company. Probably 30 devs total. I was on a team of 1, 2 and eventually 3 people. on-call was 24/7 for my team. Response time was about an hour. I was the youngest employee and most employees here were in their 40s or beyond.

3. Smallish company. < 500 employees. Dev team size of 6ish. On-call is a week-long venture. Turnaround time is very short, I think 30 minutes? On-call is a dedicated period. Most issues can be resolved during business hours, but emergencies are handled at all times.

For [2] and [3], there were unwritten patterns around how much you really needed to be at work once your shift was over if on-call was particularly bad.

At [1], the on-call was particularly long and harsh for a couple reasons. In the early days, I heard that the on-call was absolutely horrible. Logs were non-existent, errors were terrible and required a great deal of work. But, it caused developers to feel the pain of not logging properly, not handling errors correctly, and not monitoring usefully. Over time, those issues were resolved, and the team now has incredible logging and incredible tooling, because the developers know that they're going to be the ones who have to fix it.

At [2], the constant trouble of code prior to my time there caused the developers of the old code to make it more stable. The services eventually became auto-resolving, we had a network operations center (with appropriate work hours that covered the whole day) that had playbooks for all the remaining normal issues; and, the bad stuff made it to us. On-call 24/7 meant I might get called once every couple weeks or less by the end of my tenure there. I lived a normal life.

At [3], we're still learning and the code is in constant churn. Issues come up and we attempt to fix the root cause on most of the issues. Our logging has gradually improved and our monitoring has been improving and they're tweaked to find real issues.

--

My thoughts:

I think on-call is an important experience for developers. Developers should be first responders for their code when it hits production for the first day or two to catch any possible issue.

Developers should know the pain of deploying their change at noon or on a Friday at 5pm, or at 11pm on a Wednesday, so that they accept responsibility and importance if it breaks at those times, and those actions should be above and beyond their on-call rotation.

If the work of the on-call is especially intense, it should be a separate role that the developers take, with a rotation so that that's all that specific developer is working on.

Developers should write code and review code with debugging and tracing and monitoring and self-correction in mind, to reduce on-call pain - and one of the best ways to do that is to make them feel it, themselves.

If your code-base is having as many issues as you suggest, there are probably some common areas and pitfalls that the code has, and maybe there'll be patterns the team can implement each time those same issues come up. As a result, those errors won't come up as frequently.

If the monitors are too noisy with non-errors, then a couple things could be going on. Let's say that the code 500s when someone passes an invalid argument, or a record isn't found. Those probably shouldn't be 500s, so the code needs to be updated for them to not be. On the other hand, if there's a monitor checking for more than 5 401s in a minute, maybe that's a bit strict and should be changed to "more than 10 401s a minute, every minute for 10 minutes; OR more than 200 401s a minute" - that way you catch the big ugly case of "our auth service is down" and aren't paged because of people failing to enter their password a bunch (but giving up).
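
To make that 401 rule concrete, here's a rough sketch in Python. It assumes you already have per-minute counts of 401 responses from somewhere (oldest first); the function name, constant names, and input shape are all made up for illustration and aren't tied to any particular monitoring tool.

```python
from typing import Sequence

# Thresholds taken from the rule described above; the names are illustrative.
SUSTAINED_THRESHOLD = 10  # more than 10 401s a minute...
SUSTAINED_MINUTES = 10    # ...every minute for 10 minutes
SPIKE_THRESHOLD = 200     # OR more than 200 401s in a single minute


def should_alert(counts_per_minute: Sequence[int]) -> bool:
    """Decide whether the latest minute of 401 counts should page someone.

    counts_per_minute holds one 401 count per minute, oldest first.
    The function is meant to be evaluated once per minute as new data arrives.
    """
    if not counts_per_minute:
        return False

    # Spike rule: the latest minute alone is bad enough
    # (the "our auth service is down" case).
    if counts_per_minute[-1] > SPIKE_THRESHOLD:
        return True

    # Sustained rule: every one of the last 10 minutes exceeded the low
    # threshold. With less than 10 minutes of history we never fire.
    recent = counts_per_minute[-SUSTAINED_MINUTES:]
    return len(recent) == SUSTAINED_MINUTES and all(
        count > SUSTAINED_THRESHOLD for count in recent
    )
```

With this, a single minute with 6 failed logins (which would trip the strict "more than 5 in a minute" monitor) stays quiet, ten straight minutes above 10 fires, and a single minute above 200 fires immediately.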

If the code is an absolute and unfixable mess and you don't want to help fix it, if management is not interested in improving common pitfalls, then maybe it's time for you to look for another job.

Here's some additional reading: https://sre.google/sre-book/being-on-call/


👤 derision
Write better code that doesn't need support