As at many SV companies, on-call here isn’t compensated, on the rationale that it’s part of your engineering duties. I buy this to some degree (someone does have to be keeping an eye on things), but it's complicated by sizable inequities across the org. _Most_ people have no on-call rotation, many others have a token rotation that’s ~never used, and only a handful of teams have rotations that are quite bad. Management has extricated themselves completely.
Things have been trending slowly worse. In a gambit to prioritize uptime over engineer time, we have more alarms, tighter tolerances, and a larger operation that generates more tail problems. Good for users, but not so good for us. Being able to sleep fully through the night is increasingly rare. Some pages are false positives, but most are not, and most are not easily fixed by more engineering.
Expected time to respond has dropped to low single digits; theoretically, you should not be exercising or driving if you’re on. The scheme works because many engineers are in their 20s and willing to soak up pain like a sponge. Rotations tend to get smaller over time as single people make backroom deals to get out, and new blood is added too slowly.
I’m not trying to get myself out, but want to effect some kind of change. IMO compensation or extra time off would be ideal: not only is it a nod to the cost of on-call, but it also makes exchanging shifts easier by adding incentive beyond simple goodwill. The company could easily afford it, but probably doesn’t want to pay for what it can get for free.
I have frequent conversations with my manager and get token “yeah, we’re looking into it”s, but it’s obviously not a priority for anyone up the chain. Has anyone else been in a similar position? Are you paid? What did you do? Suck it up? Leave?
Have you had a conversation with your skip-level manager? If so, then you are probably right that it's not valued up the chain and you should leave because that is a total shit show that is not the norm.
If you haven't, reach out for time on their calendar, and write down your data points on on-call wake-up rates and total alarms over time, and let the data make the point that this is not sustainable.
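If it helps, here is a rough sketch of how those data points could be pulled together from a page log. Everything in it is an assumption for illustration: the `PAGES` timestamps are made up, the "night" window of 22:00 to 06:59 is arbitrary, and `weekly_summary` is a hypothetical helper, not something from any paging tool.

```python
from collections import Counter
from datetime import datetime

# Hypothetical page log: local timestamps at which the pager fired.
# Replace with an export from whatever paging tool you use.
PAGES = [
    "2021-03-01 02:14", "2021-03-01 14:30", "2021-03-03 03:50",
    "2021-03-09 23:10", "2021-03-10 04:05", "2021-03-10 11:00",
    "2021-03-17 01:20",
]

# Assumption: a page counts as a "wake-up" if it fires between 22:00 and 06:59.
NIGHT_START, NIGHT_END = 22, 7


def is_night(ts: datetime) -> bool:
    """True if the timestamp falls inside the assumed night window."""
    return ts.hour >= NIGHT_START or ts.hour < NIGHT_END


def weekly_summary(pages):
    """Return {(iso_year, iso_week): (total_pages, night_pages)}."""
    total, night = Counter(), Counter()
    for raw in pages:
        ts = datetime.strptime(raw, "%Y-%m-%d %H:%M")
        week = tuple(ts.isocalendar())[:2]  # (ISO year, ISO week number)
        total[week] += 1
        if is_night(ts):
            night[week] += 1
    return {week: (total[week], night[week]) for week in sorted(total)}


if __name__ == "__main__":
    for (year, week), (all_pages, night_pages) in weekly_summary(PAGES).items():
        print(f"{year}-W{week:02d}: {all_pages} pages, {night_pages} at night")
```

A per-week table like this, tracked over a few months, is usually enough to show whether the trend is getting worse.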
The Director should have some options. How big is the rotation? Is the manager in the rotation themselves? When you're on call are you also expected to contribute story points to the sprint? Why are you not able to solve underlying engineering issues that are causing the SLO violations?
If you came to me, I would be shocked, and immediately make a plan with the engineering manager. Any time a person is woken up by an alarm it's an incident. There needs to be a response to every incident. There needs to be some serious bar-raising and you can't do it yourself. You need an ally in your management chain and if you don't have one, you're better off transferring teams or companies.
At startups, it's harder; you simply can't have 3 dev teams on 3 continents, so someone is going to have to be around at night. The balance we found at the company I work at is that you are oncall Thursday-Thursday and get Friday off. (Not "not oncall", but "don't show up".) This seems fair to me; the free vacation day is really nice! (We're hiring! https://pachyderm.io/careers/)
When I was at Google, I worked on Fiber and we didn't have a dev presence around the world, so we had to be oncall after hours. We had a dedicated operations team with people paid to be at work during strange hours, as you'd expect from an ISP, but some issues were escalated to the dev team, and we had to be around for those. I was also the TL for a monitoring system that informed operations of outages, so my team would need to be around to handle monitoring monitoring ;) We just got paid extra for every hour we were oncall, I remember it being something like $1600 per week, but I forget the exact number. I was happy with this arrangement. Other people weren't, and weren't asked to be oncall, and it didn't count against them in any way. It all seemed fair to me.
If the company is not allocating the proper resources to the issue and it's affecting you personally then you need to leave. You have a business relationship with work, don't let it become personal.
Speaking only of your situation, your company isn’t going to appropriately comp you for the on call burden, and they’re going to string you along (“we’re working on it”) as long as they can. If you stay, you will continue to suffer, and unless your comp is exceptional, it doesn’t appear to be worth it.
They might change after enough folks burn out and/or leave, but that’s not within your control. Your quality of life is within your control.
This is twofold; namely your team and management should be aware that you aren't available for normal work capacity when you're on call.
> theoretically, you should not be exercising or driving if you’re on
This cannot possibly be sustainable. Your company needs to have someone else available: a backup in case one person misses an alert, people for at least the other 2 shifts, and someone who can cover while you are driving, eating, exercising, or using facilities.
Your company is just lying to itself if it believes it has any coverage.
> In a gambit to prioritize uptime over engineer time, we have more alarms, tighter tolerances, and a larger operation that generates more tail problems. Good for users, but not so good for us.
This sounds like the crux of the problem. Your company has prioritized rapid fixes over sustainable engineering. The bandaid may be repeatable, but that doesn't make it sustainable with growth. The simplest solution is that for every amount of time spent on call, 2x as much time should be spent resolving the tech debt that leads to such a situation.
> IMO compensation or extra time off would be ideal
I think that you should negotiate this based solely on the fact that you can no longer sleep. I.e., you should take days off for every night you work.
I think the steps you can take are:
1. Make it clear to your manager this is unacceptable, and you will end up looking for alternate teams/jobs if this goes on
2. Make the same thing clear to your skip level
3. Quit / change teams, citing oncall as the issue
There's no point in doing anything else, in my experience. It's someone else's job to make sure that your oncall experience is prioritized. It sucks to leave an otherwise good job.
For extra credit - try to propose some solutions. Why are some issues not solvable by engineering? Would simply resetting expectations mitigate the largest issues/waking up at night?
There are so many red flags in your post, if I were you, I'd start looking for my way out immediately (I don't know your personal situation so ymmv). Company culture changes slowly, and based on the number of red flags, I'd say it's probably easier to leave and find something better. If you are staying because you like working with some people, ask them to consider joining the same company where you get hired. If you try to change the organization, be prepared that people will stop liking you because you are not the obedient little code monkey anymore and they won't like that you don't let them exploit you anymore.
> on-call isn’t compensated with the rationale that it’s part of your engineering duties
That's just wrong, do not let them convince you this is normal. If they want you to be available at all times, then they should pay. If they don't want to pay, tell them you won't do on-call anymore.
> In a gambit to prioritize uptime over engineer time
I see they are very generous with your time.
> Being able to sleep fully through the night is increasingly rare.
Again, it's a sign that your system is unstable. You need to ask them to prioritize fixes to these issues, even if they can't be solved easily. Take a good look at how development is organized. Do you have automated tests, code reviews, knowledge sharing? Are you always working on features and ignoring bugs? Running systems should not be this hard.
> yeah, we’re looking into it
This is an acceptable answer exactly once.
> The company could easily afford it, but probably doesn’t want to pay for what it can get for free.
I didn't want to be philosophical but: Power concedes nothing without a demand.
I was hit by on-call duty pretty hard at some point in my career. I was sleep deprived and was not able to execute on my regular tasks. This also led me into depression and increased my anxiety. Even though I started working on my issues with a therapist, I was not able to recover and was let go.
Remember about taking care of yourself.
After a lot of thought, asking for on-call pay is a loss for me. First, there are many people willing not to ask for it. Second, I don't want to be associated with the ones who do. Third, even in the best case the on-call pay doesn't seem to compensate me enough for the time spent. Fourth, it makes it harder to negotiate my base salary, which is where the money is. Fifth, it creates an unhealthy incentive to spend even more time at work rather than be more efficient with it (for example, working to create an environment where I don't have to be on-call or have to spend less time after hours in general).
So, instead, I am showing I take ownership of the area I am working in, I am willing to sometimes decide that the project requires me to spend some extra time, that I am happy to do what is needed to get the job done, and I try to sell it to my company as a complete package.
Now I would definitely ask some questions if I felt that this responsibility was falling disproportionately on my shoulders.
Do all employees at my current level have the same on call responsibilities and schedules? If not, how are custom schedules arrived at?
Do more senior employees work on call rotations and if not, at what job level are they excused?
Those are a couple that seem very reasonable to me.
Best thing would be to raise this with your manager, if no real action is taken then leaving or changing team is an option.
Being oncall and paid for it is much better. Here your personal time being lost with no compensation is simply not worth it. In fact if you don't respond in time it may reflect badly on you.
There is an approach that can be taken to focus sprints on only improving oncall but it requires management buy-in. How bad is it? Is it something out of your team's control, or is it something you could fix for good if you spent an extra hour on it?
If your company doesn't want to pay for the aggravation, put your phone on silent, make sure alarms escalate to your manager, and start looking for another job.
Otherwise, a little extra involvement might be necessary. Ask for your manager's private phone number. When there's a problem, share your problem with them. In the middle of the night. They may get some new insights into the difficulties.
Sometimes this is an "up the chain" type problem, but if the other engineers on the team don't agree with you that the on-call rotation is too painful, it's going to be hard to convince management that your judgment is correct.
If you don't want to simply switch teams, my suggestion is to think of what engineering work you can do in order to improve the on-call experience. Then propose that you work on these projects, to your manager. Quantify the amount of engineering time and increased reliability your projects will save. In my experience it is far easier to get management to agree to a specific plan to improve the situation than to get management to find someone else to solve a problem for you.
Another idea - since you work at a large company, there are probably teams who handle this very well at your company. Infrastructure teams who have scaled components that in the past have been overloaded and now are widely used within the company, that sort of thing. Try asking for advice in a "horizontal" way, finding experts on other teams and asking how they have solved these issues in their teams. These "horizontal" experts will be able to give advice that's specific to your company. This is especially true if your team is working on a product area and your coworkers are not specialists in making reliable systems, but your company has infrastructure specialists on other teams.
Change your thinking and approach.
You can do this by culturally re-prioritising the development team's workload to fundamentally treat the root causes of any outage and of regular alerts as urgent to resolve.
The work needed to fix the root cause gets to kick something out of the current sprint to be attended to immediately.
The dev/product team should fundamentally agree the alerts should be rare, not regular.
Instead of just tweaking alarms, and feeling beaten down at the regular issues, change your thinking to tackle the root causes and fix them, just like any bug or new feature.
You’ll become excited that you’re solving the issues.
By having this shared understanding in the dev team to always be resolving root cause of outages, including architecture restructures and rebuilds of components that take weeks or months, you’ll reduce these incidents dramatically.
Finally, by doing this, you share the pain with everyone else - product managers and business leads don’t get their features or other improvements as fast, they now see what you deal with, they’ll ask why things appear to have slowed down, and you can now say you need more resources.
I've also seen teams where this festered and no one fixed it. I usually got called in in the end to fix it. Often the engineers weren't even talking to the managers about the issue and that's all a fix took, a solution that wasn't just more money or more people. It also helps if you can come up with a basic cost benefit analysis in terms of wasted dev time that could be used for something else. This is a language managers speak.
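For what it's worth, the cost benefit analysis can be as basic as the back-of-the-envelope sketch below. All the numbers are invented and the two helper functions are hypothetical; plug in your own page counts, hours lost per page, and loaded hourly cost.

```python
def weekly_page_cost(pages_per_week, hours_lost_per_page, hourly_cost):
    """Dev time burned by pages each week, expressed in dollars."""
    return pages_per_week * hours_lost_per_page * hourly_cost


def break_even_weeks(fix_hours, hourly_cost, weekly_cost_saved):
    """Weeks until a one-off fix pays for itself in saved page cost."""
    return (fix_hours * hourly_cost) / weekly_cost_saved


if __name__ == "__main__":
    # Assumed figures: 6 pages a week, 2 hours lost per page
    # (response plus a groggy next morning), $100 per engineer-hour.
    cost = weekly_page_cost(pages_per_week=6, hours_lost_per_page=2, hourly_cost=100)
    # Assumed fix: 80 engineer-hours that would eliminate those pages.
    weeks = break_even_weeks(fix_hours=80, hourly_cost=100, weekly_cost_saved=cost)
    print(f"Pages cost about ${cost} per week; the fix pays off in {weeks:.1f} weeks")
```

It ignores attrition and customer impact, so if anything it understates the case, but a single break-even number is easy for a manager to act on.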
You should really consider and discuss with your manager several of the options in the comments: pay, sleep replacement time, more people on the loop, better automation, tech debt work that is focused on burning down the most common pages, etc. It's never a great idea to show up with only one possible fix, especially when that's "pay me more". They may not be able to, or may not think you are worth it, and then your options are leave or deal. If you have quite a few more options maybe a compromise can be reached.
Engineers just suffering in silence and then quitting in anger is really the worst option tho. So open a dialog if you have not about other options.
As some others said, if you're not getting traction, also talk with your skip... you are meeting with your skip right? But don't come to them with problems and gripes. Come to them with possible solutions and get their advice on those solutions, and be open to their suggestions as well.
On-call responsibilities are supposed to be a two way street between an employee and an employer.
Employers expect employees to be on-call and handle production incidents quickly. That's good for the product.
The two way side of it is that employees must have the autonomy and time to fix the root causes of what's paging them to reduce toil.
This is the root of "you build it, you own it". "Own" means having autonomy.
That kind of engineering work does come at the expense of feature delivery. However, it's also good for the product.
Regarding getting paid more for going on-call, from your description, the issue doesn't sound like it's a financial one. If you received $X00 per week more, would that be an acceptable tradeoff for the constant anxiety of your phone paging you at any time or waking up at least once per night?
(source: am ex-PagerDuty and founded a company to help drive software ownership, so I've thought a lot about this)
To address the question in the post title, a team I was on was able to re-negotiate the on-call terms. Our team didn't have any operations to speak of (we just wrote software, and didn't build services) so we were lumped into a rotation for the org we were in. When the pager went off, not only did we not have any familiarity with the system, we didn't have permissions to do anything anyway. We just ended up having to page someone else for every little thing.
We ganged up on management, told them that we simply were not empowered to take any actions during shift to address issues or off shift to improve things, and got taken off that rotation.
Where I'm at now, if someone has a rough night or a couple of rough days, we'll trade part of the shift to give the person a break.
This seems like the crux of the issue. It sounds like there is a long tail of issues that are hard to fix but have large customer impact. Or do they?
If these long tail issues didn't get fixed, how much revenue would it cost? Figuring that out seems key. If it's a lot of revenue, then it would make sense to spend the time to do the hard engineering fixes. If it's not a lot, then it makes sense to let you sleep.
> Management has extricated themselves completely.
This is a big issue too. If the problems warrant waking you up, they should be serious enough to involve management. If they aren't, then it sounds like they're waking you up for no reason.
But overall it sounds like the company you work for is a complete joke that doesn’t care about employee health, and you should leave them asap.
Good luck! Don’t forget, engineers are in high demand across the globe!
The usual incident review and postmortem process can be applied. If incidents happen too often to review them all, you can start by applying the process to a subset.
Firefighting is a waste of talented technical resources and results in good people leaving.
One way we have been trying to improve this is working with PagerDuty reporting and looking at the total amount of interruptions (not just pages but anytime PagerDuty reminds you for an alert/expired snooze/escalation) with the team. It's very easy to forget the oncall as you leave, but having more eyes on the shifts starts to bring awareness and lots of "why is that still broken" questions that are better answered at 10am vs 3am on a Sunday. I came from a large Operation Center so I know the pain of bad alerts, mostly CYA stuff where it was put in place just to make sure the last guy can't get blamed. Sort of like adding 100s of random smoke detectors in a building without any fire suppression. The intention is good but the results are poor.
Outside of the meeting with the team, we also have proper handoff meetings with off call and on call, so they can share what's going on verbally instead of tagging the next person with the alerts. Makes it easier to share what's going on, any weird problems, notes. Also we're not using a 24/7 oncall coverage but 12/5 and 48/2 for the weekends, it's a small change but helps so much. The worst I ran was a 7/24 at a major email company and was paged every three hours, for the entire week. After that I knew the team didn't want to change and I needed to do something about it.
Sleep matters.
I wonder if maybe it's a FAANG thing that cheap startups try to copy?
We were on-call (a week every 3 months) with my last employer but it wasn't too bad and it was spread equally across people. It wasn't compensated but during the on-call you didn't do any product work, just improving monitoring and alerts so that being on-call didn't suck & recovering if something happened at a bad time.
Still, being on-call sucked because we had too many stupid monitors checking on trivial things that weren't important and that people were too afraid to touch.
A few other companies I've been at just have a dedicated infra team on-call, which gets paid more.
This sounds like a fair solution. I would've liked the extra money as a youngster and now I would gladly avoid messing up my limited sleep.
I'm now at a company that takes these things seriously and I'm learning a ton of good habits and I feel really proud of the work that I'm doing.
Cost the company upcoming tenders and my entire team soon followed. Not something I'm proud of. But priority 1. should be personal wellbeing.
One thing I learned from this is that I fail to convey the severity of issues to certain types of managers. Framing the issue in dollars helps in those cases.
And if it is understood and still ignored? Welp, time to move on.
Of course management could fire you for that, but then they don't have anyone on-call anymore, and they're unlikely to find new engineers willing to add it as an unpaid responsibility.
Not the solution for everyone but people overlook the option of working for yourself.
~7 years and a couple of jobs later -- totally worth it.
Just make sure you're using that newly found freedom wisely; don't fight for freedom and then do 3 hours of uninterrupted netflix every night.
Customer (Telstra, Australia's national telecom) wanted us to sleep on site.
We were contractors, so we all agreed we would say yes, provided we were paid for 24 hours.
The customer decided they didn't need us to sleep on site.
Are there actual user-facing issues occurring every night? That sounds extremely bad and unusual.
i never worked in a company that expected uncompensated on call
even those that had no official policy typically had a lenient look-the-other-way approach so you can, say, take a half day off if you got called up at night
people gotta realize engineer/support churn is more expensive long term than giving people a fair deal
The different on-call rotations worked out thusly:
1. on-call was 1 month long. Response times had to be very short. During business hours, there was a large queue of long-tail work that needed to be resolved that was outside my normal work. Most of the employees here were in their 20s and 30s, probably.
2. Small company. Probably 30 devs total. I was on a team of 1, 2 and eventually 3 people. on-call was 24/7 for my team. Response time was about an hour. I was the youngest employee and most employees here were in their 40s or beyond.
3. Smallish company. < 500 employees. Dev team size of 6ish. On-call is a week-long venture. Turn around time is very short, I think 30 minutes? On-call is a dedicated period. Most issues can be resolved during business hours; but, emergencies are handled at all times.
For [2] and [3], there were unwritten patterns around how much you really needed to be at work once your shift was over if on-call was particularly bad.
At [1], the on-call was particularly long and harsh for a couple reasons. In the early days, I heard that the on-call was absolutely horrible. Logs were non-existent, errors were terrible and required a great deal of work. But, it caused developers to feel the pain of not logging properly, not handling errors correctly, and not monitoring usefully. Over time, those issues were resolved, the team has incredible logging and incredible tooling, knowing that they're going to be the ones that have to fix it this time.
At [2], the constant trouble of code prior to my time there caused the developers of the old code to make it more stable. The services eventually became auto-resolving, we had a network operations center (with appropriate work hours that covered the whole day) that had playbooks for all the remaining normal issues; and, the bad stuff made it to us. On-call 24/7 meant I might get called once every couple weeks or less by the end of my tenure there. I lived a normal life.
At [3], we're still learning and the code is in constant churn. Issues come up and we attempt to fix the root cause on most of the issues. Our logging has gradually improved and our monitoring has been improving and they're tweaked to find real issues.
--
My thoughts:
I think on-call is an important experience for developers. Developers should be first responders for their code when it hits production for the first day or two to catch any possible issue.
Developers should know the pain of deploying their change at noon or on a Friday at 5pm, or at 11pm on a Wednesday, so that they accept responsibility and importance if it breaks at those times, and those actions should be above and beyond their on-call rotation.
If the work of the on-call is especially intense, it should be a separate role that the developers take, with a rotation so that that's all that specific developer is working on.
Developers should write code and review code with debugging and tracing and monitoring and self-correction in mind, to reduce on-call pain - and one of the best ways to do that is to make them feel it, themselves.
If your code-base is having as many issues as you suggest, there are probably some common areas and pitfalls in the code, and maybe there will be patterns the team can implement each time those same issues come up. As a result, those errors won't come up as frequently.
If the monitors are too noisy with non-errors, then a couple things could be going on. Let's say that the code 500s when someone passes an invalid argument, or a record isn't found. Those probably shouldn't be 500s, so the code needs to be updated for them to not be. On the other hand, if there's a monitor checking for more than 5 401s in a minute, maybe that's a bit strict and should be changed to "more than 10 401s a minute, every minute for 10 minutes; OR more than 200 401s a minute" - that way you catch the big ugly case of "our auth service is down" and aren't paged by people failing to enter their password a bunch (but giving up).
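To make that relaxed rule concrete, here's a minimal sketch of the "sustained OR spike" check. It assumes you already have per-minute 401 counts as a plain list, oldest first; `should_page` and its parameter names are made up for illustration and aren't tied to any particular monitoring tool.

```python
def should_page(counts_per_minute,
                sustained_threshold=10,
                sustained_minutes=10,
                spike_threshold=200):
    """Decide whether to page, given per-minute 401 counts (oldest first).

    Pages when either:
      - the most recent minute alone exceeds spike_threshold, or
      - every one of the last sustained_minutes minutes exceeds
        sustained_threshold.
    """
    if not counts_per_minute:
        return False

    # Spike case: something like "our auth service is down".
    if counts_per_minute[-1] > spike_threshold:
        return True

    # Sustained case: needs a full window with every minute over the threshold.
    window = counts_per_minute[-sustained_minutes:]
    return (len(window) == sustained_minutes
            and all(count > sustained_threshold for count in window))


if __name__ == "__main__":
    print(should_page([6] * 10))         # a few failed logins per minute
    print(should_page([11] * 10))        # elevated for the full 10 minutes
    print(should_page([11] * 9 + [3]))   # elevated, then recovered
    print(should_page([0, 250]))         # sudden spike
```

The first case would have fired under the old "more than 5 in a minute" monitor but stays quiet here; the sustained and spike cases still page.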
If the code is an absolute and unfixable mess and you don't want to help fix it, if management is not interested in improving common pitfalls, then maybe it's time for you to look for another job.
Here's some additional reading: https://sre.google/sre-book/being-on-call/