For those on call, how often are you called?

Question

My current role as an IT Operations Engineer has recently forced me to join an on-call rotation, which when I'm on primary support pages me 5-7 times per week after hours (so literally daily).
I've been in various IT administration, development and DevOps positions for the last 20 years with differing "on call" responsibilities, and have never had anything as intrusive as this.
Getting to the point - my current manager says that getting paged every day of your primary support shift is "normal in the industry for operations". While this definitely doesn't match my personal experience, I'm curious: do any of you in technical support roles with "on call" responsibilities get paged this frequently? If not, what does a "normal" shift look like for you?
Thanks kindly for any feedback!

kylek · Accepted Answer

Worked at a FAANG, 5-7 was peanuts for the rotation I was on there. The interesting thing (I don't know if I liked it or not) was that when you're on call, that's all you do (even during normal hours, that is), no "normal" work/projects during that time (which relieves a giant burden for everyone NOT on call). At the end of the rotation, there is a proper hand-off to the next on call; every issue that came up is reviewed and a plan put in place to fix it "for good" (meaning a backlog task gets created and assigned to someone during the next sprint planning). If there's no planning to root-cause and fix the underlying problems, run.

wsh · Answer

I wouldn’t accept that as normal. In well-run organizations, when there is a regular, ongoing need for evening or overnight coverage, it’s provided by people scheduled to work during those hours, who are selected and trained to be able to handle most situations on their own.
After-hours calls should come infrequently, or in situations where someone’s personal involvement (for example, as the engineer with primary responsibility for a particular component or its maintenance) is indispensable.
In my experience, things that need a lot of unplanned attention are more likely to fail, if they haven’t already, in ways that have other unacceptable consequences. Fixing them should be a priority for this reason, too.
You haven’t mentioned why you keep getting paged. Is it the same problem repeatedly, or lots of different problems? Is there any hope of addressing the underlying causes?

aprdm · Answer

I have been a lead DevOps engineer at my last two companies, both of them with more than 1k VMs across 4+ on-prem data centers.
At the first I was on the on-call rotation for one weekend a month for two years and got called twice.
It paid 1h of work if you weren't called and 4h if the phone rang; if you worked more than 4h, it went straight to a full day.
Currently I am on call and only get paid if called, but my manager only calls me in critical situations; I have been called 2 times in a year and 7 months. If I get called I get half a day of work paid.

AdamGibbins · Answer

This is not normal. Our on-call schedules run 5pm-9am Monday to Friday, and 5pm Friday to 9am Monday. If I were paged twice in a week that would be a bad week; being paged at all is fairly uncommon now. Historically it was more common, but nowhere near daily. That would be entirely unacceptable.
We've invested a load of time reducing the frequency of paging incidents over the years. The entire technology organisation recognises the importance of fixing said incidents and how disruptive they are to people's lives/sleep/etc.

sqldba · Answer

I don’t think it’s normal.
At a previous company I was on call every second week and would receive a call maybe once every few months. That was with many hundreds of servers.
At another company I’m on once a week per month and get called once or twice. That’s with just a few hundred servers.
In the first case all time was reimbursed in lieu. In the second case my salary more than makes up for any inconvenience.
However in both cases I was very proactive in defining what is on call - critical production issues only. If it’s not critical or not production then I won’t log on to look at it.
And in both cases I had a LOT of false alarms from bad alerts when starting. I had all false alarms disabled.
You’ll get push back but I didn’t care - you can’t have an alarm waking up people every night on the off chance that one in a hundred will actually be an error. And hilariously, if you started including your boss on the call, they’d quickly agree it’s not acceptable. The human cost isn’t worth it.
While there’s often tonnes of room for improvements to monitoring and alerting (root cause analysis etc) that others have mentioned - in my experience most of the metrics and alarms are garbage anyway, and can and should be done away with. If it came from a boxed product it should near all be turned off from the get go. That crap is always pointless.
Oh no a server CPU usage has increased and memory is low because - it’s doing what it’s meant to? What junk.

mduggles · Answer

I mean it depends on whether you are doing anything with the pages and if they’re followed up on. As someone who has been on various oncall rotations for a decade I would describe that as a pretty heavy paging load for an average rotation.
The key criteria for me and paging are:
1. Was the page actionable? Did I need to do something to restore the system to functioning or prevent it from going down?
2. Can I prevent this page in the future and most importantly am I empowered by leadership to do that? If your app is paging me because it’s poorly made and I am not authorized to change it that’s a leadership problem that’s extremely common.
3. Are we auditing the pages? Often alerts in technology are designed in response to a particular problem and then never removed. Paging is, to me, a very serious action for a system to take. It means it is impossible for the system to naturally recover and all automation has failed. So every time we page someone we should as a team review those pages to ensure they’re actionable and actually impossible to naturally recover from.
These criteria have served me well for years and caused me to turn off the vast majority of the alerts of my services.
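A rough sketch of what the audit in point 3 could look like, assuming you log each page with whether it was actionable and whether the system recovered on its own. The `Page` record, the `audit_pages` function and the 50% cutoff are all made up for illustration, not taken from any paging tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Page:
    alert: str            # hypothetical alert name
    actionable: bool      # did the responder have to do something?
    self_recovered: bool  # did the system recover without intervention?


def audit_pages(pages, min_actionable_ratio=0.5):
    """Return alert names whose pages were mostly not worth waking someone for."""
    stats = defaultdict(lambda: [0, 0])  # alert -> [total pages, useful pages]
    for p in pages:
        stats[p.alert][0] += 1
        # A page only counts as useful if action was needed and the
        # system would not have recovered by itself.
        if p.actionable and not p.self_recovered:
            stats[p.alert][1] += 1
    return sorted(
        alert
        for alert, (total, useful) in stats.items()
        if useful / total < min_actionable_ratio
    )
```

Whatever comes out of `audit_pages` is a candidate for being turned off, downgraded to a ticket, or handled by automation before it pages anyone.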
But you seem to have a culture that accepts this as normal and tbh these rarely change. Just know that it isn’t normal and it’s not acceptable.

zxcvbn4038 · Answer

My advice is to use your time on call to your advantage. Don’t address just the symptoms - when you receive a call, try to understand the root cause and take steps to prevent that situation from happening again. For example, if paged for low disk space, make sure log rotation is present, working, and aggressive enough to stay ahead of the generation rate. Have the thing that checks the disk space perform the most common remediation steps and then page only if unsuccessful. If you’re in the cloud then just kill anything that runs out of disk space; it’s the application owner’s responsibility to arrange for long-term storage, etc. Do this for every call you receive and soon your phone will be silent.
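A minimal sketch of the "remediate first, page only if that fails" idea for the disk space case. The function names, the return values and the `page` callback are assumptions for illustration; the remediation callables would be whatever you normally do by hand (force log rotation, clear temp files, and so on).

```python
import shutil


def disk_usage_ratio(path):
    """Fraction of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_disk(path, threshold, remediations, page, usage_fn=disk_usage_ratio):
    """Try each remediation in order; page only if usage stays above threshold.

    Returns "ok", "remediated", or "paged".
    """
    if usage_fn(path) < threshold:
        return "ok"
    for fix in remediations:
        fix(path)  # hypothetical cleanup step, e.g. force a log rotation
        if usage_fn(path) < threshold:
            return "remediated"
    page(f"disk usage on {path} still above {threshold:.0%} after automatic cleanup")
    return "paged"
```

The `usage_fn` parameter is only there so the check can be exercised with fake numbers instead of a real filesystem.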
My employer makes use of PagerDuty and I’ve spent a lot of time setting up “auto-resolve” of alerts. I even hook into AWS autoscaling lifecycle events and send mock “OK” actions when something gets terminated that had thrown an alarm. I still get paged but most issues solve themselves if I wait one more monitoring interval.
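One way the lifecycle hook part might look, as a sketch rather than the actual setup described here. It assumes the termination event arrives in the EventBridge shape (a `detail` object with `LifecycleTransition` and `EC2InstanceId`) and that the original alert was triggered with a `dedup_key` of the form `instance-<id>`, which is a made-up convention. The function only builds the body for a PagerDuty Events API v2 "resolve" event; sending it to the API is left out.

```python
def resolve_event_for_termination(event, routing_key):
    """Build a PagerDuty Events API v2 resolve body for a terminating instance.

    Returns None if the event is not an instance-terminating lifecycle action.
    """
    detail = event.get("detail", {})
    if detail.get("LifecycleTransition") != "autoscaling:EC2_INSTANCE_TERMINATING":
        return None
    instance_id = detail.get("EC2InstanceId")
    if not instance_id:
        return None
    return {
        "routing_key": routing_key,              # integration key for the service
        "event_action": "resolve",
        "dedup_key": f"instance-{instance_id}",  # assumed naming scheme
    }
```

This only resolves anything if the alert that fired used the same `dedup_key`, so the key scheme has to be agreed on where the alert is triggered.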
I’ve also used being on call as an excuse to leave early, to ensure I’m home and able to respond to calls when everyone else leaves the office. There’s not much I can do if I’m stuck in traffic, or in a tunnel, etc.

Niksko · Answer

I'm part of a team that operates a roughly 100 node Kubernetes cluster. I'm on call after hours for a week at a time, and am on call roughly every six weeks. I think I've been on call for three weeks this year, and I've been paged twice. Both of those were pretty straightforward problems solved within half an hour or so, with zero customer impact. This is roughly what other people in my team experience, probably averaging less than 1 page per on call rotation.
The question you should be asking is: why am I being paged so often?
Are they legitimate things that you need to respond to? If so, you should be fixing these issues so that they don't happen again. If anyone gets a page, we make it a high priority to fix whatever caused it. We are a team of 7, and we dedicate one person a week to field questions relating to our platform as well as to fix up these issues that wake us up.
If they're not legitimate things that you need to be woken up for, why are you being woken up? If this is the case, you need to make sure everyone is on the same page regarding what constitutes something you need to be paged for after hours.

algaeontoast · Answer

If I'm not doing devOps work (I explicitly avoid this garbage) and not a founder I expect to not be on call - ever.
So basically, I don't work at companies that make their employees carry a pager etc. Life is too short for that shit.
I worked briefly at a startup shortly after its acquisition by a FAANG. The startup's code was trash. While on call, after digging for a while, I acknowledged that I didn't exactly know what was going on and asked for help. I was then reprimanded for "not knowing the code well enough", basically because I asked for help. I left about a month after that. Again, life is too short for that shit.

photonios · Answer

Rarely. In a team of 3-4 engineers who share the on-call responsibility, I think one of us gets paged every 3-4 months.
Normal shift is like every other day. Just go to work, do my job. Come home, eat, chill a bit and go to sleep.
It used to be more. The company started with three people three years ago (myself included). Now we're over 50. We have enough resources to fix and solve problems before they become real problems.

EdwardDiego · Answer

Hardly ever, but then we've made it an explicit goal that if we're having to fix the system after hours, we need to fix that immediately. It used to be almost daily before we made uninterrupted sleep an explicit priority.

shifto · Answer

I've been at my current workplace a bit more than a year. I'm on call for a week every 4 weeks. This weekend was my third call.