I've been in various IT administration, development and DevOps positions for the last 20 years with differing "on call" responsibilities, and have never had anything as intrusive as this.
Getting to the point - my current manager says that getting paged every day of your primary support shift is "normal in the industry for operations". While this definitely doesn't match my personal experience - I'm curious: do any of you in technical support roles with "on call" responsibilities get paged this frequently? If not, what does a "normal" shift look like for you?
Thanks kindly for any feedback!
After-hours calls should come infrequently, or in situations where someone’s personal involvement (for example, as the engineer with primary responsibility for a particular component or its maintenance) is indispensable.
In my experience, things that need a lot of unplanned attention are more likely to fail, if they haven’t already, in ways that have other unacceptable consequences. Fixing them should be a priority for this reason, too.
You haven’t mentioned why you keep getting paged. Is it the same problem repeatedly, or lots of different problems? Is there any hope of addressing the underlying causes?
In the first, I was on the on-call rotation one weekend a month for two years and got called twice.
We were paid for 1h of work if we weren't called and 4h if the phone rang; anything over 4h of work went straight to a full day.
Currently I am on call and only get paid if called, but my manager only calls me in critical situations; I have been called 2 times in a year and 7 months. If I get called, I get half a day of work paid.
We've invested a load of time in reducing the frequency of paging incidents over the years. The entire technology organisation recognises the importance of fixing said incidents and how disruptive they are to people's lives/sleep/etc.
At a previous company I was on call every second week and would receive a call maybe once every few months. That was with many hundreds of servers.
At another company I’m on call one week per month and get called once or twice. That’s with just a few hundred servers.
In the first case all time was reimbursed in lieu. In the second case my salary more than makes up for any inconvenience.
However in both cases I was very proactive in defining what is on call - critical production issues only. If it’s not critical or not production then I won’t log on to look at it.
And in both cases I had a LOT of false alarms from bad alerts when starting. I had all false alarms disabled.
You’ll get pushback but I didn’t care - you can’t have an alarm waking up people every night on the off chance that one in a hundred will actually be an error. And hilariously, if you started including your boss on the call, they’d quickly agree it’s not acceptable. The human cost isn’t worth it.
While there’s often tonnes of room for improvements to monitoring and alerting (root cause analysis etc) that others have mentioned - in my experience most of the metrics and alarms are garbage anyway, and can and should be done away with. If it came from a boxed product, nearly all of it should be turned off from the get go. That crap is always pointless.
Oh no a server CPU usage has increased and memory is low because - it’s doing what it’s meant to? What junk.
The key criteria for me and paging are:
1. Was the page actionable? Did I need to do something to restore the system to functioning or prevent it from going down?
2. Can I prevent this page in the future and, most importantly, am I empowered by leadership to do that? If your app is paging me because it’s poorly made and I am not authorized to change it, that’s a leadership problem, and an extremely common one.
3. Are we auditing the pages? Often alerts in technology are designed in response to a particular problem and then never removed. Paging is, to me, a very serious action for a system to take. It means it is impossible for the system to naturally recover and all automation has failed. So every time we page someone we should as a team review those pages to ensure they’re actionable and actually impossible to naturally recover from.
These criteria have served me well for years and caused me to turn off the vast majority of the alerts of my services.
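As a rough sketch of what auditing pages against those criteria could look like: the snippet below assumes you keep some record of past pages and have marked, after the fact, whether each one was actionable and whether the system recovered by itself. The `Page` record, the `audit_pages` function and the 50% threshold are all made up for illustration, not taken from any particular tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Page:
    alert: str            # name of the alert rule that fired (hypothetical)
    actionable: bool      # did the responder have to do something?
    self_recovered: bool  # did the system recover without intervention?


def audit_pages(pages, min_actionable_ratio=0.5):
    """Return alert names whose pages were mostly not worth waking someone for."""
    totals = defaultdict(int)
    useful = defaultdict(int)
    for page in pages:
        totals[page.alert] += 1
        # A page counts as useful only if someone had to act and the
        # system would not have recovered by itself.
        if page.actionable and not page.self_recovered:
            useful[page.alert] += 1
    return sorted(
        alert for alert, total in totals.items()
        if useful[alert] / total < min_actionable_ratio
    )


# Made-up page history for illustration.
history = [
    Page("cpu_high", actionable=False, self_recovered=True),
    Page("cpu_high", actionable=False, self_recovered=True),
    Page("db_down", actionable=True, self_recovered=False),
    Page("disk_full", actionable=True, self_recovered=False),
    Page("disk_full", actionable=False, self_recovered=True),
]
review_candidates = audit_pages(history)
print(review_candidates)
```

Anything the function returns is only a candidate for the team review; whether the alert gets tuned, downgraded to a ticket or removed is still a human decision.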
But you seem to have a culture that accepts this as normal and tbh these rarely change. Just know that it isn’t normal and it’s not acceptable.
My employer makes use of PagerDuty and I’ve spent a lot of time setting up “auto-resolve” of alerts. I even hook into AWS autoscaling lifecycle events and send mock “OK” actions when something gets terminated that had thrown an alarm. I still get paged, but most issues solve themselves if I wait one more monitoring interval.
I’ve also used being on call as an excuse to leave early, to ensure I’m home and able to respond to calls when everyone else leaves the office. There’s not much I can do if I’m stuck in traffic, or in a tunnel, etc.
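For anyone wondering what that kind of auto-resolve hook might look like, here is a minimal sketch, not the poster's actual setup. It assumes the termination lifecycle event arrives as a dict shaped like the EventBridge notification for an Auto Scaling lifecycle hook, and that the original alert was triggered through the PagerDuty Events API v2 with a dedup key derived from the instance ID. The `dedup_key_for` naming scheme and the placeholder routing key are assumptions. The code only builds the "resolve" payload; actually sending it would be an HTTP POST of the JSON to the URL shown.

```python
import json

# PagerDuty Events API v2 endpoint; the payload would be POSTed here as JSON.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def dedup_key_for(instance_id):
    # Assumed convention: the alert that paged used this same key when it
    # triggered, so a resolve with the same key closes that incident.
    return f"instance-alarm-{instance_id}"


def build_resolve_event(lifecycle_event, routing_key):
    """Build a PagerDuty 'resolve' payload for a terminating instance, or None."""
    detail = lifecycle_event.get("detail", {})
    if detail.get("LifecycleTransition") != "autoscaling:EC2_INSTANCE_TERMINATING":
        return None  # not a termination event, nothing to resolve
    instance_id = detail.get("EC2InstanceId")
    if not instance_id:
        return None
    return {
        "routing_key": routing_key,
        "event_action": "resolve",
        "dedup_key": dedup_key_for(instance_id),
    }


# Hand-written sample event; the instance ID is a placeholder.
sample_event = {
    "detail-type": "EC2 Instance-terminate Lifecycle Action",
    "detail": {
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
        "EC2InstanceId": "i-0123456789abcdef0",
    },
}
payload = build_resolve_event(sample_event, routing_key="YOUR_INTEGRATION_KEY")
print(json.dumps(payload, indent=2))
```

The idea is simply that an alarm about a machine that no longer exists is not worth a human's attention, so the incident gets closed as soon as the instance starts terminating.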
The question you should be asking is: why am I being paged so often?
Are they legitimate things that you need to respond to? If so, you should be fixing these issues so that they don't happen again. If anyone gets a page, we make it a high priority to fix whatever caused it. We are a team of 7, and we dedicate one person a week to field questions relating to our platform as well as to fix up these issues that wake us up.
If they're not legitimate things that you need to be woken up for, why are you being woken up? If this is the case, you need to make sure everyone is on the same page regarding what constitutes something you need to be paged for after hours.
So basically, I don't work at companies that make their employees carry a pager etc. Life is too short for that shit.
I worked briefly at a startup shortly after its acquisition by a FAANG. The startup's code was trash. While on call, after digging for a while, I acknowledged that I didn't exactly know what was going on and asked for help. I was then reprimanded for "not knowing the code well enough", basically because I asked for help. I left about a month after that. Again, life is too short for that shit.
Normal shift is like every other day. Just go to work, do my job. Come home, eat, chill a bit and go to sleep.
It used to be more. The company started with three people three years ago (myself included). Now we're over 50. We have enough resources to fix and solve problems before they become real problems.