I'm writing up some of my thoughts on what makes a good infrastructure monitoring and alert system. Do you have any thoughts based on your experience?
Actionable: I should only be alerted if there is a definite action to take: clearing up disk space, investigating suspicious activity, replacing a hard drive or BBU, fixing connectivity issues. If an alert clears on its own, that probably indicates a problem with monitoring frequency or thresholds. If an outage occurs without alerting, ditto. If an alert gets ignored, it probably indicates that there are too many non-actionable alerts.
Minimal: Reduce alerting noise. Actionable takes care of a lot of that, but parent/child dependency relationships between checks can reduce a dozen or a hundred alerts to one: "SQL database is down" rather than 100 "web frontend timeout" alerts, for example.
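To make the parent/child idea concrete, here is a minimal sketch in Python. It assumes each check has at most one parent and that you already have the set of currently failing checks. The function name `root_cause_alerts` and the check names are made up for illustration and are not taken from any particular monitoring tool.

```python
from typing import Dict, List, Optional, Set


def root_cause_alerts(failing: Set[str], parent: Dict[str, Optional[str]]) -> List[str]:
    """Return only the failing checks that have no failing ancestor.

    `parent` maps each check to the check it depends on (None for top-level
    checks). Both the mapping and the check names are hypothetical.
    """
    roots = []
    for check in sorted(failing):
        ancestor = parent.get(check)
        seen = set()  # guards against accidental cycles in the mapping
        suppressed = False
        while ancestor is not None and ancestor not in seen:
            if ancestor in failing:
                suppressed = True  # a dependency is already down; stay quiet
                break
            seen.add(ancestor)
            ancestor = parent.get(ancestor)
        if not suppressed:
            roots.append(check)
    return roots


# Hypothetical topology: three web frontends that all depend on one database.
parent = {"sql-db": None, "web-1": "sql-db", "web-2": "sql-db", "web-3": "sql-db"}
failing = {"sql-db", "web-1", "web-2", "web-3"}
alerts = root_cause_alerts(failing, parent)
```

With everything failing, only the database check survives the filter. If a single frontend fails while the database is healthy, that frontend is reported on its own.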
Nagging: An alert should go to one person, in a way that they are sure to see it: bypassing DND settings, by text, notification, or call. That person should either acknowledge it and work it, or the next person in the on-call rotation should get nagged, until someone owns it. I used to have an Android app that I wrote that would ring my phone on loud every 15 seconds if I had a missed call or text, for example. These days, I rely more on pings every 15 minutes if the alert is not acked or resolved.
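The nag-and-escalate behavior can be sketched as a small pure function, again in Python. The 15-minute interval comes from the setup described above. The number of nags each person gets before escalation (`nags_per_person`) and the wrap-around to the start of the rotation are assumptions of mine, and the function name and people are hypothetical.

```python
from typing import List, Optional


def who_to_nag(
    rotation: List[str],
    minutes_open: int,
    acked: bool,
    nag_interval: int = 15,    # minutes between pings
    nags_per_person: int = 2,  # assumed: pings per person before escalating
) -> Optional[str]:
    """Return who should receive the next ping, or None if nobody should."""
    if acked or not rotation:
        return None  # someone owns it (or there is nobody to page)
    nags_sent = minutes_open // nag_interval  # pings already sent before this one
    step = nags_sent // nags_per_person       # how many times we have escalated
    # Assumed policy: wrap back to the first person once the rotation runs out.
    return rotation[step % len(rotation)]


rotation = ["alice", "bob"]  # hypothetical on-call order
first = who_to_nag(rotation, minutes_open=0, acked=False)
escalated = who_to_nag(rotation, minutes_open=30, acked=False)
```

Keeping this as a function of elapsed time and ack state, with no timers inside it, makes the policy easy to test. A scheduler would call it on each tick and deliver the ping over whatever channel bypasses DND.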