HACKER Q&A
📣 serverlessmom

How often do you run heartbeat checks?


Call them synthetic user tests, call them 'pingers,' call them what you will. What I want to know is how often you run these checks: every minute, every five minutes, every 12 hours?

Are you running different regions as well, to check your availability from multiple places?

My cheapness pushes me to check only every 15-20 minutes and, ideally, to rotate geography: check 1 fires from EMEA, check 2 from LATAM, and so on, so that every geo is checked once an hour. But then I think about my boss calling me and saying 'we were down for all our German users for 45 minutes, why didn't we detect this?'
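To put a number on that worry, here is a rough sketch of the gap a rotating schedule leaves per region. The function name and the four-region figure are assumptions for illustration; the 15-minute interval is the low end of the range above.

```python
def regional_gap_minutes(check_interval_min, num_regions):
    """Minutes between two consecutive checks from the same region
    when a single check rotates round-robin through all regions."""
    return check_interval_min * num_regions


# Assumed setup: one check every 15 minutes, rotating over 4 regions.
gap = regional_gap_minutes(15, 4)
print(gap)  # 60
```

With a gap that long, an outage that hits only one region can last 45 minutes and fall entirely between two checks from that region.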

Changes in these settings have major effects on billing, with a 'few times a day' costing basically nothing, and an 'every five minutes, every region' check costing up to $10k a month.
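A quick way to see why the bill swings so much is to count check runs per month. This is only a sketch: the 30-day month, the six-region count, and the every-eight-hours schedule are assumptions, and actual pricing per run depends on the vendor.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day month


def runs_per_month(interval_min, regions=1):
    """Number of check runs per month for one check at a fixed interval,
    fired from every listed region each time."""
    return MINUTES_PER_MONTH // interval_min * regions


# Hypothetical comparison: every 5 minutes from 6 regions vs. every 8 hours from one.
frequent = runs_per_month(5, regions=6)
sparse = runs_per_month(8 * 60)
print(frequent, sparse)  # 51840 90
```

Multiply by the number of distinct checks and the per-run price to get a monthly estimate.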

I'd like to know what settings you're using and, if you don't mind sharing, what industry you work in. In my own experience, fintech has very different expectations from e-commerce.


  👤 andenacitelli Accepted Answer ✓
We run our most important ones (main API endpoint health check, homepage browser check) anywhere from every minute to every five minutes depending on time of day. We only use one region because our users are mostly in the US. Less vital but still important ones run every fifteen minutes or so.

We overlap “peak” and “non-peak” schedules. The peak one runs every minute during work hours. The “non-peak” one runs 24/7, but less frequently, every five minutes (there's no easy way to exclude the periods the peak schedule already covers, otherwise we would).
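As a rough sketch of what overlapping schedules cost in runs per day: the nine-hour workday below is an assumption, and the function name is made up for illustration.

```python
def daily_runs(peak_hours, peak_interval_min, base_interval_min):
    """Runs per day for one check with a peak schedule layered on top of
    an always-on base schedule (the two overlap during peak hours)."""
    peak = peak_hours * 60 // peak_interval_min
    base = 24 * 60 // base_interval_min
    return peak + base


# Assumed 9-hour workday: every minute at peak, every 5 minutes around the clock.
print(daily_runs(9, 1, 5))  # 828
```

Of those runs, 108 are base-schedule runs that land inside peak hours, which is the overhead of not being able to exclude that window.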

We use Datadog, but try and be fairly barebones about it because it’s really expensive at scale and we only rarely actually catch issues with them because our CI process has gotten much better. We also used Checkly in the past and had no real complaints, just switched to Datadog to keep everything in one place.


👤 sk11001
> Call them Synthetic user tests, call them 'pingers,' call them what you will

Can you tell us what you mean more specifically? Is this something like a Kubernetes readiness probe, which you can run every few seconds, or something more complex that interacts with multiple systems?


👤 breckenedge
Look at it another way: How much do you stand to lose given X downtime? Spend what it takes to know when you’ve crossed that threshold.
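One way to turn that into a check interval, as a minimal sketch: treat the interval as the worst-case time an outage goes unnoticed. The dollar figures are made-up placeholders, not real data.

```python
def max_interval_minutes(loss_per_minute, tolerable_loss):
    """Longest check interval whose worst-case undetected downtime
    (one full interval) stays within the loss you are willing to eat."""
    return tolerable_loss / loss_per_minute


# Hypothetical: downtime costs 200 per minute, and 1000 is the pain threshold.
print(max_interval_minutes(200, 1000))  # 5.0
```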

👤 loeber
Every minute. I like betteruptime.com.