Are you running different regions as well, to check your availability from multiple places?
Being cheap, I'm inclined to check only every 15-20 minutes and rotate geography: check 1 fires from EMEA, check 2 from LATAM, and so on, so that every geo is checked once an hour. But then I think about my boss calling me and saying 'we were down for all our German users for 45 minutes, why didn't we detect this?'
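To put a number on that worry, here is a rough sketch of the rotation described above. The four-region list, the 15-minute interval, and the function names are illustrative assumptions, not anyone's real configuration. It shows why a single-region outage can go unnoticed for close to a full rotation.

```python
from itertools import cycle

def rotation_schedule(regions, interval_min, horizon_min):
    """Return (minute, region) pairs for a round-robin check rotation."""
    ticks = range(0, horizon_min, interval_min)
    # zip stops when the finite range of ticks is exhausted
    return list(zip(ticks, cycle(regions)))

def worst_case_detection_min(regions, interval_min):
    """Upper bound on how long a single-region outage can go unseen.

    A region is revisited once per full rotation, so an outage that
    starts right after its check waits almost len(regions) intervals.
    """
    return len(regions) * interval_min

# Hypothetical region list; only EMEA and LATAM are named in the post.
REGIONS = ["EMEA", "LATAM", "APAC", "NA"]

schedule = rotation_schedule(REGIONS, interval_min=15, horizon_min=60)
worst_case = worst_case_detection_min(REGIONS, interval_min=15)
print(schedule)
print(worst_case)
```

Under these assumptions a 45-minute regional outage can fall entirely between two visits to that region, which is exactly the phone call scenario. Any "alert after N consecutive failures" rule would stretch the gap further.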
Changes to these settings have a major effect on billing: a 'few times a day' check costs basically nothing, while an 'every five minutes, every region' check can cost up to $10k a month.
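For a feel of how fast the run count grows, a back-of-the-envelope sketch follows. It assumes billing scales with the number of check runs, a 30-day month, and a per-run price you would have to fill in from your own vendor's pricing; the function names and the ten-region count are made up for illustration.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day month

def monthly_runs(num_checks, num_regions, interval_min):
    """Number of check executions per month across all regions."""
    return num_checks * num_regions * (MINUTES_PER_MONTH // interval_min)

def monthly_cost(num_checks, num_regions, interval_min, price_per_run):
    """Estimated monthly bill; price_per_run is a placeholder you supply."""
    return monthly_runs(num_checks, num_regions, interval_min) * price_per_run

# One check, one region, every six hours ("a few times a day")
sparse = monthly_runs(num_checks=1, num_regions=1, interval_min=360)
# One check, ten hypothetical regions, every five minutes
dense = monthly_runs(num_checks=1, num_regions=10, interval_min=5)
print(sparse, dense, dense // sparse)
```

The dense setup runs several hundred times more often than the sparse one for a single check, before multiplying by however many checks you have.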
I'd like to know what settings you're using, and if you don't mind sharing what industry you work in. In my own experience fintech has way different expectations from e-commerce.
We overlap “peak” and “non-peak” schedules. The peak one runs every minute during work hours. The “non-peak” ones run 24/7, but less frequently, every five minutes. (There’s no easy way to exclude the hours the peak schedule already covers; otherwise we’d exclude them.)
We use Datadog, but try to keep it fairly barebones because it’s really expensive at scale, and the checks only rarely catch issues now that our CI process has gotten much better. We also used Checkly in the past and had no real complaints; we just switched to Datadog to keep everything in one place.
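As a rough illustration of what that overlap means in run counts, here is a small sketch. The work-hours window (weekdays, 09:00 to 17:00) and the function name are assumptions; the one-minute and five-minute intervals come from the description above.

```python
from datetime import datetime

PEAK_START_HOUR = 9    # assumed start of work hours
PEAK_END_HOUR = 17     # assumed end of work hours (exclusive)
PEAK_INTERVAL_MIN = 1
OFF_PEAK_INTERVAL_MIN = 5

def runs_per_hour(ts: datetime) -> int:
    """Check runs scheduled in the hour containing ts, with both schedules overlapping."""
    runs = 60 // OFF_PEAK_INTERVAL_MIN  # the 24/7 schedule always runs
    is_weekday = ts.weekday() < 5       # Monday is 0
    if is_weekday and PEAK_START_HOUR <= ts.hour < PEAK_END_HOUR:
        runs += 60 // PEAK_INTERVAL_MIN  # peak schedule stacks on top
    return runs

print(runs_per_hour(datetime(2024, 1, 8, 10)))  # a Monday morning
print(runs_per_hour(datetime(2024, 1, 8, 22)))  # the same Monday, late evening
```

During the assumed peak window the five-minute runs are redundant with the one-minute ones, so the overlap costs about 12 extra runs per hour per check and region.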
Can you tell us more specifically what you mean? Is this something like a Kubernetes readiness probe, which you can run every few seconds, or something more complex that interacts with multiple systems?