HACKER Q&A
📣 lovelearning

Observability pattern to reduce mismatches between reality and status page?


Reddit has been non-functional for about 1 1/2 hours. It keeps showing "Our CDN was unable to reach our servers".

However, redditstatus.com shows no incident reports and shows the error rate as having dropped.

Because it's just Reddit, it doesn't matter.

But it would look bad if this were a commercial SaaS. It comes across as either shoddy or dishonest.

Every SaaS today depends on lots of 3rd party infra - CDN, DNS, AWS, payment gateways, search engines, and more.

My question: Are there any architectural patterns in monitoring / observability to reduce such mismatches between reality from the user's perspective and status pages from the service's perspective?


  👤 anoophallur Accepted Answer ✓
One pattern I've seen in use is to run synthetic tests (we have a tool similar to New Relic's synthetic testing suite) from multiple DCs/regions.

The trick in this pattern is to reduce false alarms, since a lot of tests can get flaky (e.g. due to temporary network congestion). Failures are retried with exponential backoff, and the retries give up after ~10 minutes, at which point the status is shown as down. This is what's shown on the external status page.
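A rough sketch of that retry loop in Python, in case it helps. The function name, the 5-second first delay and the doubling factor are made up for illustration; only the ~10 minute budget comes from the description above, and our actual tool may well differ.

```python
import time
from typing import Callable


def probe_with_backoff(
    check: Callable[[], bool],
    budget_s: float = 600.0,       # ~10 minutes of retrying before giving up
    first_delay_s: float = 5.0,    # assumed starting delay
    factor: float = 2.0,           # assumed growth factor per retry
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "up" as soon as a check passes, "down" once the retry budget is spent."""
    waited = 0.0
    delay = first_delay_s
    while True:
        if check():
            return "up"
        if waited >= budget_s:
            # Still failing after the whole budget: treat it as a real outage.
            return "down"
        # Never sleep past the end of the budget.
        pause = min(delay, budget_s - waited)
        sleep(pause)
        waited += pause
        delay *= factor
```

A single flaky failure just costs one short retry, while a sustained failure flips the status to down once the budget runs out. Passing `sleep` as a parameter is only there so the loop can be exercised without actually waiting.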

Internally, this metric (availability) is broken down into availability with dependencies (external) and availability without dependencies (internal). When either one drops, the on-call engineer gets alerted. If it's down due to a dependency, we pass it on to the dependency's owner to fix while we work on backups. At this point it's a manual break-glass fix and multiple teams are alerted.
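To make the split concrete, here is a minimal sketch of how the two numbers could be computed from synthetic probe results. It assumes each failed probe can be attributed either to a named dependency or to our own service; `ProbeResult`, `failed_dependency` and `availability` are hypothetical names, not part of any real tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ProbeResult:
    ok: bool
    # Name of the 3rd party dependency blamed for a failure (e.g. "cdn").
    # None when the probe passed or when the failure is our own.
    failed_dependency: Optional[str] = None


def availability(results: List[ProbeResult]) -> Dict[str, float]:
    """Compute availability with (external) and without (internal) dependencies."""
    if not results:
        raise ValueError("need at least one probe result")
    total = len(results)
    # External: what the user sees, so every failure counts.
    external_ok = sum(r.ok for r in results)
    # Internal: failures blamed on a dependency do not count against us.
    internal_ok = sum(r.ok or r.failed_dependency is not None for r in results)
    return {"external": external_ok / total, "internal": internal_ok / total}
```

With two passing probes, one CDN failure and one failure of our own, this gives 0.5 external and 0.75 internal. A gap between the two numbers points at a dependency, while both dropping together points at us.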

For internal availability loss, we follow a similar procedure (i.e. a manual fix), but we identify the fix ourselves and get it out ASAP.

I've seen it happen only once in the last year, and that was due to an internal bug. But yeah, given how much 3rd party infra is involved, this is a tricky problem to solve.

Obviously not perfect