HACKER Q&A
📣 fasteo

How do you measure and report SLA to customers?


I'm curious to know how you all measure and report SLA metrics to customers. We run an SMS gateway for critical services (2FA, OTP and the like) and we measure our SLA metrics with a bunch of Python and Bash scripts. Reporting to customers is just sending them an email with an Excel attachment containing the computed SLA for a given period (usually monthly).

Are there any frameworks or tools you use to measure and report SLA performance to customers? Some kind of SLA dashboard?

Note that we use site24x7 for monitoring our services. They do have some SLA reports and dashboards, but I am thinking of dedicated tools that just ingest metrics and compute SLAs.

Any insights you can provide will be greatly appreciated!


  👤 hayst4ck Accepted Answer ✓
You might check out the Google SRE book (you should probably read the whole thing, given that you're asking this question): https://sre.google/sre-book/service-level-objectives/

I would consider one of three approaches (well, ideally all three):

1. Use an external service that measures your website's uptime. You should have this anyway, as a backup alerting system in case your own monitoring infrastructure fails.

2. Use a time series database (like Prometheus or InfluxDB), preferably one that gets its data from as close to your edge as possible, and use Grafana or something like it to make graphs. (Vulnerable to reporting/collection failures.)

3. Ingest events into a data warehouse via a durable queue (Kafka, SQS, or similar), then write queries that output reports. (Most accurate, most effort.)
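
To make option 3 a bit more concrete, here is a minimal sketch of the "write queries that output reports" step. It is only an illustration under assumptions: an in-memory SQLite database stands in for the warehouse, and the `events` table, its columns (`customer`, `sent_at`, `ok`) and the sample rows are all hypothetical. A real warehouse will have its own schema and date functions.

```python
import sqlite3


def monthly_sla(conn):
    """Return {(customer, 'YYYY-MM'): percentage of events that succeeded}."""
    rows = conn.execute(
        """
        SELECT customer,
               strftime('%Y-%m', sent_at) AS month,
               100.0 * SUM(ok) / COUNT(*) AS pct_ok
        FROM events
        GROUP BY customer, month
        ORDER BY customer, month
        """
    ).fetchall()
    return {(customer, month): round(pct, 3) for customer, month, pct in rows}


def demo_connection():
    """Build a throwaway database with made-up events (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (customer TEXT, sent_at TEXT, ok INTEGER)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [
            ("acme", "2024-01-03T10:00:00", 1),
            ("acme", "2024-01-03T10:00:05", 1),
            ("acme", "2024-01-04T11:00:00", 0),  # one failed event in January
            ("acme", "2024-02-01T09:00:00", 1),
            ("globex", "2024-01-10T12:00:00", 1),
        ],
    )
    return conn


if __name__ == "__main__":
    for (customer, month), pct in monthly_sla(demo_connection()).items():
        print(f"{customer} {month}: {pct}%")
```

What counts as `ok` (accepted by the gateway, delivered to the handset, delivered within N seconds) is whatever your SLA text says, so that definition belongs in the ingestion step, not in the report query.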

For a normal website without anything fancy, taking the logs from nginx (or whatever does your load balancing/SSL termination), ingesting them into a database, and then cronning a Python script that runs an SLA query and makes it all pretty for an e-mail seems like a fine way to calculate an SLA.
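
A rough sketch of the log-based calculation, skipping the database and reading log lines directly. It assumes nginx's default "combined" access log format and treats any 5xx response as a failure while ignoring 4xx as client errors; both are assumptions, and the function name and sample lines are made up for illustration.

```python
import re
from datetime import datetime

# Matches the timestamp, request line and status code of a "combined" log line.
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d{3}) ')


def availability(lines, year, month):
    """Percentage of requests in the given month that did not return 5xx.

    Returns None when no request falls inside the month.
    """
    total = failed = 0
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        if (ts.year, ts.month) != (year, month):
            continue
        total += 1
        if int(match.group("status")) >= 500:
            failed += 1
    if total == 0:
        return None
    return 100.0 * (total - failed) / total


# Hypothetical sample data, not real traffic.
SAMPLE = [
    '10.0.0.1 - - [10/Jan/2024:12:00:00 +0000] "POST /send HTTP/1.1" 200 51 "-" "curl/8.0"',
    '10.0.0.1 - - [10/Jan/2024:12:00:01 +0000] "POST /send HTTP/1.1" 502 0 "-" "curl/8.0"',
    '10.0.0.2 - - [11/Jan/2024:08:30:00 +0000] "POST /send HTTP/1.1" 200 51 "-" "curl/8.0"',
    '10.0.0.2 - - [11/Jan/2024:08:30:02 +0000] "POST /send HTTP/1.1" 404 0 "-" "curl/8.0"',
    '10.0.0.3 - - [01/Feb/2024:00:00:01 +0000] "POST /send HTTP/1.1" 200 51 "-" "curl/8.0"',
]

if __name__ == "__main__":
    print(availability(SAMPLE, 2024, 1))
```

In a cron job you would pass an open log file instead of `SAMPLE`, e.g. `availability(open(path), 2024, 1)`, and feed the result into whatever builds the e-mail.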

Option 2 is industry standard operations. Option 2 is good enough for paging on, but probably not good enough for a legal relationship. Option 3 is probably as good as you can get for a legal agreement.

Of course, what's important to consider is that a customer might be able to measure the SLA themselves, and you should probably have a good answer for any differences between their numbers and yours.


👤 coreyp_1
Well, you could use your scripts to put the KPIs into a table, and then use a BI tool for your SLA reporting, but that will more than likely get quite expensive.

BI tools such as Looker, Tableau, Jaspersoft, etc., get very pricey very quickly.

If what you are doing is working, then stick with that, unless you absolutely need something more sophisticated.