Are there any frameworks or tools you use to measure and report SLA performance to customers? Some kind of SLA dashboard.
Note that we use Site24x7 for monitoring our services. It does have some SLA reports and a dashboard, but I am thinking of dedicated tools that just ingest metrics and compute SLAs.
Any insights you can provide will be greatly appreciated!
I would consider one of three approaches (or, ideally, all three):
1. Use an external service that measures your website's uptime. You should have this anyway, as a backup alerting system in case your own monitoring infrastructure fails.
2. Use a time series database (like Prometheus or InfluxDB), preferably one that gets its data from as close to your edge as possible, and use Grafana or something like it to make graphs. (Vulnerable to reporting/collection failures.)
3. Ingest events into a data warehouse via Kafka, SQS, or a similar durable queue, and then write queries that output reports. (Most accurate, most effort.)
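To make the "vulnerable to reporting/collection failures" caveat on options 1 and 2 concrete, here is a rough sketch of how you might turn periodic up/down probe samples into an availability number while also tracking how many samples are missing. The `Probe` type, the `summarize` function, and the 60-second interval are all made up for illustration; this is not how any particular monitoring product computes it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Probe:
    """One uptime check result (hypothetical shape, not a real tool's schema)."""
    timestamp: int  # Unix seconds
    ok: bool        # True if the check succeeded


def summarize(
    probes: List[Probe], window_start: int, window_end: int, interval: int
) -> Tuple[Optional[float], Optional[float]]:
    """Return (availability, coverage) for the half-open window [start, end).

    availability: successful probes / observed probes
    coverage:     observed probes / probes expected at the given interval
    """
    expected = (window_end - window_start) // interval
    in_window = [p for p in probes if window_start <= p.timestamp < window_end]
    observed = len(in_window)
    up = sum(1 for p in in_window if p.ok)
    availability = up / observed if observed else None
    coverage = observed / expected if expected else None
    return availability, coverage


# Ten minutes of 60-second checks: the sample at t=300 was never collected,
# and the check at t=120 failed.
samples = [Probe(t, ok=(t != 120)) for t in range(0, 600, 60) if t != 300]
availability, coverage = summarize(samples, 0, 600, 60)
print(f"availability={availability:.4f} coverage={coverage:.2f}")
```

If coverage drops well below 1.0, the availability figure is based on incomplete data, which is exactly the case where you would want a second source to cross-check against.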
For a normal website without anything fancy, taking the logs from nginx (or whatever does your load balancing/SSL termination), ingesting them into a database, and then cronning a Python script that runs an SLA query and makes it all pretty for an e-mail seems like a fine way to calculate an SLA.
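As a minimal sketch of that log-based approach: the script below parses lines in nginx's default "combined" log format, loads them into SQLite, and runs a single query. It assumes that "available" means "the request did not return a 5xx status", which is only one possible SLA definition; the table name, function names, and sample log lines are invented for the example, and a real setup would read the actual log files and use a persistent database.

```python
import re
import sqlite3

# Pulls the timestamp and status code out of an nginx "combined" format line:
# $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent ...
LOG_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d{3}) ')


def load_logs(conn, lines):
    """Insert parsed log lines into a `requests` table; return rows inserted."""
    conn.execute("CREATE TABLE IF NOT EXISTS requests (ts TEXT, status INTEGER)")
    rows = []
    for line in lines:
        match = LOG_RE.search(line)
        if match:  # silently skip lines that do not look like access log entries
            rows.append((match.group("ts"), int(match.group("status"))))
    conn.executemany("INSERT INTO requests VALUES (?, ?)", rows)
    return len(rows)


def availability(conn):
    """Fraction of requests that did not fail with a 5xx, or None if no data."""
    total, failed = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(status >= 500), 0) FROM requests"
    ).fetchone()
    if total == 0:
        return None
    return 1.0 - failed / total


# Invented sample lines, for illustration only.
SAMPLE = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/8.0"',
    '203.0.113.5 - - [10/Oct/2023:13:55:37 +0000] "GET /api HTTP/1.1" 503 0 "-" "curl/8.0"',
    '203.0.113.9 - - [10/Oct/2023:13:55:38 +0000] "GET /missing HTTP/1.1" 404 153 "-" "curl/8.0"',
    '203.0.113.9 - - [10/Oct/2023:13:55:39 +0000] "POST /api HTTP/1.1" 201 27 "-" "curl/8.0"',
]

conn = sqlite3.connect(":memory:")
loaded = load_logs(conn, SAMPLE)
print(f"{loaded} requests, availability {availability(conn):.2%}")
```

Note that a request-based figure like this says nothing about periods when no requests reached the load balancer at all, so it is worth pairing with an external uptime check.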
Option 2 is industry-standard operations practice. It is good enough for paging on, but probably not good enough for a legal relationship. Option 3 is probably as good as you can get for a legal agreement.
Of course, what's important to consider is that a customer might be able to measure the SLA themselves, and you should have a good answer for any differences between their numbers and yours.
BI tools such as Looker, Tableau, Jaspersoft, etc., get very pricey very quickly.
If what you are doing is working, then stick with that, unless you absolutely need something more sophisticated.