HACKER Q&A
📣 antocv

DevOps, why do people still use Grafana/Prometheus etc.?


What is the point of "monitoring", setting up a fancy dashboard showing some graphs of some time-series data?

I've seen this used only to impress management and to get a "Star Trek-y" look in the office. But nobody actually stands and looks at a graph as their day job, nor should they. Nor do alarms go out to people from the Grafana dashboard.

Here is the thing: if you can have alarms go out when something is wrong, and you have that, then why do you need to see it on a graph?

I really don't see the point of "monitoring solutions" when any actionable event (even if it is generated by "interpreting time-series data with 'machine learning'") can just be an actionable event without showing stuff on a dashboard.

Enlighten me, DevOps monitoring folks, please?


  👤 core-questions Accepted Answer ✓
> But nobody actually stands and looks at a graph as their day job, nor should they.

That's very interesting. Where I work, looking at these graphs is part of our day-to-day duties: we need to know how our systems are doing, and alerting conditions have not yet been defined to cover every possible thing that could go wrong. Typically, the graphs help us spot conditions we need to watch for, which are often a combination of multiple things happening at the same time that requires human action to resolve.

Graphs also help us on the business side because we can see what happens to the utilization of our service after various marketing efforts, launches, promotions, etc., and while there are BI tools to do this, they often suck in many ways compared to Grafana, so it's usually a better bet to just stick with the place where the data can be viewed idiomatically.

Last but not least, without the ability to look at graphs, how do you know all of your monitoring is working and configured well? It's not enough to just have blind faith in the system; you need to check every so often to make sure things are flowing well.

It's not about it being a fancy Star Trek dash, but damn if that doesn't impress management anyway.


👤 a-saleh
I have been on L2 and L3 on-call duty at my previous company, and at times these dashboards were life-savers.

Is the queue growing? Is a certain node not processing all it should? Etc.

Alerts actually did go out of Prometheus.

Hindsight is 20/20, and I remember, for example, having to change alerting from the median aggregate node memory spiking to any individual node spiking. Seems obvious in retrospect, but with only the alert and no graph to look at, I don't know that we would have caught it.
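
To make that concrete, a toy illustration (plain Python, the numbers are invented): a median across nodes can look perfectly healthy while a single node is about to fall over, which is exactly what a per-node graph makes visible at a glance.

    # Toy numbers, invented for illustration: a median-based rule stays quiet
    # while one node is about to fall over.
    from statistics import median

    node_memory_pct = {
        "node-a": 41.0,
        "node-b": 38.5,
        "node-c": 44.2,
        "node-d": 96.7,  # the one that's about to OOM
    }

    print("median:", median(node_memory_pct.values()))  # 42.6 -> a median-based alert stays green
    print("max:   ", max(node_memory_pct.values()))     # 96.7 -> a per-node alert (or a glance at the graph) catches it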

And there were parts, in the more complicated stretches of our data pipeline, where things weren't obvious even in retrospect, and that is where Grafana really shines.


👤 ahpearce
As others have said, there may not be an 'event'. Some metrics need to be monitored manually before you can set up an event to trigger on. Sure, you might have engineers analyzing the time-series data, but you also need to keep your systems up. There are multiple failure modes for various services that may require different actions.

For example, perhaps you have some poorly written legacy service that has a memory leak. Let's say, for the sake of argument, that any boolean indicator (e.g. checking whether the process is running) will give you an 'Okay' or green. You are still probably interested in monitoring the memory usage to make sure the service is operating correctly and/or performing well. After monitoring for some time, maybe you determine that your ops engineers take some action whenever the memory gets to around 80% or so... then you can set up the trigger event. But without that manual monitoring up front, you can't just magically set that threshold.
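
As a rough sketch of what that looks like with Prometheus in the mix (the metric name, job label, and server URL below are made up for illustration): you eyeball the metric in Grafana for a while, sanity-check the candidate threshold with an ad-hoc query like this, and only then bake the same expression into an alerting rule.

    # Sketch: sanity-check a memory threshold against Prometheus before turning
    # it into an alerting rule. Metric name, job label and URL are assumptions.
    import requests

    PROM_URL = "http://prometheus.example.internal:9090/api/v1/query"
    # Roughly "RSS of the legacy service as a fraction of a 4 GiB box".
    QUERY = 'process_resident_memory_bytes{job="legacy-service"} / (4 * 1024 * 1024 * 1024)'

    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()

    for series in resp.json()["data"]["result"]:
        instance = series["metric"].get("instance", "unknown")
        fraction = float(series["value"][1])
        if fraction > 0.8:  # the threshold you settled on by watching the graph
            print(f"{instance}: memory at {fraction:.0%} - time for whatever ops does by hand today")

Once the 80%-ish number proves itself, the same expression (plus a "for" duration so it has to hold for a while) becomes the Prometheus alerting rule, and the manual watching stops.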


👤 2rsf
> if you can have alarms go out when something is wrong, and you have that, then why do you need to see it on a graph?

Because the trend that led to the alarm might be important and only visible on a graph, for example: when did the queue start growing?

Because you want to fine-tune your alarms, or handle near-misses; again, those are best visible on a graph.

Or because you want to quickly see some high-level behavior of the system: number of users per hour of the day, errors vs. number of users, etc.


👤 PaulHoule
It depends on what you are debugging.

I have a smart home project at home, and I frequently learn that this is something normal people will fail at.

I have a Sengled switch that connects to a SmartThings hub, which calls a Lambda function, which posts a message to an SQS queue that my home server drains and pushes into RabbitMQ.

I found that if I didn't use the switch for a while (say, hours), I would push it and wait 20 seconds or more for the light to turn on (maddening, because you might not have faith that it will change, which makes you push the button again and send more events...). It was reliable, but slow.

I got timestamps from as many parts of the system as I could, made graphs, and that led me to add a heartbeat that kept the Lambda and queue active, and also to switch from a FIFO queue to a standard queue. Between those two steps, the time from SmartThings to activation is in the 200-300 ms range, and with the light configured to turn on instantly instead of fading, it feels responsive.

Note, though, that I was not using Grafana or a tool like that; rather, I was working with Jupyter and Pandas. After the system has run for a few weeks I might be able to do a detailed analysis of the tail latency, but it's not a "dashboard" I run over and over again unless the problem recurs.
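
If anyone is curious, the analysis was roughly this shape (a minimal Pandas sketch; the CSV and its column names are placeholders for wherever you collected the per-hop timestamps):

    # Minimal sketch of the Jupyter/Pandas style of analysis described above.
    # One row per button press, with a timestamp recorded at each hop (names assumed).
    import pandas as pd

    df = pd.read_csv(
        "button_events.csv",
        parse_dates=["pressed_at", "lambda_at", "sqs_at", "rabbitmq_at"],
    )

    df["total_s"] = (df["rabbitmq_at"] - df["pressed_at"]).dt.total_seconds()
    df["lambda_s"] = (df["lambda_at"] - df["pressed_at"]).dt.total_seconds()
    df["queue_s"] = (df["sqs_at"] - df["lambda_at"]).dt.total_seconds()

    # The mean hides the "20 seconds after an idle period" behaviour; the tail shows it.
    print(df[["total_s", "lambda_s", "queue_s"]].describe(percentiles=[0.5, 0.95, 0.99]))
    print(df.nlargest(10, "total_s"))  # the worst presses, to see what they have in common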


👤 tannerbrockwell
Implementing monitoring of key metrics is a requirement for establishing SLOs. A dashboard may not be observed constantly, but if you have an incident, those key metrics had better be presented in a manner that shows real-time and historic performance. Prometheus and Grafana are prevalent because they are robust and mature implementations. You are correct that dashboards more often than not are there to show off this capability, but remember, I said that you MUST implement monitoring if you have SLOs.

Think of the dashboard as the cherry on the cake. There isn't much point to the cherry, but if you bought a cake, you'd better get a cake!

My biggest complaint about dashboarding is that it is easy to ignore some key components, such as the percentiles beyond the 95th, which are mostly a capacity-planning target. If you go looking, you will find that there are serious issues in serving your 96th-99th percentiles. If you are looking for something to improve, start there.
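
If your latency is exported as a Prometheus histogram, checking those higher percentiles is only a couple of queries away. A rough sketch (the metric name and server URL are assumptions; histogram_quantile over rate() of a _bucket series is the standard PromQL pattern for this):

    # Sketch: compare p95 against the higher percentiles that dashboards often omit.
    # The metric name and Prometheus URL are assumptions.
    import requests

    PROM_URL = "http://prometheus.example.internal:9090/api/v1/query"

    def latency_quantile(q: float) -> float:
        query = (
            f"histogram_quantile({q}, "
            'sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le))'
        )
        resp = requests.get(PROM_URL, params={"query": query}, timeout=10)
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        return float(result[0]["value"][1]) if result else float("nan")

    for q in (0.95, 0.96, 0.97, 0.98, 0.99):
        print(f"p{round(q * 100)}: {latency_quantile(q):.3f}s")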


👤 3minus1
Dashboards are great for correlation. After an alarm fires you check the dashboard and compare all the graphs at the time of the incident. It's a great way to get more information about what is and isn't broken.

👤 oftenwrong
If your alarms always go out when something is wrong, you do not need a graph.

However, alarms are never perfect. Issues can occur without triggering alarms at all. The graphs cover your blind spots; they help you debug the issues that you do not yet have reliable alarms for.

If an issue occurs without triggering an alarm, but the graphs help you debug the issue, then you should create an alarm that would have caught the issue. Next time, you will get the alarm, and will have a better idea of what is happening without having to look at the graphs.