And most of the time it's only when a customer calls me that I find out the API is down!
Maybe after some data changes, or some random code changes.
I'm now having nightmares about API services going down.
Do you continuously monitor your APIs and get notified when something goes down? (Not just monitoring a website with Pingdom, but checking the actual data in the responses, for example whether a particular JSON field exists in the response.)
1. A basic heartbeat that checks the service is responding. Used mainly for HTTP monitoring to make sure the route is up.
2. A deeper heartbeat that checks the service is up and validates that all database connections work and that I can actually get data back (usually a trivial query on a small table). Used as the primary detection for internal failures.
3. For any 3rd-party services I depend on, I set up a heartbeat endpoint that checks the service is up (though not necessarily that it's giving me good data). I usually group them, sometimes grouping and separating them under routes like /heartbeat/services, /heartbeat/service1, /heartbeat/service2. Sometimes you can validate that the service is returning good data, but it isn't always easy, so I do what I can.
4. I set up a 3rd-party service to monitor the heartbeats and their return codes, to validate that everything is up and returning what I expect, and to notify me if not. I don't need sophisticated response processing at the 3rd-party service because HTTP status codes cover it 99% of the time: the detailed response checking is done inside the heartbeat, which then produces a status code. And of course, any failure to respond at all shows up too.
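Here's a minimal sketch of what heartbeat endpoints along these lines could look like. It assumes a Flask app with psycopg2 and requests; the route names, DSN, and downstream URL are placeholders for illustration, not anything prescriptive:

```python
# Hypothetical heartbeat endpoints, roughly following the tiers above.
# Flask, psycopg2, and requests are assumed; swap in whatever your stack uses.
import psycopg2
import requests
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/heartbeat")
def heartbeat():
    # Tier 1: the process is up and the route responds.
    return jsonify(status="ok"), 200

@app.route("/heartbeat/db")
def heartbeat_db():
    # Tier 2: validate the database connection with a trivial query.
    try:
        conn = psycopg2.connect("dbname=app user=monitor")  # placeholder DSN
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                cur.fetchone()
        finally:
            conn.close()
        return jsonify(status="ok"), 200
    except Exception as exc:
        # Any failure maps to a non-2xx code so the external monitor
        # only has to look at the HTTP status.
        return jsonify(status="error", detail=str(exc)), 503

@app.route("/heartbeat/services")
def heartbeat_services():
    # Tier 3: check that 3rd-party dependencies respond at all.
    results = {}
    for name, url in {"service1": "https://api.example.com/ping"}.items():  # placeholders
        try:
            results[name] = requests.get(url, timeout=5).status_code < 500
        except requests.RequestException:
            results[name] = False
    code = 200 if all(results.values()) else 503
    return jsonify(results), code
```

The point is that all the detailed checking happens inside the endpoints, so the external monitor in step 4 only ever has to look at the HTTP status code.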
This still isn't perfect, but it has made sure we know before anyone else when something fails. I still have one product that hasn't been converted to this process, but we're migrating it to a new version that has these checks, so the faster that happens, the better I'll sleep.
One key thing: don't make the check interval too aggressive. The basic HTTP check runs often because the load balancers use it, but the others are spread out a lot more to avoid creating artificial load. When we build an independent service (microservice, etc.) I make sure it has these same checks, even if they aren't HTTP-based. Since they follow the same basic methodology, a service watcher can remove any instance from the registry after a configured number of failed checks and retries.
*edit a few words
I also tried a product recently that I really liked, https://checklyhq.com/ -- it gives you more advanced ways of vetting your API responses from multiple locations (along with tracking and averaging response times).
For black-box monitoring we just set up a prober that runs periodically, sends requests, and checks the responses against what we expect. Bonus points if you place multiple probers across the globe; that also exercises your load balancing and tests the geographic replication of your services.
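As a sketch, a prober like that can be as simple as a scheduled script; the URL and the expected JSON field here are made up for illustration:

```python
# A tiny black-box prober: hit the API and verify the response shape,
# not just that something answered. Run it from cron or a scheduler,
# ideally from several regions.
import sys
import requests

PROBE_URL = "https://api.example.com/v1/orders/health-sample"  # placeholder
EXPECTED_FIELD = "order_id"                                     # placeholder

def probe() -> bool:
    try:
        resp = requests.get(PROBE_URL, timeout=10)
    except requests.RequestException as exc:
        print(f"probe failed: {exc}")
        return False
    if resp.status_code != 200:
        print(f"unexpected status: {resp.status_code}")
        return False
    if EXPECTED_FIELD not in resp.json():
        print(f"missing field {EXPECTED_FIELD!r} in response")
        return False
    return True

if __name__ == "__main__":
    sys.exit(0 if probe() else 1)
```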
For white-box monitoring we instrumented the code itself to export events and metrics: application-level things like the metadata of each request and response, response status, time to generate the response, and internal errors encountered; system-level things like memory allocation and CPU time for the container; and dependencies like database query times and the durations and statuses of external requests. We used http://riemann.io/ to collect and process these streams and set up alerts. I find it really powerful to adopt the paradigm where streams of data are exported from your app and processed externally, though getting used to the stream-processing mentality can be something extra to learn.
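For a flavor of what that instrumentation can look like, here is a rough sketch of a request-timing decorator that exports Riemann-style events. It assumes the third-party bernhard Python client (whose send() takes a dict of event fields, as far as I remember); any other client or collector would slot in the same way, and the host and service names are placeholders:

```python
# Sketch: export per-request events from the app and let an external
# system (Riemann, in this case) do the aggregation and alerting.
import time
from functools import wraps

import bernhard  # third-party Riemann client; API assumed from its README

riemann = bernhard.Client(host="riemann.internal", port=5555)  # placeholder host

def instrumented(service_name):
    """Wrap a handler so its duration and outcome are exported as events."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            state = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                state = "error"
                raise
            finally:
                riemann.send({
                    "service": f"{service_name} latency",
                    "metric": time.monotonic() - start,
                    "state": state,
                    "tags": ["api", "request"],
                })
        return wrapper
    return decorator

@instrumented("orders.create")   # hypothetical endpoint name
def create_order(payload):
    ...  # the actual handler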
I create monitors for health issues, like watching for out-of-memory conditions or pod failures. I create monitors that compute the error rate and trend for each endpoint and alert if it crosses a threshold. Similarly, I'll create monitors for dead-letter queues, email send failures, or anything else that might go wrong in an app.
This may sound like a lot of monitors, but I try to log things in common ways, so a handful of monitors can watch hundreds of endpoints or queues.
Finally, for complicated mission-critical systems, I build in support for synthetic transactions that avoid undesired side effects. These may generate extra trace logs in the app. Such requests are submitted on a regular schedule, and the input and output are logged. Then I build more monitors on top of these logs.
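A sketch of what one of those synthetic transactions might look like, with made-up endpoint and header names; the key bits are the marker that tells the app to skip real side effects and the structured log line that the monitors consume:

```python
# Sketch of a scheduled synthetic transaction. The X-Synthetic header and
# the /orders endpoint are hypothetical; the app is expected to recognize
# the marker, avoid real side effects (emails, charges, ...), and emit
# extra trace logs for the request.
import json
import logging
import uuid

import requests

log = logging.getLogger("synthetic")

def run_synthetic_order_flow(base_url: str) -> None:
    trace_id = str(uuid.uuid4())
    payload = {"sku": "TEST-SKU", "quantity": 1}   # placeholder test data
    resp = requests.post(
        f"{base_url}/orders",
        json=payload,
        headers={"X-Synthetic": "true", "X-Trace-Id": trace_id},
        timeout=15,
    )
    # Log input and output in a structured form so a handful of monitors
    # can watch every synthetic flow the same way.
    log.info(json.dumps({
        "kind": "synthetic_transaction",
        "flow": "order",
        "trace_id": trace_id,
        "status": resp.status_code,
        "input": payload,
        "output": resp.text[:1000],   # truncated; monitors only need the shape
    }))
```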
Server-side you can track using the Measurement Protocol.
A few years ago I did a prototype test with PHP, example here: https://pastebin.com/PQCRcJXq
With something like Slim PHP you can add a middleware and automatically track everything, but you can also customize it on an as-needed basis.
I use the same logic with different programming languages on different backends, etc.
For a start it's cheap to implement and put in place, and it covers almost everything.
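The same idea in Python rather than PHP is roughly the sketch below, assuming Flask and the classic Measurement Protocol v1 endpoint and parameters (v, tid, cid, t, dp); the tracking ID is a placeholder:

```python
# Rough sketch: fire a Measurement Protocol hit for every request, the same
# idea as a Slim PHP middleware. Flask's after_request hook stands in for
# the middleware.
import uuid

import requests
from flask import Flask, request

app = Flask(__name__)

GA_ENDPOINT = "https://www.google-analytics.com/collect"
GA_TRACKING_ID = "UA-XXXXXXX-Y"  # placeholder

@app.after_request
def track_request(response):
    try:
        requests.post(GA_ENDPOINT, data={
            "v": "1",                  # protocol version
            "tid": GA_TRACKING_ID,     # property ID
            "cid": str(uuid.uuid4()),  # anonymous client ID
            "t": "pageview",           # hit type
            "dp": request.path,        # document path
        }, timeout=2)
    except requests.RequestException:
        pass  # never let tracking break the actual response
    return response
```

In practice you'd want to send the hit asynchronously (a queue or background thread) so tracking never adds latency to the real response.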
We also monitor the number of requests logged every minute, and if it drops by, say, 50%, we know something is down.
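As a sketch of that kind of check (where the per-minute counts come from is up to your log store; the numbers are placeholders):

```python
# Sketch: compare the request count for the latest minute against a recent
# baseline and alert if it drops by more than half. Fetching the counts
# from your log store is left out.
from statistics import mean

def should_alert(counts_per_minute: list[int], drop_threshold: float = 0.5) -> bool:
    """counts_per_minute: oldest-to-newest counts, e.g. the last 15 minutes."""
    if len(counts_per_minute) < 2:
        return False
    *history, latest = counts_per_minute
    baseline = mean(history)
    return baseline > 0 and latest < baseline * drop_threshold

# e.g. should_alert([120, 130, 118, 125, 52]) -> True
```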
This generally does require that your APIs have some idea of multi-tenancy, as you don't want your tests modifying some customer's data.
You can schedule test flows and validate responses as well.
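One simple way to keep those test flows away from real customer data (a sketch; the header names and tenant ID are made up) is to pin any request marked as a test to a dedicated test tenant:

```python
# Sketch: scheduled test flows can only ever touch a dedicated test tenant,
# never real customer data. Header and tenant ID are placeholders.
TEST_TENANT_ID = "tenant-synthetic-tests"

def resolve_tenant(headers: dict) -> str:
    tenant = headers.get("X-Tenant-Id", "")
    if headers.get("X-Synthetic-Test") == "true":
        # Synthetic traffic is always pinned to the test tenant,
        # regardless of what else the request claims.
        return TEST_TENANT_ID
    if tenant == TEST_TENANT_ID:
        raise PermissionError("test tenant is reserved for synthetic traffic")
    return tenant
```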
Edit: my cert is expired, I’ll fix this (now that I have a lot of time to spare:)