And most of the time it's only when a customer calls me that I find out the API is down!
Maybe after some data changes, or some random code changes.
I'm now having nightmares about API services going down.
Do you continuously monitor your APIs and get notified when something goes down? (Not just monitoring a website with Pingdom, but checking the actual data in the responses, for example whether a particular JSON field exists in the response.)
1. A basic heartbeat that checks the service is responding. Used mainly for HTTP monitoring to make sure the route is up.
2. A deeper heartbeat that checks the service is up and validates that all database connections work and that I can actually get data back (usually a trivial query on a small table). Used as the primary detection for internal failures.
3. For any 3rd-party services I depend on, I set up a heartbeat endpoint that checks the service is up (though not necessarily that it's giving me good data). I usually group them, sometimes grouping and separating them under routes like /heartbeat/services, /heartbeat/service1, /heartbeat/service2. Sometimes you can validate that the service is returning good data, but it isn't always easy, so I do what I can.
4. I set up a 3rd-party service to monitor the heartbeats and their return codes, to validate that everything is up and returning what I expect, and to notify me if not. I don't need sophisticated response processing at the 3rd-party service because HTTP status codes cover it 99% of the time: the detailed response checking is done inside the heartbeat, which then produces a status code. And of course, any failure to respond at all shows up too.
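Here's a minimal sketch of what heartbeat endpoints along these lines could look like. It assumes a Flask app with psycopg2 and requests; the route names, DSN, and downstream URL are placeholders for illustration, not anything prescriptive:

```python
# Hypothetical heartbeat endpoints, roughly following the tiers above.
# Flask, psycopg2, and requests are assumed; swap in whatever your stack uses.
import psycopg2
import requests
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/heartbeat")
def heartbeat():
    # Tier 1: the process is up and the route responds.
    return jsonify(status="ok"), 200

@app.route("/heartbeat/db")
def heartbeat_db():
    # Tier 2: validate the database connection with a trivial query.
    try:
        conn = psycopg2.connect("dbname=app user=monitor")  # placeholder DSN
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                cur.fetchone()
        finally:
            conn.close()
        return jsonify(status="ok"), 200
    except Exception as exc:
        # Any failure maps to a non-2xx code so the external monitor
        # only has to look at the HTTP status.
        return jsonify(status="error", detail=str(exc)), 503

@app.route("/heartbeat/services")
def heartbeat_services():
    # Tier 3: check that 3rd-party dependencies respond at all.
    results = {}
    for name, url in {"service1": "https://api.example.com/ping"}.items():  # placeholders
        try:
            results[name] = requests.get(url, timeout=5).status_code < 500
        except requests.RequestException:
            results[name] = False
    code = 200 if all(results.values()) else 503
    return jsonify(results), code
```

The point is that all the detailed checking happens inside the endpoints, so the external monitor in step 4 only ever has to look at the HTTP status code.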
This still isn't perfect, but it has made sure we know before anyone else when something fails. I still have one product that hasn't been converted to this process, but we're migrating it to a new version that has these checks, so the faster that happens, the better I'll sleep.
One key thing: don't make the check interval too aggressive. The basic HTTP check runs often because the load balancers use it, but the others are spread out a lot more to avoid creating artificial load. When we build an independent service (microservice, etc.) I make sure it has these same checks, even if they aren't HTTP-based. Since they follow the same basic methodology, a service watcher can remove any instance from the registry after a configured number of failed checks and retries.
*edit a few words
I also tried a product recently that I really liked, https://checklyhq.com/ -- it gives you more advanced ways of vetting your API responses from multiple locations (along with tracking and averaging response times).
For black-box monitoring we just set up a prober that runs periodically, sends requests, and checks the responses against what we expect. Bonus points if you place multiple probers across the globe; that also exercises your load balancing and tests the geographic replication of your services.
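As a sketch, a prober like that can be as simple as a scheduled script; the URL and the expected JSON field here are made up for illustration:

```python
# A tiny black-box prober: hit the API and verify the response shape,
# not just that something answered. Run it from cron or a scheduler,
# ideally from several regions.
import sys
import requests

PROBE_URL = "https://api.example.com/v1/orders/health-sample"  # placeholder
EXPECTED_FIELD = "order_id"                                     # placeholder

def probe() -> bool:
    try:
        resp = requests.get(PROBE_URL, timeout=10)
    except requests.RequestException as exc:
        print(f"probe failed: {exc}")
        return False
    if resp.status_code != 200:
        print(f"unexpected status: {resp.status_code}")
        return False
    if EXPECTED_FIELD not in resp.json():
        print(f"missing field {EXPECTED_FIELD!r} in response")
        return False
    return True

if __name__ == "__main__":
    sys.exit(0 if probe() else 1)
```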
For white-box monitoring we instrumented the code itself to export events and metrics: application-level things like the metadata of each request and response, response status, time to generate the response, and internal errors encountered; system-level things like memory allocation and CPU time for the container; and dependencies like database query times and the durations and statuses of external requests. We used http://riemann.io/ to collect and process these streams and set up alerts. I find it really powerful to adopt the paradigm where streams of data are exported from your app and processed externally, though getting used to the stream-processing mentality can be something extra to learn.
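For a flavor of what that instrumentation can look like, here is a rough sketch of a request-timing decorator that exports Riemann-style events. It assumes the third-party bernhard Python client (whose send() takes a dict of event fields, as far as I remember); any other client or collector would slot in the same way, and the host and service names are placeholders:

```python
# Sketch: export per-request events from the app and let an external
# system (Riemann, in this case) do the aggregation and alerting.
import time
from functools import wraps

import bernhard  # third-party Riemann client; API assumed from its README

riemann = bernhard.Client(host="riemann.internal", port=5555)  # placeholder host

def instrumented(service_name):
    """Wrap a handler so its duration and outcome are exported as events."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            state = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                state = "error"
                raise
            finally:
                riemann.send({
                    "service": f"{service_name} latency",
                    "metric": time.monotonic() - start,
                    "state": state,
                    "tags": ["api", "request"],
                })
        return wrapper
    return decorator

@instrumented("orders.create")   # hypothetical endpoint name
def create_order(payload):
    ...  # the actual handler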
I create monitors for health issues, like watching for out-of-memory conditions or pod failures. I create monitors that compute the error rate and trend for each endpoint and alert if it crosses a threshold. Similarly, I'll create monitors for dead-letter queues, email send failures, or anything else that might go wrong in an app.
This may sound like a lot of monitors, but I try to log things in common ways, so a handful of monitors can watch hundreds of endpoints or queues.
Finally, for complicated mission-critical systems, I build in support for synthetic transactions that avoid undesired side effects. These may generate extra trace logs in the app. Such requests are submitted on a regular schedule, and the input and output are logged. Then I build more monitors on top of these logs.
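A sketch of what one of those synthetic transactions might look like, with made-up endpoint and header names; the key bits are the marker that tells the app to skip real side effects and the structured log line that the monitors consume:

```python
# Sketch of a scheduled synthetic transaction. The X-Synthetic header and
# the /orders endpoint are hypothetical; the app is expected to recognize
# the marker, avoid real side effects (emails, charges, ...), and emit
# extra trace logs for the request.
import json
import logging
import uuid

import requests

log = logging.getLogger("synthetic")

def run_synthetic_order_flow(base_url: str) -> None:
    trace_id = str(uuid.uuid4())
    payload = {"sku": "TEST-SKU", "quantity": 1}   # placeholder test data
    resp = requests.post(
        f"{base_url}/orders",
        json=payload,
        headers={"X-Synthetic": "true", "X-Trace-Id": trace_id},
        timeout=15,
    )
    # Log input and output in a structured form so a handful of monitors
    # can watch every synthetic flow the same way.
    log.info(json.dumps({
        "kind": "synthetic_transaction",
        "flow": "order",
        "trace_id": trace_id,
        "status": resp.status_code,
        "input": payload,
        "output": resp.text[:1000],   # truncated; monitors only need the shape
    }))
```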
Server-side you can track using the Measurement Protocol.
A few years ago I did a prototype test with PHP, example here: https://pastebin.com/PQCRcJXq
With something like Slim PHP you can add a middleware and automatically track everything, but you can also customize it on an as-needed basis.
I use the same logic with different programming languages on different backends, etc.
For a start it's cheap to implement and put in place, and it covers almost everything.
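The same idea in Python rather than PHP is roughly the sketch below, assuming Flask and the classic Measurement Protocol v1 endpoint and parameters (v, tid, cid, t, dp); the tracking ID is a placeholder:

```python
# Rough sketch: fire a Measurement Protocol hit for every request, the same
# idea as a Slim PHP middleware. Flask's after_request hook stands in for
# the middleware.
import uuid

import requests
from flask import Flask, request

app = Flask(__name__)

GA_ENDPOINT = "https://www.google-analytics.com/collect"
GA_TRACKING_ID = "UA-XXXXXXX-Y"  # placeholder

@app.after_request
def track_request(response):
    try:
        requests.post(GA_ENDPOINT, data={
            "v": "1",                  # protocol version
            "tid": GA_TRACKING_ID,     # property ID
            "cid": str(uuid.uuid4()),  # anonymous client ID
            "t": "pageview",           # hit type
            "dp": request.path,        # document path
        }, timeout=2)
    except requests.RequestException:
        pass  # never let tracking break the actual response
    return response
```

In practice you'd want to send the hit asynchronously (a queue or background thread) so tracking never adds latency to the real response.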
We also monitor the number of requests logged every minute, and if it drops by, say, 50%, we know something is down.
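As a sketch of that kind of check (where the per-minute counts come from is up to your log store; the numbers are placeholders):

```python
# Sketch: compare the request count for the latest minute against a recent
# baseline and alert if it drops by more than half. Fetching the counts
# from your log store is left out.
from statistics import mean

def should_alert(counts_per_minute: list[int], drop_threshold: float = 0.5) -> bool:
    """counts_per_minute: oldest-to-newest counts, e.g. the last 15 minutes."""
    if len(counts_per_minute) < 2:
        return False
    *history, latest = counts_per_minute
    baseline = mean(history)
    return baseline > 0 and latest < baseline * drop_threshold

# e.g. should_alert([120, 130, 118, 125, 52]) -> True
```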
This generally does require that your APIs have some idea of multi-tenancy, as you don't want your tests modifying some customer's data.
You can schedule test flows and validate responses as well.
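One simple way to keep those test flows away from real customer data (a sketch; the header names and tenant ID are made up) is to pin any request marked as a test to a dedicated test tenant:

```python
# Sketch: scheduled test flows can only ever touch a dedicated test tenant,
# never real customer data. Header and tenant ID are placeholders.
TEST_TENANT_ID = "tenant-synthetic-tests"

def resolve_tenant(headers: dict) -> str:
    tenant = headers.get("X-Tenant-Id", "")
    if headers.get("X-Synthetic-Test") == "true":
        # Synthetic traffic is always pinned to the test tenant,
        # regardless of what else the request claims.
        return TEST_TENANT_ID
    if tenant == TEST_TENANT_ID:
        raise PermissionError("test tenant is reserved for synthetic traffic")
    return tenant
```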
Edit: my cert is expired, I’ll fix this (now that I have a lot of time to spare:)