Any books/resources on logging and monitoring?

Question

I've been running a few distributed web services, but apart from some rudimentary nginx access logs, I have no idea how they're functioning.

I have the following goals and questions about implementing a logging and monitoring system to get better insight into them:

- What are the best practices for instrumenting source code to collect general logs and exceptions?
- How do I determine whether the services and databases are performing efficiently? More specifically, what can I do to discover whether they are doing unnecessary work or whether there are any hotspots?
- Are the servers they run on overloaded? If so, what is overloading them?
- How do I know if someone is trying to break into the servers?
- How can I be alerted whenever one of the bad things mentioned above happens?

And then there is the business-logic side of things, like how many users are online, how many transactions are currently being processed, etc. I don't suppose directly querying the production database is a good idea.

My own research online surfaced a great many tools, like Prometheus, the ELK stack, Fluentd, Nagios, Bugsnag, New Relic, Datadog, etc., which overwhelmed me. I reckon that without a good understanding of logging and monitoring in general, I'm likely to pick the wrong tools.

This feels like a really big topic. Are there any books or resources with a comprehensive introduction?

sid- · Accepted Answer

A quick search turned up these interesting distributed-tracing projects:

https://dzone.com/articles/distributed-tracing-with-zipkin-a...
https://github.com/jaegertracing/jaeger

jdale27 · Answer

The Google SRE book (online here: https://landing.google.com/sre/sre-book/toc/index.html) might be useful. Specifically chapters 6 and 10 on monitoring and alerting.
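On the first bullet in the question (instrumenting code to collect general logs and exceptions), here is a minimal sketch of one common approach, assuming a Python service and only the standard library: emit one JSON object per log line so that whichever collector you end up choosing can parse it, and wrap interesting functions so their duration and any exception get recorded. The names `JsonFormatter` and `instrumented` are made up for illustration and are not from any of the tools mentioned in this thread.

```python
import json
import logging
import time
from functools import wraps


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Include the traceback text when the record carries an exception.
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        # Copy the extra fields attached by the `instrumented` decorator.
        for key in ("operation", "duration_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def instrumented(logger):
    """Hypothetical decorator: log the duration and any exception of a call."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
            except Exception:
                elapsed_ms = (time.perf_counter() - start) * 1000
                # logger.exception logs at ERROR level with the traceback.
                logger.exception(
                    "call failed",
                    extra={
                        "operation": func.__name__,
                        "duration_ms": round(elapsed_ms, 2),
                    },
                )
                raise  # Re-raise so callers still see the original error.
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info(
                "call ok",
                extra={
                    "operation": func.__name__,
                    "duration_ms": round(elapsed_ms, 2),
                },
            )
            return result

        return wrapper

    return decorator
```

To try it, attach a `logging.StreamHandler` with `JsonFormatter()` to a logger, set the logger's level to `INFO`, and decorate a function with `@instrumented(logger)`. Structured lines like these are the sort of thing a log shipper can forward as-is, which keeps the instrumentation independent of the tool choice.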