Use structured logging (JSON with timestamp, trace ID, source, body, log level).
Use trace IDs so you can follow a single request across services and log lines.
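The two tips above can be sketched with the standard library alone. This is a minimal sketch, not a full logging setup: the `trace_id` attribute and the field names mirror the JSON shape suggested in the first tip, and attaching the ID via `extra=` is just one convention among several.

```python
import json
import logging
import time
import uuid

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line with the fields from the first tip:
    timestamp, trace ID, source, body, log level."""
    def format(self, record):
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S",
                                       time.gmtime(record.created)),
            "trace_id": getattr(record, "trace_id", None),  # assumed convention
            "source": "%s:%d" % (record.module, record.lineno),
            "level": record.levelname,
            "body": record.getMessage(),
        })

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Mint one trace ID per incoming request and attach it to every line.
trace_id = uuid.uuid4().hex
logger.info("charging card amount=%s", 1999, extra={"trace_id": trace_id})
```

In a real service the trace ID would arrive in a request header (or be minted at the edge) and be propagated to downstream calls, rather than generated ad hoc like this.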
Don’t log too much.
Log enough to reproduce issues: the body field should carry the relevant context, most likely the input parameters of the function doing the logging.
Use the right log store, most likely one based on Elasticsearch.
Don’t keep too many old logs—accumulate statistics and delete the logs themselves after 30-90 days.
Now, get better at analyzing your logs.
Understand and use the query language your logging system uses to produce efficient queries.
Test a hypothesis with a quick query over a small time range before running the same query against longer ranges that take more time to complete.
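The narrow-first workflow looks the same whatever the query language; here is a sketch over plain JSON lines (the field names follow the structured shape from the first tip, and are assumptions):

```python
import json
from datetime import datetime, timedelta

def query_logs(lines, predicate, start, end):
    """Keep JSON log entries inside [start, end) that match `predicate`."""
    hits = []
    for line in lines:
        entry = json.loads(line)
        ts = datetime.fromisoformat(entry["timestamp"])
        if start <= ts < end and predicate(entry):
            hits.append(entry)
    return hits

log_lines = [  # stand-in for a real log store
    '{"timestamp": "2024-05-01T11:55:00", "level": "ERROR", "body": "timeout"}',
    '{"timestamp": "2024-05-01T09:00:00", "level": "ERROR", "body": "old"}',
    '{"timestamp": "2024-05-01T11:58:00", "level": "INFO", "body": "ok"}',
]

# Check the hypothesis against the last 10 minutes first...
now = datetime(2024, 5, 1, 12, 0)
recent = query_logs(log_lines, lambda e: e["level"] == "ERROR",
                    now - timedelta(minutes=10), now)
# ...and only widen to hours or days once the quick pass confirms it.
```

The same shape applies to Elasticsearch-style stores: run the query with a tight time filter, check the result, then widen the window.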
Understand all the kinds of messages your logs can emit and which issues those messages point to: you should be able to grep your codebase for logging calls and get a complete list of the messages that might appear.
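That grep can be automated. A rough sketch, assuming log calls follow a `logger.level("literal", ...)` shape; messages built from f-strings or concatenation would need a real parser, not a regex:

```python
import re

# Match simple calls like logger.error("payment declined code=%s", code)
LOG_CALL = re.compile(
    r'logger\.(debug|info|warning|error|critical)\(\s*["\']([^"\']+)["\']')

def log_messages(source_text):
    """Return (level, message) pairs for single-literal logging calls."""
    return [(m.group(1), m.group(2)) for m in LOG_CALL.finditer(source_text)]

src = '''
logger.info("cache miss for key=%s", key)
logger.error("payment declined code=%s", code)
'''
# log_messages(src) yields the info and error message templates above
```

Running this over the whole tree gives you the catalogue of everything your logs can possibly say, which is the prerequisite for knowing what each message means when you see it.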
Identify the time range that you’re interested in based on when your issue started.
Find any system changes (deploys, config changes, traffic spikes) that occurred upstream of the issue based on timing.
Measure base rates and discover anomalies.
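One way to make "base rate vs. anomaly" concrete: bucket errors per minute and flag minutes far above the mean. The entry shape and the sigma threshold are assumptions; real systems often use sturdier statistics than mean and standard deviation.

```python
from collections import Counter
from statistics import mean, stdev

def anomalous_minutes(entries, threshold_sigmas=3.0):
    """Count ERROR entries per minute and flag minutes well above the base
    rate. Assumes entries shaped like {"timestamp": "...", "level": "..."}."""
    per_minute = Counter(e["timestamp"][:16]  # truncate to minute
                         for e in entries if e["level"] == "ERROR")
    counts = list(per_minute.values())
    if len(counts) < 2:
        return []  # no base rate to compare against
    base, spread = mean(counts), stdev(counts)
    return [minute for minute, count in per_minute.items()
            if count > base + threshold_sigmas * spread]
```

The point is less the statistics than the habit: know what a normal minute looks like, so an abnormal one stands out.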
Use the trace IDs to put together the story of requests that failed.
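Stitching the story together is a filter and a sort, assuming the JSON field names from the first tip:

```python
def failed_traces(entries):
    """Trace IDs that logged at least one ERROR."""
    return {e["trace_id"] for e in entries if e["level"] == "ERROR"}

def story(entries, trace_id):
    """All log lines for one request, in timestamp order, so the failure
    reads as a narrative: input received, downstream call, error raised."""
    return sorted((e for e in entries if e.get("trace_id") == trace_id),
                  key=lambda e: e["timestamp"])
```

First collect the failed trace IDs, then replay each one's story line by line to see where the request went wrong.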
Keep detailed notes about your investigation so that you can explore and backtrack different possibilities.
Write down the root cause as a tree of “why’s”.
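For a hypothetical outage, such a tree might look like:

```
Checkout requests returned 500s
└─ why? the payment service timed out
   ├─ why? its connection pool was exhausted
   │  └─ why? Friday's deploy lowered the pool size
   └─ why? client retries tripled the request volume
```

Branches matter: a failure often has more than one contributing "why", and the tree keeps you from stopping at the first plausible cause.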
Make your fix and check!