1) How do I handle multiple identical requests?
2) What happens if the database fails?
3) What happens if the server fails?
4) How do I work with time if users are in different timezones?
Etc.
Do you have some kind of cheat sheet?
How does it fail? Are there enough metrics, structured logs, and/or tracing in place to build monitoring and help dig into issues? Do the logs have the information needed to reproduce the condition? Any network access is a potential failure point. Any disk access is a potential failure point. Capture any and all errors. I never want something to fail silently.
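To make the "never fail silently" point concrete, here is a minimal sketch in Python using only the standard library. The helper names `log_event` and `read_config` and the event name `config_read_failed` are made up for illustration; the idea is just that a disk access is wrapped so the error is logged with enough structured context to reproduce it, and then re-raised rather than swallowed.

```python
import json
import logging


def log_event(logger, level, event, **fields):
    """Emit one JSON object per log line so the fields are searchable later."""
    payload = {"event": event, **fields}
    logger.log(level, json.dumps(payload, sort_keys=True, default=str))


def read_config(path, logger):
    """Read a file; on failure, log what was attempted and re-raise."""
    try:
        with open(path, encoding="utf-8") as fh:
            return fh.read()
    except OSError as exc:
        # Disk access is a failure point: record the input and the error type.
        log_event(
            logger,
            logging.ERROR,
            "config_read_failed",
            path=path,
            error=type(exc).__name__,
            detail=str(exc),
        )
        raise  # never swallow the error; the caller decides what to do
```

The same pattern applies to network calls: catch at the boundary, log the inputs and the error type, and let the exception propagate unless there is a deliberate fallback.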
How does it scale? Monitor machine metrics (CPU, memory, disk utilization, network utilization) and responsiveness (timing metrics on network calls and expensive operations). Identify bottlenecks and run profilers.
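As a rough illustration of timing metrics on expensive operations, here is a small Python sketch. The names `timed`, `slowest`, and the in-memory `TIMINGS` dict are hypothetical; a real setup would presumably ship the durations to whatever metrics system you already use instead of keeping them in a dict.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory sink: operation name -> list of durations in seconds.
TIMINGS = {}


@contextmanager
def timed(name, sink=TIMINGS):
    """Record how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink.setdefault(name, []).append(time.perf_counter() - start)


def slowest(sink=TIMINGS):
    """Return the operation with the highest mean duration, or None if empty."""
    if not sink:
        return None
    return max(sink, key=lambda name: sum(sink[name]) / len(sink[name]))
```

Wrapping a call looks like `with timed("db_query"): run_query()`, where `run_query` stands in for your own function. Once `slowest()` points at a candidate bottleneck, that is the place to run a profiler.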
The other part I ask myself is how I can make the tests better, the code more readable, the documentation more useful, and the dashboards more actionable (do they tell a story that will help someone new to the team debug?). Can the runbooks be clearer about working with alerts, and can the alert itself carry better content, including links to the runbooks (a favor to anyone on call at 2am)?
Could I make more money at another job?
Why didn't I take the blue pill?