1. Debugging pods stuck in CrashLoopBackOff
2. Checking events, logs, labels, pods, services
3. Looking for older events which are gone :(
4. Frequently logging into the cloud provider's dashboard to figure out whether the issue is on the provider's side
5. Figuring out why traffic is not reaching downstream applications
6. Ensuring services are selecting the right pods
7. Launching a pod to run curl/dnsutils/awscli
8. For externally exposed services, figuring out whether the ingress is routing traffic correctly and that no other config is superseding it
9. Exec-ing into a pod to check whether configmap/secret changes are reflected, or just killing the pod if feeling too lazy to check
10. Figuring out why a node is not ready
11. Checking RAM and CPU utilization
12. Figuring out how this application is deployed: helm, argocd, flux, tekton, wf
13. Checking whether the manifest has changed recently and comparing versions to spot misconfiguration
14. Comparing the manifest with another environment's manifest to be sure no new config parameter has been missed
15. Building a mental model of the application's context boundary
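
For items 5 and 6, here is a minimal sketch of the selector check in Python. It assumes you already have the Service and Pod objects as plain dicts (for example, parsed from `kubectl get ... -o json`). The helper names and the sample objects are made up for illustration.

```python
from typing import Dict, List


def selector_matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """Return True when every key/value pair in the selector is in the labels."""
    # A Service without a selector gets no automatically managed endpoints,
    # so this sketch treats an empty selector as matching nothing.
    if not selector:
        return False
    return all(labels.get(key) == value for key, value in selector.items())


def pods_selected(service: dict, pods: List[dict]) -> List[str]:
    """Names of pods in the Service's namespace whose labels satisfy its selector."""
    selector = service.get("spec", {}).get("selector") or {}
    namespace = service["metadata"].get("namespace", "default")
    return [
        pod["metadata"]["name"]
        for pod in pods
        if pod["metadata"].get("namespace", "default") == namespace
        and selector_matches(selector, pod["metadata"].get("labels") or {})
    ]


# Hypothetical sample objects, trimmed to the fields the check needs.
service = {
    "metadata": {"name": "web", "namespace": "prod"},
    "spec": {"selector": {"app": "web", "tier": "frontend"}},
}
pods = [
    {"metadata": {"name": "web-1", "namespace": "prod",
                  "labels": {"app": "web", "tier": "frontend"}}},
    # Missing the "tier" label, so the Service will not route to it.
    {"metadata": {"name": "web-2", "namespace": "prod",
                  "labels": {"app": "web"}}},
    # Right labels, wrong namespace.
    {"metadata": {"name": "web-3", "namespace": "staging",
                  "labels": {"app": "web", "tier": "frontend"}}},
]

selected = pods_selected(service, pods)
print(selected or "no pods match; the Service will have no endpoints")
```

An empty result here is usually the quickest explanation for "traffic is not reaching the app": the Service exists but has nothing behind it.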
Have you felt the same? I want to automate this. Which feature should I implement first?
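
To make the idea concrete, here is a rough sketch of what automating item 1 could look like. It assumes the input is the parsed output of `kubectl get pods -A -o json`; the function name and the sample data are hypothetical, and this is only one possible starting point, not a finished tool.

```python
from typing import List


def crashlooping_containers(pod_list: dict) -> List[dict]:
    """Summarize containers currently waiting with reason CrashLoopBackOff."""
    findings = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for cs in statuses:
            waiting = (cs.get("state") or {}).get("waiting") or {}
            if waiting.get("reason") != "CrashLoopBackOff":
                continue
            # lastState.terminated describes how the previous attempt ended.
            last = (cs.get("lastState") or {}).get("terminated") or {}
            findings.append({
                "namespace": meta.get("namespace"),
                "pod": meta.get("name"),
                "container": cs.get("name"),
                "restarts": cs.get("restartCount", 0),
                "last_exit_code": last.get("exitCode"),
                "last_reason": last.get("reason"),
            })
    # Most-restarted containers first.
    return sorted(findings, key=lambda f: f["restarts"], reverse=True)


# Hypothetical sample in the shape of `kubectl get pods -o json` output.
sample = {
    "items": [
        {
            "metadata": {"name": "api-7d9f", "namespace": "prod"},
            "status": {"containerStatuses": [{
                "name": "api",
                "restartCount": 12,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"exitCode": 137,
                                             "reason": "OOMKilled"}},
            }]},
        },
        {
            "metadata": {"name": "web-5c4b", "namespace": "prod"},
            "status": {"containerStatuses": [{
                "name": "web",
                "restartCount": 0,
                "state": {"running": {"startedAt": "2024-01-01T00:00:00Z"}},
            }]},
        },
    ]
}

for finding in crashlooping_containers(sample):
    print(finding)
```

The last exit code and reason often narrow things down before you even open the logs, which is why the sketch surfaces them next to the restart count.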
This might be helpful:
https://learnk8s.io/troubleshooting-deployments
I am planning to do the same for a platform that I am building, which is deployed on-prem. Let me know if this is an open source project.