Sounds like you are unhappy with the quality of the code and its error handling. Seems like it was written by junior devs? I would recommend having the team read the usually recommended books if they haven't already. At the bare minimum:
* Refactoring by Fowler
* Clean Code by Robert Martin
If you are using an OO language:
* Practical Object-Oriented Design in Ruby
The book has "in Ruby" in the title, but it's a general-purpose book on what makes OO design "good".
Then follow it up with
* Patterns of Enterprise Application Architecture
* Clean Architecture
These books are not perfect or the be-all and end-all, and parts of them might be slightly dated, but they get you a large chunk of the way to the promised land.
Lastly, if the application is working, the users are happy, there is no bug infestation and you are not having issues with releasing new features, don't feel the pressure to immediately "fix" the code.
Take a look at Sentry[0]. It will catch exceptions, group them by type, count them, display them with the stack trace, and more. It has integrations with Slack, GitLab/GitHub, etc. It makes creating issues and alerting you easier. You won't have to dig into dozens of log files and miss exceptions that happened a few hours ago but are completely drowned by log messages.
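To make the "group them by type, count them" idea concrete, here is a toy sketch in plain Python. It is not Sentry's API or how Sentry is implemented; `ExceptionTracker` and its methods are names I made up for illustration, and everything lives in memory.

```python
import traceback
from collections import Counter


class ExceptionTracker:
    """Toy in-memory tracker (hypothetical, not Sentry): groups exceptions by type."""

    def __init__(self):
        self.counts = Counter()
        self.samples = {}

    def capture(self, exc):
        key = type(exc).__name__
        self.counts[key] += 1
        # Keep the first stack trace seen for each group as a representative sample.
        self.samples.setdefault(
            key,
            "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)),
        )

    def report(self):
        # Most frequent exception types first.
        return self.counts.most_common()


tracker = ExceptionTracker()
for raw in ["1", "x", "0", "y"]:
    try:
        10 / int(raw)
    except Exception as exc:
        tracker.capture(exc)

for name, count in tracker.report():
    print(f"{name}: {count}")
```

The point is only that a grouped, counted view with one sample trace per group is much easier to scan than raw log lines. A real tool adds persistence, alerting, and the integrations mentioned above.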
Take a look at Prometheus[1] and Grafana[2]: you can set them up and have the state of your infrastructure in a dashboard (services up or down, how many times they were down and for how long, latency, CPU, GPU, disk space, RAM, etc.). Whatever matters to you, put it there. You can create custom alerts (example: storage 90% full) to give yourself a heads up and act beforehand.
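The "storage 90% full" alert boils down to a threshold check. Below is a minimal standalone sketch of that logic using only the Python standard library. It is not a Prometheus alert rule; the function names and the 0.90 default are my own assumptions for illustration.

```python
import shutil


def storage_alert(used_bytes, total_bytes, threshold=0.90):
    """Return a warning message if usage is at or above `threshold`, else None."""
    if total_bytes <= 0:
        return None  # Nothing sensible to report for an empty or unknown filesystem.
    fraction_used = used_bytes / total_bytes
    if fraction_used >= threshold:
        return f"ALERT: storage is {fraction_used:.0%} full"
    return None


def check_path(path=".", threshold=0.90):
    """Check the filesystem that holds `path` (hypothetical helper)."""
    usage = shutil.disk_usage(path)
    return storage_alert(usage.used, usage.total, threshold)


message = check_path(".")
if message:
    print(message)
```

In a real setup you would let the monitoring stack evaluate the rule and send the notification, rather than running a script like this by hand.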
Look at whatever you do as a team when there's something wrong, figure out all the information you usually need to troubleshoot, and have it all displayed so you can glance at something and know quickly where something is wrong. This is reactive, but it's a start and you won't waste
For development, you can have issue templates for incidents (a service shut down and nobody saw it, etc.) and for bugs, to lower the barrier to writing good-quality incident reports. This way, people know what to write, where to write it, and how to write it. Put a tag in the template, the people to CC, whatever makes your life easier. Typical sections: summary, impact, recovery, investigation, future prevention.
One benefit is that once you have these incident reports, patterns emerge fast: the most frequent and the most impactful problems surface pretty quickly.
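As a rough sketch of how structured reports make patterns visible, here is a small Python example. The five text fields mirror the sections listed above; the `tag` and `minutes_down` fields, the helper functions, and the sample data are all hypothetical and only there to show the tallying.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class IncidentReport:
    summary: str
    impact: str
    recovery: str
    investigation: str
    future_prevention: str
    tag: str               # hypothetical label, e.g. the affected service
    minutes_down: int = 0  # hypothetical measure of impact


def most_frequent(reports):
    """Tags ordered by how many incidents carry them."""
    return Counter(r.tag for r in reports).most_common()


def most_impactful(reports):
    """Tags ordered by total downtime in minutes."""
    totals = Counter()
    for r in reports:
        totals[r.tag] += r.minutes_down
    return totals.most_common()


# Made-up sample data, purely for illustration.
reports = [
    IncidentReport("Worker stalled", "Jobs delayed", "Restarted", "Memory leak",
                   "Add memory alert", tag="queue-worker", minutes_down=5),
    IncidentReport("Worker stalled again", "Jobs delayed", "Restarted", "Same leak",
                   "Fix the leak", tag="queue-worker", minutes_down=10),
    IncidentReport("Disk full", "Site down", "Freed space", "Logs not rotated",
                   "Rotate logs", tag="database", minutes_down=120),
]

print(most_frequent(reports))
print(most_impactful(reports))
```

Even with three reports the two views disagree: the worker fails most often, but the database outage cost the most downtime. That is the kind of thing a consistent template lets you see at a glance.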
Once this is done, it will save you time and effort that you can put into reading more on the subject. Search for "Site Reliability Engineering", or "SRE"[3]. There are a few books, some more abstract and others more practical[4][5].
Take a look at Enterprise Ready[6]. It talks about the most common requirements and features in an enterprise product (SSO, RBAC, etc).
- [0]: https://sentry.io
- [1]: https://prometheus.io/
- [2]: https://grafana.com/
- [3]: https://en.wikipedia.org/wiki/Site_reliability_engineering
- [4]: https://sre.google/books/
- [5]: "Seeking SRE, Conversations about Running Production Systems at Scale"