What are some preventive fault tolerance tools?

Question

I'm looking for tools that integrate well with AWS CloudWatch, Datadog, and other telemetry and logging systems and can predict infrastructure errors before they happen. Ideally they would also integrate with GitHub to pull PR data and assess whether a deployment has a high chance of failure.

The basic idea is to build a time-series representation of every action in the infrastructure (Infrastructure as Code). That means treating every action (code change, permission change, deployment, error log) as a first-class object and arranging them in chronological order. That timeline can then be fed as context to a ChatGPT model to predict what might happen.

Do you see the value in this? Or am I crazy? The benefit is that when something breaks, all the teams can get a high-level overview of what is happening in the system. The problem with existing logging tools like Datadog is that they understand each metric in depth but fail to assign severity levels to error logs or present a bird's-eye view of the whole infrastructure.

Disclaimer: We are a VC-backed company that wants to pivot in this direction. Your input would be very helpful.
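
To make the idea concrete, here is a minimal Python sketch of the timeline described above, where every action is a first-class event and per-source streams are merged in time order. The names `InfraEvent`, `EventKind`, `merge_timelines`, `events_leading_up_to`, and `render_context` are hypothetical, and the sketch assumes each source already delivers its events sorted by timestamp. It does not call any real CloudWatch, Datadog, or GitHub API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
import heapq
from typing import Iterable, Iterator, List


class EventKind(Enum):
    # The four action types named in the question.
    CODE_CHANGE = "code_change"
    PERMISSION_CHANGE = "permission_change"
    DEPLOYMENT = "deployment"
    ERROR_LOG = "error_log"


@dataclass(frozen=True)
class InfraEvent:
    """One action in the infrastructure, treated as a first-class object."""
    timestamp: datetime
    kind: EventKind
    source: str   # hypothetical label, e.g. "github" or "cloudwatch"
    summary: str


def merge_timelines(*streams: Iterable[InfraEvent]) -> Iterator[InfraEvent]:
    """Merge per-source streams into one timeline.

    Assumes every input stream is already sorted by timestamp.
    """
    return heapq.merge(*streams, key=lambda event: event.timestamp)


def events_leading_up_to(
    timeline: Iterable[InfraEvent], moment: datetime, window: timedelta
) -> List[InfraEvent]:
    """Return the events that fall within `window` before (and at) `moment`."""
    start = moment - window
    return [e for e in timeline if start <= e.timestamp <= moment]


def render_context(events: Iterable[InfraEvent]) -> str:
    """Format events as plain text, one per line, e.g. as context for a model."""
    return "\n".join(
        f"{e.timestamp.isoformat()} [{e.kind.value}] {e.source}: {e.summary}"
        for e in events
    )
```

When an error log appears, `events_leading_up_to` can pull the code changes, permission changes, and deployments that preceded it, and `render_context` turns that slice into text that a model or a human can read.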

overu589 · Accepted Answer

While playing a modern sandbox adventure game, I became enamored with its logic loop system.
Essentially, everything is a little machine with inputs and outputs, all responding reactively to changes in world state. Everything has an interface and a simple list-style prioritization. Clicking on anything shows its last message (such as an error message, or that it is working fine).
There could be a “global chat” where everything logs (that environment didn’t have a helpful combined log).
It gave me (a business applications developer) a dream of a simulated environment where our business processes have well-defined interfaces (StrictYAML?) and interact in a world simulation environment, whose inputs and outputs could be interfaces with the real world.
Feel me? So I could write an “agent” (logical, not necessarily an LLM) that performs a data task (periodic searches for a pattern). These agents roll up world state, and your watchdog agent alerts when preconditions are met (volume, failure rate, a drastic change in some frequency).
The result is an effective sandbox for simulating and emulating the business domain. In this environment one could produce workers that search for and respond to any pattern.
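
A rough Python sketch of the agent and watchdog idea, to show how little machinery it needs. Everything here is hypothetical: `WorldState`, `PatternAgent`, and `Watchdog` are illustrative names, the pattern search is a plain substring match, and the only preconditions checked are volume and failure rate over a sliding window.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class WorldState:
    """Rolled-up state: the most recent `window` outcomes (True = failure)."""
    window: int = 20
    outcomes: Deque[bool] = field(default_factory=deque)

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)
        if len(self.outcomes) > self.window:
            self.outcomes.popleft()  # drop the oldest outcome

    @property
    def volume(self) -> int:
        return len(self.outcomes)

    @property
    def failure_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0


@dataclass
class PatternAgent:
    """A logical agent (not an LLM) that searches each input for a pattern."""
    pattern: str

    def observe(self, line: str, state: WorldState) -> None:
        # A line containing the pattern counts as a failure in world state.
        state.record(self.pattern in line)


@dataclass
class Watchdog:
    """Alerts when the preconditions (volume and failure rate) are both met."""
    min_volume: int
    max_failure_rate: float
    last_message: str = "ok"  # what "clicking on" this machine would show

    def check(self, state: WorldState) -> Optional[str]:
        if (state.volume >= self.min_volume
                and state.failure_rate > self.max_failure_rate):
            self.last_message = (
                f"ALERT: failure rate {state.failure_rate:.0%} "
                f"over last {state.volume} events"
            )
            return self.last_message
        self.last_message = "ok"
        return None
```

Each machine keeps its own `last_message`, which mirrors the "click on anything to see its last message" behavior from the game. A combined log would just be every machine appending to one shared list.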