The basic high-level pattern for self-healing is a process that "watches" the status of other processes and restarts failed ones:
- a "coordinator" process that acts as a "monitor" of other work processes and does regular health-check pings. If a work process isn't responding, the coordinator/monitor kills and restarts it.
- the work processes are engineered to write "checkpoints" of their progress (a status file on disk, an entry in a db, etc.). They can also have a thread that responds to network pings from the coordinator (a rough sketch follows this list)
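A rough sketch of that loop in Python (the worker command, checkpoint file, and timeouts are made-up placeholders; this is the shape of the idea, not a production supervisor):

    # coordinator.py -- minimal monitor/restart loop (illustrative sketch only)
    import subprocess, time, os

    WORKER_CMD = ["python", "worker.py"]   # hypothetical worker script
    CHECKPOINT = "/tmp/worker.heartbeat"   # worker touches this file as its "checkpoint"
    PING_INTERVAL = 5                      # seconds between health checks
    STALE_AFTER = 30                       # consider the worker hung after this many seconds

    def start_worker():
        return subprocess.Popen(WORKER_CMD)

    worker = start_worker()
    while True:
        time.sleep(PING_INTERVAL)
        died = worker.poll() is not None
        try:
            hung = time.time() - os.path.getmtime(CHECKPOINT) > STALE_AFTER
        except FileNotFoundError:
            hung = True                    # no checkpoint yet: treat as not responding
        if died or hung:
            worker.kill()                  # kill (a no-op if it's already dead) ...
            worker.wait()
            worker = start_worker()        # ... and restart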
A lot of systems use the pattern above. E.g. Oracle RDBMS has the PMON process (literally "Process MONitor") to look for hung SQL query processes. Erlang/OTP has a "supervisor" that kills/restarts processes. Kubernetes orchestration has the concept of "liveness probes" for container work processes and restarts broken ones.
That coordinator-process-and-worker-process pattern can be nested into multi-level hierarchies. Inside a single server there's a coordinator-and-worker pattern -- and there's another, data-center-level coordinator-and-worker pattern that watches all the servers.
Also, "Self-Healing" is a subtopic of "Fault Tolerance", so you'd get some more hits by searching for "fault tolerance". I've put links to some writings I found helpful on that here: https://news.ycombinator.com/item?id=33954078
Be constantly asking yourself: is this the simplest way to do this? Is there a simpler way? e.g. Do you need a database? Do you need a server at all? How far can you get without these things? On the other hand maybe you're looking at newer serverless architectures? Is it really simpler to do that than use a VPS?
Be constantly explaining to yourself why the way you're doing it is the simplest and most robust way.
You can always just hard-restart an app every minute or so to be resilient to nearly any runtime failure (not that I've _ever_ done that before...), but if the data gets into an invalid state you're stuck.
The one time I needed extreme resiliency and recoverability, I used an append-only DB with a materialized view that updated on change or at startup, with every write in a transaction. I also tailed the DB updates to a file on disk, which replicated regularly off-site. It could automatically recover from nearly anything, and was remarkably easy to set up. The hard part was the materialized view, but I "needed" that anyway, as I wanted to keep a full audit log as the primary db.
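A rough sketch of that shape using SQLite (the table names and balance example are invented; the point is the append-only log as the source of truth plus a view you can rebuild from it at startup):

    import sqlite3

    db = sqlite3.connect("audit.db")
    # Append-only event log as the primary store (also the audit log).
    db.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, account TEXT, delta INTEGER, at TEXT)")
    # "Materialized view": derived state, rebuildable from the log.
    db.execute("CREATE TABLE IF NOT EXISTS balances (account TEXT PRIMARY KEY, balance INTEGER)")

    def append_event(account, delta):
        with db:                           # every write in a transaction
            db.execute("INSERT INTO events (account, delta, at) VALUES (?, ?, datetime('now'))",
                       (account, delta))
            db.execute("""INSERT INTO balances (account, balance) VALUES (?, ?)
                          ON CONFLICT(account) DO UPDATE SET balance = balance + excluded.balance""",
                       (account, delta))

    def rebuild_view():                    # recovery path: re-derive the view from the log
        with db:
            db.execute("DELETE FROM balances")
            db.execute("INSERT INTO balances SELECT account, SUM(delta) FROM events GROUP BY account")

Because the log is never updated in place, recovery is just "replay the log"; the view is a cache you can throw away.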
What constitutes resilient data is going to be unique to your use case of course, but consider resiliency from the DB up.
Also, I suggest investing heavily in grokkable and relevant runtime observability. (Don't just emit inline comments; put some thought into relevant data and alerts.) Often you'll see a failure coming days before it causes a problem, and you won't need the app to self-heal.
- idempotency / consider all state as a cost
- start off with some chaos monkey testing - e.g. establish automatic regular restarts/re-deploys from the start
Neither is a silver bullet, but they ensure that you won't have too much anxiety and can act robustly when something goes wrong. They also have a host of positive downstream effects, such as making it easier to set up new machines and/or scale.
Why Do Computers Stop and What Can Be Done About It? (1985) https://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf
Their NonStop Architecture https://www.hpl.hp.com/techreports/tandem/TR-86.2.pdf
QNX's systems are ultra-reliable, too https://cseweb.ucsd.edu/~voelker/cse221/papers/qnx-paper92.p...
OpenVMS clusters' uptime was years to decades & mixed CPU ISA's https://en.wikipedia.org/wiki/VMScluster
Systems that run forever and self-heal (Armstrong) https://www.youtube.com/watch?v=cNICGEwmXLU
Microreboots https://dslab.epfl.ch/pubs/microreboot.pdf
Minix 3: Microkernel-based, self-healing, POSIX OS https://www.youtube.com/watch?v=bx3KuE7UjGA
12 factor apps is a good starting point in general: https://12factor.net/
Crash early. Make sure your application can recover from crashes, or at the very least that a crash caused by one client doesn't affect any others.
Break it up into many completely independent components. In my case, this is in the form of scripts which read from and write to queues. Each script reads from one queue and writes to another.
A lot of redundancy. From one queue to the next, there are at least two pathways which the data can take.
Lots of sanity checks. Wherever you are taking input, check that it has the expected format, shape, and content before processing it.
More redundancy. Write two versions of the same script in two different languages and make the system run them side by side and compare the outputs. If the outputs differ, there is a problem, and you should switch to the script which produces the correct expected output (and alert the operator.)
Avoid doing dangerous things. For example, querying the database using freeform strings is dangerous, so only query it using sanity-checked identifiers built from a predefined list of allowed characters, which does not include quotes or anything else weird. Running scripts as a direct result of a user request is dangerous, so serve only static HTML as much as possible. And so on.
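For example, a sketch of that allow-list idea in Python (the `users` table and column names are hypothetical):

    import re, sqlite3

    SAFE_IDENTIFIER = re.compile(r"^[A-Za-z0-9_]+$")   # no quotes, no spaces, nothing weird
    ALLOWED_COLUMNS = {"id", "email", "created_at"}    # predefined allow-list of identifiers

    def fetch_column(db, column, user_id):
        if column not in ALLOWED_COLUMNS or not SAFE_IDENTIFIER.match(column):
            raise ValueError("refusing to query with suspicious identifier: %r" % column)
        # the identifier is allow-listed; the value still goes through a bound parameter
        return db.execute(f"SELECT {column} FROM users WHERE id = ?", (user_id,)).fetchone()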
2. Use a managed cloud orchestration system. Autoscaling groups, AWS ECS Fargate, App runner frameworks, Serverless, Managed K8s.
3. Run operations from chatops, gitops, or a web UI. By making operations work over a remotely accessible communications tool, you can make changes from anywhere, anytime, and never have to deal with local environment setup or resource constraints.
4. Do development in the cloud (Cloud Shell, Codespaces, DevSpace, etc). Same rationale as #3.
5. Make everything as immutable as possible. If something fails, throw it away and replace it with a known good artifact.
If you can't run it from an API and pay someone to manage it, it's a waste of your time and money and not reliable. Don't be your own car mechanic for your business; lease a truck. If it ever takes off, you can easily give others access. And you can use this pattern for anything.
Next step is introspection: aggregate monitoring and enough detail to figure out if there are issues.
Next step is being notified when things break. I.e. anomaly detection and alerting.
Then, debuggability. Enough detail to solve issues. Disaster recovery testing is part of ensuring you actually have this, and not just believe you do.
Aside from that, there's CI/CD, automated scaling, automated isolation of bad actors. There are so many things one could do, but this also depends on how large the team is. I'll argue that this type of automation isn't that important if it's just one person.
The SRE book(s) [1] contain many of these high-level ideas. Don't try to do them all at once. :) (Bias: Niall, one of the editors, was my manager when I joined Google SRE.)
If you write software from the beginning as if the only way to exit it is with SIGKILL, and if you make the application crash itself on any sign of fatal error, you get a reliable system.
1. Identify the state of your app (starting, running, ready, etc).
2. Stop traffic once the server goes down.
Very often you'll have state confusion (SIGTERM triggered, but the server kept accepting requests). Make sure your signal handling works well across both your ingress and the server.
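A minimal sketch of that in Python (the drain logic is a placeholder; the point is that SIGTERM flips a flag, the readiness check starts failing, and the process only exits after in-flight work and the ingress have caught up):

    import signal, time

    shutting_down = False

    def handle_sigterm(signum, frame):
        global shutting_down
        shutting_down = True               # readiness checks should now report "not ready"

    signal.signal(signal.SIGTERM, handle_sigterm)

    # main loop: stop accepting new work once SIGTERM arrives, finish in-flight work, then exit
    while not shutting_down:
        time.sleep(1)                      # ... accept and process requests here ...
    time.sleep(10)                         # grace period: let the ingress notice and drain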
A nice and easy hack is to have a /status endpoint in all of your apps that returns:
1. the current commit deployed
2. the availability of dependencies (db reachable? db connected? any missing environment variables?).
3. which instance/pod/server is serving this request (just returning the hostname typically works).
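A minimal sketch of such a /status endpoint using only the Python standard library (the env var names and the DB check are assumptions; swap in whatever your stack actually uses):

    import json, os, socket
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class StatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/status":
                self.send_error(404)
                return
            status = {
                "commit": os.environ.get("GIT_COMMIT", "unknown"),   # baked in at deploy time
                "db_reachable": check_db(),                          # your own dependency check
                "missing_env": [v for v in ("DATABASE_URL", "API_KEY") if v not in os.environ],
                "host": socket.gethostname(),                        # which instance served this
            }
            body = json.dumps(status).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    def check_db():
        return True    # placeholder: try a cheap query against your real DB here

    HTTPServer(("", 8080), StatusHandler).serve_forever()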
At a high level, every operation had a try-catch that caught all exceptions. (This is similar to panic/recover in Go, but the semantics are very different.) A lot of operations fail on edge cases that you can never fully anticipate. (We had to deal with oddball network errors that wouldn't reproduce in our development/test environment, oddball errors caused by 3rd-party applications that we didn't have, etc., etc.) It's important to have a good failure model...
... Which brings us to exponential retry: basically, for operations that could fail on corner cases but essentially "had" to work, we'd retry with an exponentially increasing delay. First retry after 1 second, then 2 seconds, then 4 seconds... Eventually we'd cap the delay at 15 minutes. This is important because sometimes a bug or other unpredictable situation will prevent an operation from completing, but we don't want to spam a server or gobble up CPU.
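A sketch of that retry loop (the 1-second base and 15-minute cap are the numbers from above; the jitter is an optional extra to avoid synchronized retries):

    import random, time

    def retry_forever(operation, base=1.0, cap=15 * 60):
        """Retry until it works, doubling the delay each attempt, capped at 15 minutes."""
        delay = base
        while True:
            try:
                return operation()
            except Exception as exc:
                print(f"operation failed ({exc}); retrying in {delay:.0f}s")
                time.sleep(delay + random.uniform(0, delay / 10))   # small jitter
                delay = min(delay * 2, cap)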
Try to make almost all operations transactional. (Either they succeed or fail, but they never happen in an incomplete / corrupted manner.) You can get that "for free" when you use a SQL database: We used SQLite for local state and almost exclusively used SQL in the server. For files that stored human-readable (QA, debugging) XML/JSON, we wrote them in a transactional manner: Rename the existing file, write the new version, delete the old version. When reading, if the old version existed, discard the new version and read the old one. We also implemented transactional memory techniques so that code wouldn't see failed operations.
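A sketch of the rename-based file protocol described above, assuming a single writer (this mirrors the steps in that paragraph rather than the more common write-temp-then-rename approach):

    import os

    def write_transactional(path, data):
        old = path + ".old"
        if os.path.exists(path):
            os.rename(path, old)           # 1. rename the existing file
        with open(path, "w") as f:         # 2. write the new version
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        if os.path.exists(old):
            os.remove(old)                 # 3. delete the old version

    def read_transactional(path):
        old = path + ".old"
        if os.path.exists(old):            # a write was interrupted:
            if os.path.exists(path):
                os.remove(path)            # discard the (possibly partial) new version
            os.rename(old, path)           # and fall back to the old one
        with open(path) as f:
            return f.read()

If a crash happens anywhere in the write, either the old version or a complete new version survives; you never read a half-written file.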
Finally: concurrency (threading) bugs are very hard to find, because they tend to pop up randomly and aren't easily reproducible. The best way to do concurrency is to not do it at all. If you can make your whole application single-threaded and queue operations, you won't have concurrency bugs. If you have to do concurrency, make sure you understand techniques like immutability, read/write locking, lock ordering, and reducing the total number of things you need to lock on. Techniques like compare-exchange allow you to write multithreaded code that doesn't lock/deadlock. Immutability allows you to have non-blocking readers, if readers can tolerate stale state.
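The "don't do concurrency at all" version in Python: every mutation goes through one worker thread reading from a queue, so shared state is only ever touched by that thread (the operation names are illustrative):

    import queue, threading

    operations = queue.Queue()

    def worker():
        state = {}                         # only this thread ever touches `state`
        while True:
            op, args = operations.get()    # blocks until an operation is queued
            op(state, *args)               # apply operations one at a time
            operations.task_done()

    threading.Thread(target=worker, daemon=True).start()

    # other threads just enqueue work instead of mutating shared state:
    def set_key(state, key, value):
        state[key] = value

    operations.put((set_key, ("greeting", "hello")))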
I don't have a great resource on these frameworks or patterns, but my approach has been to learn from the characteristics of historically successful systems whose capabilities I want to emulate (if I can't use the framework directly).
Restart=always
RestartSec=30
is enough for 99% of small apps.

If you can get away with VPS(es) and a cloud-hosted DB, that will be by far the simplest solution to manage. Or "serverless" if the service is small enough and doesn't need to be persistently running.
Once it starts bringing in actual money, you might then start thinking about Docker, k8s, and other fancy stuff.
We use AWS Elastic Container Service (ECS) to ensure all of the services are running and to scale when necessary. We use NSQ to make sure tasks are sent and re-sent to workers until they complete the entire pipeline. And we use Redis to store interim processing data from each worker.
Keeping that interim data (state) in a place where all workers can access it is key. ECS kills workers when it wants, and workers occasionally die due to out-of-memory exceptions. With the state info stored in Redis, a new worker can pick up any failed task where the last worker left off.
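A sketch of that checkpointing idea with redis-py (the key layout and step list are invented for illustration; the real system's structure is described in the links below):

    import redis

    r = redis.Redis(host="localhost", port=6379)     # shared, network-accessible state store

    STEPS = ["validate", "package", "store", "record"]   # hypothetical pipeline steps

    def process(task_id):
        done = r.hget(f"task:{task_id}", "last_completed_step")
        start = STEPS.index(done.decode()) + 1 if done else 0
        for step in STEPS[start:]:         # resume where the previous worker left off
            run_step(step, task_id)        # the actual work for this step
            r.hset(f"task:{task_id}", "last_completed_step", step)

    def run_step(step, task_id):
        pass                               # placeholder for the real worker logic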
This system has been running well in production under heavy load. It replaces an older server-based system that used different technologies to handle the same responsibilities: supervisord to keep processes running and BoltDB to store interim processing data. That system worked, but could not scale horizontally because BoltDB is a local, disk-based store. Distributed workers need a shared, network-accessible store to share state info.
You'll find a detailed overview with diagrams of the new system at https://aptrust.github.io/preserv-docs/overview/
There's a shorter write-up of the goals and how we achieved them at https://sheprador.com/2022/12/architecting-for-the-cloud/
This stuff isn't too hard, as long as you get the pieces right. Try to stick to fredley's advice. Keep things simple and you'll save yourself many headaches going forward. Also, be sure your workers handle SIGTERM and SIGKILL explicitly, cleaning up their work as much as possible before they die.