HACKER Q&A
📣 KingOfCoders

Best practices for self-healing apps?


What are your best practices for self-healing apps that keep server maintenance low for a solo founder? I use Go, but general patterns are welcome.


  👤 jasode Accepted Answer ✓
>I use go but general patterns are welcome.

The basic high-level pattern for self-healing is a process that "watches" the status of other processes and restarts failed ones:

- a "coordinator" process that acts as a "monitor" of the other worker processes and does regular health-check pings. If a worker process isn't responding, the coordinator/monitor kills and restarts it.

- the worker processes are engineered to write "checkpoints" of status progress (a status file on disk, an entry in a db, etc.). They can have a thread that responds to network pings from the coordinator.
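A minimal Go sketch of that monitor loop, assuming a single worker binary that exposes a health endpoint (the binary name, port, and intervals are illustrative, not part of any particular system):

    // coordinator.go: a "monitor" that pings a worker and restarts it when it
    // stops responding. A sketch only; a real one would back off and alert.
    package main

    import (
    	"context"
    	"log"
    	"net/http"
    	"os/exec"
    	"time"
    )

    // startWorker launches the (hypothetical) worker binary.
    func startWorker() *exec.Cmd {
    	cmd := exec.Command("./worker")
    	if err := cmd.Start(); err != nil {
    		log.Fatalf("cannot start worker: %v", err)
    	}
    	return cmd
    }

    // healthy does one health-check ping with a short timeout.
    func healthy(url string) bool {
    	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    	defer cancel()
    	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    	resp, err := http.DefaultClient.Do(req)
    	if err != nil {
    		return false
    	}
    	resp.Body.Close()
    	return resp.StatusCode == http.StatusOK
    }

    func main() {
    	worker := startWorker()
    	for range time.Tick(10 * time.Second) {
    		if healthy("http://localhost:8081/healthz") {
    			continue
    		}
    		log.Println("worker unresponsive, killing and restarting")
    		_ = worker.Process.Kill()
    		_ = worker.Wait() // reap the old process before starting a new one
    		worker = startWorker()
    	}
    }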

A lot of systems use the pattern above. E.g. Oracle RDBMS has a "PMON" process (literally "Process MONitor") that looks for hung SQL query processes. Erlang/OTP has a "supervisor" that kills/restarts processes. Kubernetes orchestration has the concept of "liveness probes" for the container work processes and restarts broken ones.

That coordinator-process-and-worker-process pattern can be nested into multi-level hierarchies. Inside a single server is a coordinator-process-and-worker-process pattern -- but there's another data-center-level coordinator-process-and-worker-process pattern that watches all the servers.

Also, "self-healing" is a subtopic of "fault tolerance", so you'd get more hits by searching for "fault tolerance". I posted some links to writings I found helpful on that: https://news.ycombinator.com/item?id=33954078


👤 fredley
The three most important things are: simplicity, simplicity and simplicity. Simpler things have fewer ways to break, and when they do break, they're easier to reason about.

Be constantly asking yourself: is this the simplest way to do this? Is there a simpler way? E.g. do you need a database? Do you need a server at all? How far can you get without these things? On the other hand, if you're looking at newer serverless architectures: is that really simpler than using a VPS?

Be constantly explaining to yourself why the way you're doing it is the simplest and most robust way.


👤 residualmind
One thing I've noticed that is sometimes forgotten, especially at earlier stages, is monitoring. You want to know how much self-healing is actually happening.

Let's say you have your self-healing system in place: some k8s pods combined into a service, with a little redundancy and very little state. Pods happily crash, another one takes over while a new one spins up. All is wonderful, and you don't worry about your availability anymore because everything just always works. Then one day you decide to look into what's happening in your containers and are shocked because one pod crashes every 0.3 seconds. It spins up, answers one request, then dies, and a new one spins up... continuously. From the outside everything looks kind of OK, but in reality you are wasting massive resources and have a nasty bug that might even be losing you data, breaking consistency, creating load, etc.

Some sort of monitoring is a good idea, is what I'm saying.

👤 DelaneyM
Something not mentioned yet (at time of writing) is the importance of self-healing data.

You can always just hard-restart an app every minute or so to be resilient to nearly any runtime failure condition (not that I've _ever_ done that before...), but if the data gets into an invalid state, you're stuck.

The one time I needed extreme resiliency and recoverability, I used an append-only DB with a materialized view that updated on change or startup, and every write in a transaction. I also tailed the DB updates to a file on disk that replicated regularly off-site. It could automatically recover from nearly anything, and it was remarkably easy to set up. The hard part was the materialized view, but I "needed" that anyway since I wanted to keep a full audit log as the primary db.
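As a rough illustration of that append-only-log-plus-rebuilt-view idea, here's a hedged Go sketch (the Event type, file path, and "balance" view are made up for the example; a real system would also fsync, lock, and handle schema changes):

    // eventlog.go: append-only event log; the "view" is rebuilt by replaying it.
    package main

    import (
    	"bufio"
    	"encoding/json"
    	"fmt"
    	"os"
    )

    type Event struct {
    	Kind   string `json:"kind"` // e.g. "deposit" or "withdraw"
    	Amount int    `json:"amount"`
    }

    // Append writes one JSON event per line; existing lines are never rewritten.
    func Append(path string, e Event) error {
    	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
    	if err != nil {
    		return err
    	}
    	defer f.Close()
    	b, _ := json.Marshal(e)
    	_, err = f.Write(append(b, '\n'))
    	return err
    }

    // Rebuild replays the whole log to reconstruct the materialized view.
    func Rebuild(path string) (int, error) {
    	f, err := os.Open(path)
    	if err != nil {
    		return 0, err
    	}
    	defer f.Close()
    	balance := 0
    	sc := bufio.NewScanner(f)
    	for sc.Scan() {
    		var e Event
    		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
    			continue // skip a corrupt line rather than fail the whole rebuild
    		}
    		switch e.Kind {
    		case "deposit":
    			balance += e.Amount
    		case "withdraw":
    			balance -= e.Amount
    		}
    	}
    	return balance, sc.Err()
    }

    func main() {
    	const logPath = "events.log"
    	_ = Append(logPath, Event{Kind: "deposit", Amount: 100})
    	_ = Append(logPath, Event{Kind: "withdraw", Amount: 30})
    	balance, _ := Rebuild(logPath)
    	fmt.Println("rebuilt balance:", balance)
    }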

What constitutes resilient data is going to be unique to your use case of course, but consider resiliency from the DB up.

Also, I suggest investing heavily in grokkable and relevant runtime observability. (Don't just emit ad-hoc log lines; put some thought into which data and alerts are actually relevant.) Often you'll see a failure coming days before it causes a problem, and you won't need the app to self-heal.


👤 amelius
I just give GPT3 access to the bash prompt, so any problems that occur on my servers will be solved eventually.

👤 uniqueuid
That's really a lot of ground to cover, but two simple recommendations would be:

- idempotency / consider all state as a cost

- start off with some chaos monkey testing - e.g. establish automatic regular restarts/re-deploys from the start

Neither is a silver bullet, but both ensure that you won't have too much anxiety and can act robustly when something goes wrong. They also have a host of positive downstream effects, such as making it easier to set up new machines and/or scale.
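For the idempotency point, a minimal Go sketch of guarding a side-effecting handler with an idempotency key (the header name and in-memory store are illustrative; a real service would persist the keys):

    // idempotent.go: refuse to perform the same side effect twice for one key.
    package main

    import (
    	"log"
    	"net/http"
    	"sync"
    )

    var (
    	mu   sync.Mutex
    	seen = map[string]bool{} // in-memory only for the sketch
    )

    func handleCharge(w http.ResponseWriter, r *http.Request) {
    	key := r.Header.Get("Idempotency-Key")
    	mu.Lock()
    	alreadyDone := key != "" && seen[key]
    	seen[key] = true
    	mu.Unlock()

    	if alreadyDone {
    		w.WriteHeader(http.StatusOK) // retried request: report success, do nothing
    		return
    	}
    	// ... perform the side effect exactly once here ...
    	w.WriteHeader(http.StatusCreated)
    }

    func main() {
    	http.HandleFunc("/charge", handleCharge)
    	log.Fatal(http.ListenAndServe(":8080", nil))
    }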


👤 elitepleb
Consider OTP with its "let it crash" design. There's a nice Go port of it: https://github.com/ergo-services

👤 nickpsecurity
Here are some examples to learn from. Most are old enough that the patents have expired:

Why Do Computers Stop and What Can Be Done About It? (1985) https://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf

Their NonStop Architecture https://www.hpl.hp.com/techreports/tandem/TR-86.2.pdf

QNX's systems are ultra-reliable, too https://cseweb.ucsd.edu/~voelker/cse221/papers/qnx-paper92.p...

OpenVMS clusters had uptimes of years to decades, even across mixed CPU ISAs https://en.wikipedia.org/wiki/VMScluster

Systems that run forever and self-heal (Armstrong) https://www.youtube.com/watch?v=cNICGEwmXLU

Microreboots https://dslab.epfl.ch/pubs/microreboot.pdf

Minix 3: Microkernel-based, self-healing, POSIX OS https://www.youtube.com/watch?v=bx3KuE7UjGA


👤 iuafhiuah
Although these notes from Heroku are relatively old now, following all of the Twelve-Factor rules will get your code and associated pipelines most of the way to a place where self-healing is free by design.

https://12factor.net/


👤 theshrike79
Use managed services as much as possible: AWS Lambda, DynamoDB, S3, etc. They practically never go down, and if they do, most of the world is down too, so you're not alone.

The twelve-factor app methodology is a good starting point in general: https://12factor.net/

Crash early, and make sure your application can recover from crashes; at the very least, a crash caused by one client shouldn't affect any others.
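On keeping one client's crash from affecting others, a hedged Go sketch of per-request panic recovery (the stock net/http server already recovers handler panics so the process survives; an explicit middleware lets you also log and return a clean 500):

    // recovermw.go: wrap handlers so a panic in one request is logged and
    // answered with a 500 instead of propagating further.
    package main

    import (
    	"log"
    	"net/http"
    )

    func recoverMiddleware(next http.Handler) http.Handler {
    	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    		defer func() {
    			if err := recover(); err != nil {
    				log.Printf("panic serving %s: %v", r.URL.Path, err)
    				http.Error(w, "internal error", http.StatusInternalServerError)
    			}
    		}()
    		next.ServeHTTP(w, r)
    	})
    }

    func main() {
    	mux := http.NewServeMux()
    	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
    		panic("boom") // simulate a per-request failure
    	})
    	log.Fatal(http.ListenAndServe(":8080", recoverMiddleware(mux)))
    }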


👤 forgotmypw17
I've been working on an application with such properties. Here are some patterns I have employed:

Break it up into many completely independent components. In my case, this is in the form of scripts which read from and write to queues. Each script reads from one queue and writes to another.

A lot of redundancy. From one queue to the next, there are at least two pathways which the data can take.

Lots of sanity checks. Wherever you are taking input, check that it has the expected format, shape, and content before processing it.

More redundancy. Write two versions of the same script in two different languages and make the system run them side by side and compare the outputs. If the outputs differ, there is a problem, and you should switch to the script which produces the correct expected output (and alert the operator.)

Avoid doing dangerous things. For example, querying the database using freeform strings is dangerous. So only query the database using sanity-checked identifiers which contain a predefined list of allowed characters which do not include quotes or anything else weird. Running scripts as a direct result of user request is dangerous, so serve only static HTML as much as possible. And so on.
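A small Go sketch of that whitelist-style identifier check (the allowed pattern and length limit are illustrative):

    // sanitycheck.go: reject suspicious identifiers before they ever reach a query.
    package main

    import (
    	"fmt"
    	"regexp"
    )

    // identOK allows only letters, digits, underscore, and hyphen, 1-64 chars.
    var identOK = regexp.MustCompile(`^[A-Za-z0-9_-]{1,64}$`)

    func lookup(id string) error {
    	if !identOK.MatchString(id) {
    		return fmt.Errorf("rejecting suspicious identifier %q", id)
    	}
    	// safe to pass id to a parameterized query from here on
    	return nil
    }

    func main() {
    	for _, id := range []string{"user_42", "bad;id'--"} {
    		if err := lookup(id); err != nil {
    			fmt.Println(err)
    		} else {
    			fmt.Println("ok:", id)
    		}
    	}
    }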


👤 throwawaaarrgh
1. Use managed cloud based APIs. S3 for object store, k/v distributed store for more high performance/smaller records, authenticating APIs (from any cloud vendor, like Auth0), etc.

2. Use a managed cloud orchestration system. Autoscaling groups, AWS ECS Fargate, App runner frameworks, Serverless, Managed K8s.

3. Run operations from chatops, gitops, or a web UI. By making operations work over a remotely accessible communications tool, you can make changes from anywhere, anytime, and never have to deal with local environment setup or resource constraints.

4. Do development in the cloud (Cloud Shell, Codespaces, DevSpace, etc). Same rationale as #3.

5. Make everything as immutable as possible. If something fails, throw it away and replace it with a known-good artifact.

If you can't run it from an API and pay someone to manage it, it's a waste of your time and money and not reliable. Don't be your own car mechanic for your business; lease a truck. If it ever takes off, you can easily give others access. And you can use this pattern for anything.


👤 tommiegannert
First step is redundancy: having backups, failover, overprovisioning. Essentially prepared "plan Bs".

Next step is introspection: aggregate monitoring and enough detail to figure out if there are issues.

Next step is being notified when things break. I.e. anomaly detection and alerting.

Then, debuggability. Enough detail to solve issues. Disaster recovery testing is part of ensuring you actually have this, and not just believe you do.

Aside from that, there's CI/CD, automated scaling, automated isolation of bad actors. There are so many things one could do, but this also depends on how large the team is. I'll argue that this type of automation isn't that important if it's just one person.

The SRE book(s) [1] contain many of these high-level ideas. Don't try to do them all at once. :) (Bias: Niall, one of the editors, was my manager when I joined Google SRE.)

[1] https://sre.google/books/


👤 bheadmaster
Crash-only software, defined as software that is built from the ground up to expect crashing, is very useful for stability.

If you write software from the beginning as if the only way to exit it is with SIGKILL, and if you make the application crash itself on any sign of a fatal error, you get a reliable system.


👤 captn3m0
Understand and document your termination and startup signals. Any self-healing setup must:

1. Identify the state of your app (starting, running, ready, etc).

2. Stop traffic once the server goes down.

Very often you'll have state confusion (SIGTERM triggered, but the server kept accepting requests). Make sure your signal handling works well across both your ingress and the server.

A nice and easy hack is to have a /status endpoint in all of your apps that returns:

1. the current commit deployed

2. the availability of dependencies (db reachable? db connected? any missing environment variables?).

3. Which instance/pod/server is serving this request. (Just returning hostname typically works)
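A hedged Go sketch of such a /status endpoint (the GIT_COMMIT variable, field names, and dependency checks are illustrative):

    // status.go: report deployed commit, dependency health, and serving host.
    package main

    import (
    	"database/sql"
    	"encoding/json"
    	"log"
    	"net/http"
    	"os"
    )

    var (
    	commit = os.Getenv("GIT_COMMIT") // assumed to be injected at deploy time
    	db     *sql.DB                   // assumed to be opened elsewhere; nil in this sketch
    )

    func statusHandler(w http.ResponseWriter, r *http.Request) {
    	host, _ := os.Hostname()
    	status := map[string]any{
    		"commit":   commit,
    		"hostname": host,
    		"db_ok":    db != nil && db.Ping() == nil,
    	}
    	w.Header().Set("Content-Type", "application/json")
    	json.NewEncoder(w).Encode(status)
    }

    func main() {
    	http.HandleFunc("/status", statusHandler)
    	log.Fatal(http.ListenAndServe(":8080", nil))
    }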


👤 gwbas1c
A few patterns I used in Syncplicity (a major Dropbox/OneDrive competitor).

At a high level, every operation had a try-catch that caught all exceptions. (This is similar to panic/recover in Go, but the semantics are very different.) A lot of operations fail on edge cases that you can never fully anticipate. (We had to deal with oddball network errors that wouldn't reproduce in our development/test environment, oddball errors caused by 3rd-party applications that we didn't have, etc., etc.) It's important to have a good failure model...

... Which comes to exponential retry: Basically, for operations that could fail on corner cases, but essentially "had" to work, we'd retry with an exponentially-increasing delay. First retry after 1 second, then 2 seconds, then 4 seconds... Eventually we'd cap the delay at 15 minutes. This is important because sometimes a bug or other unpredictable situation will prevent an operation from completing, but we don't want to spam a server or gobble up CPU.
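In Go, that capped exponential retry might look roughly like this (the 1-second start and 15-minute cap mirror the description above; the rest is illustrative):

    // retry.go: retry an operation with exponentially increasing, capped delays.
    package main

    import (
    	"fmt"
    	"log"
    	"time"
    )

    func retryWithBackoff(op func() error) {
    	delay := time.Second
    	const maxDelay = 15 * time.Minute
    	for {
    		err := op()
    		if err == nil {
    			return
    		}
    		log.Printf("operation failed, retrying in %v: %v", delay, err)
    		time.Sleep(delay)
    		delay *= 2
    		if delay > maxDelay {
    			delay = maxDelay
    		}
    	}
    }

    func main() {
    	attempts := 0
    	retryWithBackoff(func() error {
    		attempts++
    		if attempts < 3 {
    			return fmt.Errorf("transient failure %d", attempts)
    		}
    		return nil
    	})
    	log.Println("succeeded after", attempts, "attempts")
    }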

Try to make almost all operations transactional. (Either they succeed or fail, but they never happen in an incomplete / corrupted manner.) You can get that "for free" when you use a SQL database: We used SQLite for local state and almost exclusively used SQL in the server. For files that stored human-readable (QA, debugging) XML/JSON, we wrote them in a transactional manner: Rename the existing file, write the new version, delete the old version. When reading, if the old version existed, discard the new version and read the old one. We also implemented transactional memory techniques so that code wouldn't see failed operations.

Finally: concurrency (threading) bugs are very hard to find, because they tend to pop up randomly and aren't easily reproducible. The best way to do concurrency is to not do it at all. If you can make your whole application single-threaded and queue operations, you won't have concurrency bugs. If you have to do concurrency, make sure you understand techniques like immutability, read/write locking, lock ordering, and reducing the total number of things you need to lock on. Techniques like compare-exchange allow you to write multithreaded code that doesn't lock/deadlock. Immutability allows you to have non-blocking readers, if readers can tolerate stale state.


👤 kkirsche
I think it’s important to review well established frameworks for fault tolerance. Things like Erlang’s OTP come to mind which demonstrates concepts like supervision trees.

I don't have a great catalogue of these frameworks or patterns, but my approach has been to learn from the characteristics of historically successful systems whose capabilities I want to emulate (when I can't use the framework directly).


👤 bratao
One thing I learned from high-frequency trading programming is to restart the servers/software daily, where possible.

👤 adql
systemd unit with

    Restart=always
    RestartSec=30
is enough for 99% of small apps.

If you can get away with VPS(es) and a cloud-hosted DB, that will be by far the simplest solution to manage. Or "serverless", if the service is small enough and doesn't need to be persistently running.

Once it starts bringing in actual money, you might then start thinking about Docker, k8s and other fancy stuff.


👤 diamondap
jasode's answer is spot on. At work, we run a digital preservation system that ingests large file sets into S3, Glacier and Wasabi. The files pass through a pipeline of microservices to verify integrity, identify file formats, and extract other metadata.

We use AWS Elastic Container Service (ECS) to ensure all of the services are running and to scale when necessary. We use NSQ to make sure tasks are sent and re-sent to workers until they complete the entire pipeline. And we use Redis to store interim processing data from each worker.

Keeping that interim data (state) in a place where all workers can access it is key. ECS kills workers when it wants, and workers occasionally die due to out-of-memory exceptions. With the state info stored in Redis, a new worker can pick up any failed task where the last worker left off.

This system has been running well in production under heavy load. It replaces an older server-based system that used different technologies to handle the same responsibilities: supervisord to keep processes running and BoltDB to store interim processing data. That system worked, but could not scale horizontally because BoltDB is a local, disk-based store. Distributed workers need a shared, network-accessible store to share state info.

You'll find a detailed overview with diagrams of the new system at https://aptrust.github.io/preserv-docs/overview/

There's a shorter write-up of the goals and how we achieved them at https://sheprador.com/2022/12/architecting-for-the-cloud/

This stuff isn't too hard, as long as you get the pieces right. Try to stick to fredley's advice: keep things simple and you'll save yourself many headaches going forward. Also, be sure your workers handle SIGTERM explicitly (SIGKILL can't be caught), cleaning up their work as much as possible before they die.
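A minimal Go sketch of that SIGTERM handling (the work loop and cleanup step are placeholders):

    // shutdown.go: finish or checkpoint in-flight work when termination is requested.
    package main

    import (
    	"context"
    	"log"
    	"os/signal"
    	"syscall"
    	"time"
    )

    func main() {
    	// ctx is cancelled when SIGTERM or SIGINT arrives (SIGKILL cannot be caught).
    	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
    	defer stop()

    	for {
    		select {
    		case <-ctx.Done():
    			log.Println("termination requested; checkpointing and exiting")
    			// ... requeue or checkpoint in-flight work here ...
    			return
    		case <-time.After(time.Second):
    			// ... pull and process the next task ...
    		}
    	}
    }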