But it does (again) raise the question I'd rather not think about. What if something happens to me and there's another outage that I can't fix?
So - how do you make sure that your servers are up as a one-person founder? Can I pay someone to monitor my AWS deploy and make sure it's healthy?
* Redundancy. If you process background jobs, have multiple workers listening on the same queues (preferably in different regions or availability zones). Run multiple web servers and put them behind a load balancer. If you use AWS RDS or Heroku Postgres, use Multi-AZ deployment. Be mindful of your costs though, because they can skyrocket fast.
* Minimize moving parts (databases, servers, etc.). If possible, separate your marketing site from your web app. Prefer static sites over dynamic ones.
* Don't deploy within 2 hours of going to sleep (or leaving your desk); 2 hours is usually enough to spot a botched deploy.
* Try to use managed services as much as possible. As a solo founder, you probably have better things to focus on. As I mentioned before, keep an eye on your costs.
* Write unit/integration/system tests. Aim for good coverage, but don't beat yourself up for not hitting 100%.
* Monitor your infrastructure and set up alerts. Whenever my logs match a predefined regex pattern (e.g. "fatal" OR "exception" OR "error"), I get notified immediately. To be sure that alerts reach you, route them to multiple channels (email, SMS, Slack, etc.). Obviously, I'm biased here.
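If you'd rather roll the log-alerting yourself than use a hosted service, a minimal Python sketch of that kind of watcher looks something like this; the log path and Slack webhook URL are placeholders you'd replace with your own:

    import json
    import re
    import time
    import urllib.request

    LOG_FILE = "/var/log/myapp/production.log"              # placeholder: your app's log
    SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX"   # placeholder: your incoming webhook
    PATTERN = re.compile(r"fatal|exception|error", re.IGNORECASE)

    def notify(line):
        # Post the offending log line to Slack (add SMS/email as a second channel)
        req = urllib.request.Request(
            SLACK_WEBHOOK,
            data=json.dumps({"text": "Log alert: " + line}).encode(),
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)

    with open(LOG_FILE) as f:
        f.seek(0, 2)                  # start tailing from the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)         # nothing new yet
                continue
            if PATTERN.search(line):
                notify(line.strip())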
I'm not gonna lie, these things make me anxious, even to this day (it used to be worse). I take my laptop everywhere I go and make sure that my phone is always charged.
Kids these days.
I had a RAM stick fry in one of the physical machines sitting in a colo an hour's drive away. Not die, just start flipping bits here and there, triggering the most bizarre alerts you can imagine. On the night of December 24th. Now, that was fun.
If you are a single founder - expect downtime and expect it to be stressful. Inhale, exhale, fix it, explain, apologize and then make changes to try and prevent it from happening again. Little by little, weak points will get fortified or eliminated and the risk of "incidents" will go down. There's no silver bullet, but with experience things become easier and less scary.
Billing issues. What happens if the credit card you use to pay for everything gets hijacked, and you're trapped with a blocked card trying to clean it up but your bank is taking their sweet time and won't give you another card until it's sorted? ALWAYS have a backup credit card.
DNS Registrar. There's a hard SPOF in the DNS, where your registrar essentially holds your domain name hostage. If your DNS gets hijacked, but your registrar is taking a few days to sort out who actually owns it, you're down hard. There's no mitigation for this one, except paying for a registrar with proper security processes. If you do 3FA anywhere, make it here.
AppStore. If your app gets banned, or a critical update blocked, what do you do? Building in a fallback URL (using a different domain name, with a different registrar) can help work around any backend issues. There's not much you can do for the frontend functionality, except using a webapp.
It can be worthwhile looking at risks and possible mitigations beyond just server and database issues, especially when it's just you.
Random things will go wrong that you can't predict. Boxes will die suddenly and without reason, even after months of working fine without changes, and always at the worst possible moment. Your system needs to be built to withstand that.
I'll take the opposite approach of everyone here and recommend against serverless, kubernetes, and Heroku/PAAS.
You are a solo founder. You should understand your infra from the ground up (note: not understand an API, or a config syntax, but how the underlying systems actually work in great detail). It needs to be simple conceptually for you to do that. If anything goes wrong, you need to be able to identify the cause and fix it quickly.
I've gone through this first-hand and know all the trade-offs. If you'd like, I'm happy to discuss architecture decisions on a call. Email is in my profile.
All my SaaS products run on a Windows server, with SQL Server as a database and ASP.NET on IIS running the public sites. You can probably come up with a lot of uncharitable things to say about those technologies, but "flimsy" and "fragile" likely aren't in the list.
As a result, when things go seriously wrong, the application pool will recycle itself and the site will spring back to life a few seconds later. Actual "downtime", of the sort that I learn about before it has fixed itself, might happen maybe once every couple of years. At least, I seem to remember it happening once or twice in the last 15 years of running this way.
There's a Staging box in the cage, spun up and ready to go at a moment's notice, in case that ever changes. But thus far it has led a very lonely life.
1) pagerduty.com or uptimerobot.com for remote monitoring to make sure your site(s) are up (and get alerts when they're not).
2) Datadog or New Relic if you want deeper monitoring (application performance, database performance, diagnostics/debugging).
3) Rollbar.com (site doesn't seem to respond) for site performance/errors.
4) Roll your own with Prometheus (https://prometheus.io/) or Nagios (https://www.nagios.org/)/Icinga. Or... strangely - I still use MRTG for a few perf monitoring things: https://oss.oetiker.ch/mrtg/
5) If you want to monitor the status of deploys/builds - I love integrating CI/CD systems with Slack - very helpful.
Hope that helps - I've spent a lot of my career monitoring things, and have this mantra that I need to know about services being down before customers call to tell me.
(a lot of these have free tiers)
Honestly, the answer is learning how to manage anxiety and stress, particularly doing potentially destructive things under pressure. I think the psychological aspects of this are much more difficult than the technical ones.
If it helps, people are generally very understanding if you explain that you are a solo founder, and take reasonable steps to fix issues in a timely way. Most customers assume every company is a faceless organization; their attitude is much more forgiving when they learn they're dealing with a fellow person.
You cannot be on call 24/7 forever. You will burn out. If you can't hire someone you trust to take over part of this burden, then you have to accept the risk of sometimes not being able to log in for N hours if there is an outage (because you're camping with your spouse, etc.)
For very high-stress situations (database crash, recovery from backup) working from a checklist that you have tested is very valuable.
Good luck to you, and I hope you found useful answers in this thread!
Before thinking about handing over management of the deployment, I would encourage you to think about what the root cause of the outage is and whether something in the app will create that situation again. I invested in setting up DataDog monitoring for all hosts with alerts on key resource metrics that were causing issues (CPU was biggest issue for me).
The other thing that's worked well for me is just keeping things simple. As a solo founder, time spent with customers is more valuable than time spent on infrastructure (assuming all is running well). It's a little dated, but I still think this is a good path to follow as you're building your customer base. A simple stack will let you spend more time learning how your product can help your customers best.
http://highscalability.com/blog/2016/1/11/a-beginners-guide-...
My system integrates with an external system, and that external system started sending unexpected data my system wasn't able to handle, because I never expected it and so never thought to test for it. The issue was that I was inserting IDs into a uuid database field, but this new data had non-uuid IDs. The original IDs were always generated by me, so I could guarantee the data was correct; this new data was not generated by me. Of course, sufficient defensive programming would have avoided this, since the database error shouldn't have prevented other things from working, but my point is that mistakes get made (we're human after all) and things do get overlooked.
The problem is, restarting my service doesn't prevent this external data from getting received again, so it would simply break again as soon as more is sent and the system would be in this endless reboot loop until a human fixes the root cause.
That's a problem that I worry about, no matter how hard I try to make my system auto-healing and resilient (I don't know of any way to fix it other than putting great care into programming defensively), but again, we're human, so something will always slip through eventually...
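For what it's worth, the defensive check in this particular case is cheap: validate externally supplied IDs before they ever reach the uuid column, and quarantine anything that doesn't parse instead of letting the insert blow up. A minimal sketch, where ingest/save_to_db are hypothetical names standing in for your own insert path:

    import logging
    import uuid

    def to_uuid_or_none(raw_id):
        # Validate an externally supplied ID before it ever reaches the uuid column
        try:
            return uuid.UUID(str(raw_id))
        except ValueError:
            return None

    def save_to_db(parsed_id, record):
        ...  # stand-in for your existing insert path

    def ingest(record):
        parsed = to_uuid_or_none(record.get("id"))
        if parsed is None:
            # Quarantine/skip the bad record instead of letting the error
            # take unrelated processing down with it
            logging.warning("Dropping record with non-uuid id: %r", record.get("id"))
            return
        save_to_db(parsed, record)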
Some people are suggesting out-sourcing an on-call person. That seems to me like the only way around this particular case. (The other suggestions can still be used to reduce the number of times this person gets paged, though.)
I assume also you want a simple way to increase reliability while keeping costs within reasonable limits.
Well, AWS can give you all that if you don't want to go super fancy. Check out Elastic Beanstalk to get something simple and reliable. Monitor using CloudWatch. Make sure to leverage redundancy options (multi-AZ, multi-region if it's worth it, etc.). These are some general tips, but with the information you provide that's all I can say.
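To make the CloudWatch part concrete, here's roughly what a CPU alarm looks like when created with boto3; the instance id and SNS topic ARN are placeholders, and you can of course click the same thing together in the console instead:

    import boto3

    cloudwatch = boto3.client("cloudwatch")   # assumes AWS credentials are already configured

    # Alarm when average CPU on one instance stays above 80% for two 5-minute periods.
    cloudwatch.put_metric_alarm(
        AlarmName="web-1-high-cpu",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
        Statistic="Average",
        Period=300,
        EvaluationPeriods=2,
        Threshold=80.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],       # placeholder SNS topic
    )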
You can also pay a consultant to get a review of your setup and get some recommendations. It won’t be cheap but it depends how much you value your time and your product.
The question you should be asking is, how can I make my service automatically recover from this problem. It depends why exactly it crashed. If a simple restart fixes the problem, there are different ways you can automate this process, like Kubernetes or just writing scripts.
I’m happy to give more detailed advice if you would like, my email is in my profile.
The question is not whether your system will fail, the question is when.
Have proper monitoring and alerting in place.
But don't over engineer it, sometimes everything seems technically fine, but your support inbox will start getting user complaints.
Resolve the issue, figure out the root cause, make sure this or similar stuff won't happen, apologise to the affected users if necessary, and move on.
You'll learn waaay more failure modes of your application running in the wild, than just thinking about "what could go wrong".
It's a long game of becoming a better developer/devops guy, and not repeating the same mistakes in the future.
On the setup side, try your best to solve issues and use tried-and-true hardware, but things go down sometimes; even big sites like Google and Facebook go down. There is no silver bullet; you can only improve on your past mistakes.
Last, try to find some remote help on a contract basis; it's not that expensive and it can help alleviate a lot of your stress.
If it's truly critical to have no down time then you probably need to build that resilience in to your architecture.
I use a python monitoring script that tails logs watching for ALERT level log lines and constant order activity combined with a cron watchjob to ensure the process is alive during trading hours. The exception handler in the monitoring script sends alerts if the script itself dies.
If there are any issues I use twilio to text me the exception text/log line. I also use AWS SES to email myself but getting gmail to permanently not block SES is a pain in the ass. By design Twilio + AWS SES are the only external dependencies I have for the monitoring system (too bad SES sucks).
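For reference, a cron-driven "is the process alive" check plus a Twilio text is only a few lines of Python; the SID, token, numbers and process pattern below are placeholders, and this assumes the official twilio helper library. Run it from cron every minute during trading hours:

    import subprocess
    from twilio.rest import Client   # pip install twilio

    ACCOUNT_SID = "ACxxxxxxxxxxxxxxxx"       # placeholder
    AUTH_TOKEN = "your-auth-token"           # placeholder
    FROM_NUMBER = "+15550000000"             # placeholder
    TO_NUMBER = "+15551234567"               # placeholder
    PROCESS_PATTERN = "trading_engine"       # placeholder: whatever pgrep should find

    # pgrep exits non-zero when no process matches the pattern
    alive = subprocess.run(["pgrep", "-f", PROCESS_PATTERN],
                           capture_output=True).returncode == 0

    if not alive:
        Client(ACCOUNT_SID, AUTH_TOKEN).messages.create(
            to=TO_NUMBER, from_=FROM_NUMBER,
            body=PROCESS_PATTERN + " is not running")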
On my phone I have Termius SSH setup so I can log in and check/fix things. I have a bunch of short aliases in my .profile on the trading server to do the most common stuff so that I can type them easily from my phone.
I also do all my work through a compressed SSH tmux including editing and compiling code. So if things get hairy I can pair my phone with my laptop, attach to the tmux right where I left off, and fix things over even a 3G connection.
This compressed SSH trick is a huge quality of life improvement compared to previous finance jobs I've worked where they use Windows + Citrix/RDP just to launch a Putty session into a Linux machine. It's almost like finance IT has never actually had to fix anything while away from work.
It doesn't prevent an app-level outage (corrupted data in the database, bad architecture,...) but at least I don't have to worry about servers going down anymore.
As for the rest, unit & extensive integration tests along with continuous integration and linting. Oh, and a typed language. Moving from Javascript to Typescript was a blessing. But I still miss Swift.
- At least have a pool of 2 instances (ideally per service) running under an auto-scaler or a managed K8s (GKE is best) with an LB in front. You may also want to explore EBS and Google Cloud Run. If you can use them, use them!
- Uptime alerts. pingdom (or newrelic alerts) with pagerduty added.
- Health checks! The trick is to recover the failed container/pod/service before you get that pagerduty call. Ideally, if you have 2 of each service running, #2 will handle the requests until #1 is recreated. (A minimal health endpoint sketch follows this list.)
- Sentry + newrelic APM + infra: You should monitor all error stack traces, request throughput, avg response time. For infra, you mainly need to watch memory and CPU usage. Also on each downtime, you should have greater visibility at what caused it. You should set alerts on higher than normal memory usage so you can prevent the crash.
- Logs, your server logs should be stored somewhere (stackdriver on gcloud or cloudwatch on aws).
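The health endpoint itself can be trivial. Here's a stdlib-only Python sketch of the kind of /healthz a load balancer or K8s liveness probe would poll; the port is arbitrary and any real dependency checks are up to you:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/healthz":
                # Touch real dependencies here (DB ping, queue depth) before saying "ok"
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"ok")
            else:
                self.send_response(404)
                self.end_headers()

    # The LB / liveness probe polls this; a failed or hung response gets the
    # instance replaced before anyone pages you.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()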
These might sound overwhelming for a single person, but they are one-time efforts after which they are mostly automatic.
2. Pay for a Business Support plan. https://aws.amazon.com/premiumsupport/pricing/
3. Call business support about something "how do I restart my server" - so you know how to file a ticket, get a feel for how quick the response is and how it works.
Do not over-think this, e.g. with Terraform templates.
I've been in the game for a while and every time I run across an idea for a service, there's always a question of whether I'd be OK with sleeping with a pager, remoting to the servers at 4 am on Saturday and generally be slaved to the business. The answer, upon some reflection, is inevitably No. This is the domain of teams.
I'm in the same boat with my solo founder projects (links in profile).
The short answer: I'm married to my phone/laptop.
My test coverage is good. I use managed services when possible so I don't need to play sysadmin. I don't deploy before I leave for something (dinner, shower), and I have some pretty good redundancy across all my services. If one node goes down, I'm safe. If four go down (incredibly unlikely), well, fuck, at least my database was backed up and verified an hour ago.
I invested a large amount of time into admin-y stuff. My admin-y stuff is solid and I can tweak/configure/CRUD anything on the fly. I credit being able to relax to my admin-y stuff. Obviously, if shit really hits the fan with hardware or an OS bug, I need to get to my laptop. But over the last six years, I haven't had to do that yet, and hopefully I won't have to.
I've explored adding staff — mainly for day-to-day operations — but I like the idea of interfacing with my customers and I credit growing things to where I have because I'm in the trenches with them. Things haven't always gone smoothly, and my customers always let me know, but any issues are normally swiftly-resolved.
The scale of one of my products is non-trivial and has a ton of moving parts — some of which I'm in no control of and could change at any time and break _everything_. It sounds terrifying, and it is, but I've made a habit to check things before peak hours. If something's amiss, a quick fix is usually all it takes.
I have a few pieces of advice:
1. Make sure your service can safely fail and be restarted. What I mean is, if somebody is POST'ing data or making database changes, make sure you handle this safely and attempt some recovery. Something not being fully processed is okay as long as you are able to handle it.
2. Self-monitoring. I run all my systems inside a simple bash loop that just restarts them and pops me an email (i.e. "X restarted at Y" and then "X is failing to start" if it continues). A rough sketch of this kind of supervisor loop follows the list.
3. External monitoring via a machine at home that rolls the server back to a previous binary (also on the server). It also pulls various logs from the server, as well as the binaries, so they can be analyzed. Okay, it has some reduced functionality, but it's stable and will keep things going until the problem is fixed.
4. Make sure your service fails gracefully - i.e. returns a `{"status":"bad"}` string or something, or defaults to an "Under maintenance, please come back soon" page. Your service going down is one thing, but becoming completely unresponsive is quite another.
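For point 2, the parent describes a simple bash restart loop; here is roughly the same idea sketched in Python, assuming a local MTA on the box for the email and a placeholder command:

    import smtplib
    import subprocess
    import time
    from email.message import EmailMessage

    CMD = ["/usr/local/bin/myservice"]   # placeholder: whatever you normally run
    ALERT_TO = "you@example.com"         # placeholder

    def email(subject):
        msg = EmailMessage()
        msg["Subject"], msg["From"], msg["To"] = subject, "supervisor@localhost", ALERT_TO
        with smtplib.SMTP("localhost") as s:    # assumes a local MTA is running
            s.send_message(msg)

    while True:
        started = time.time()
        exit_code = subprocess.run(CMD).returncode
        if time.time() - started < 30:
            # Died almost immediately: it's probably not coming back on its own
            email(CMD[0] + " is failing to start (exit %d)" % exit_code)
            time.sleep(60)               # back off so you don't get an email per second
        else:
            email(CMD[0] + " restarted at " + time.ctime() + " (exit %d)" % exit_code)
        time.sleep(2)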
One thing I can't prepare for (which happens more than you think) is the server itself crashing, which as you say, means I'm randomly logging into a VPS console and rebooting. I use a bunch of different VPS providers and every one of them has a slightly different console.
You can do it at the OS level. On Windows, for example, you can use Event Viewer and assign a task to a specific type of log captured by the OS; that task can then invoke a small app that sends an email when an error log occurs, or something like that.
Application-specific issues:
you can manually capture exceptions raised within the app and send notifications
there are many clever ways to do this without hindering performance or polluting your code-base with exception handling
you can spawn "fire and forget" threads that send notifications ...
let me know if you need more ideas here
Integration tests:
given that you've built a strong suite of integration tests covering all the functionality of your app
you can have your integration tests run every 15 minutes or so and send notifications if tests fail
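A rough sketch of that idea, assuming a pytest-based suite and a local MTA for the notification; run it from cron every 15 minutes:

    import smtplib
    import subprocess
    from email.message import EmailMessage

    ALERT_TO = "you@example.com"   # placeholder

    # Run the integration suite; any non-zero exit code means something failed
    result = subprocess.run(["pytest", "tests/integration", "-q"],   # placeholder: your suite
                            capture_output=True, text=True)

    if result.returncode != 0:
        msg = EmailMessage()
        msg["Subject"] = "Integration tests failing"
        msg["From"] = "monitor@localhost"
        msg["To"] = ALERT_TO
        msg.set_content(result.stdout[-4000:])   # last chunk of test output
        with smtplib.SMTP("localhost") as s:     # assumes a local MTA
            s.send_message(msg)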
You can also use monitoring tools. I know Azure offers ways to help with this.
Reach out if you want more ideas or more specific solutions
Managed support will usually only monitor and fix basic infrastructure and respond to support requests from you. They often won't monitor or fix your applications/services; for that you can set up your own application monitoring and tests. NewRelic is a good all-in-one choice, but there are plenty more out there. To call you during an incident, you'd also adopt PagerDuty.
In order to avoid service outage in general, you want to hook up some kind of monitor to something that re-starts your services and infrastructure. This will only fix crashes; it won't fix issues like disks filling up, network outages, application bugs, too many hits to your service, etc.
You should be able to find small businesses who specialize in selling support contracts for all levels of support. By signing a contract and on-boarding a 24/7 support technician, you can get them to do basically whatever you need to be fixed when it goes down. I don't have suggestions for these, maybe someone else does (it used to be common for SMBs in the 2000's).
If I were you, I would use free monitoring services like uptimerobot. There are some other options available. Typically these services provide some basic functionality for free, it would be enough for a small enterprise.
On AWS it is quite easy to create your own external probes for a reasonable price. However, it would require some basic programming skills.
    #!/bin/bash
    # Run from cron; restart the node app if the site stops returning the expected page.
    thisHtml=`curl -s "[your site's web address]"`
    if [[ $thisHtml != *"[text you expect in the page]"* ]]; then
        # Server is down: kill node and start it again with forever
        ssh -i "[your pem file]" -t ec2-user@[ip address] 'sudo /bin/bash -c "killall -9 node"'
        ssh -i "[your pem file]" -t ec2-user@[ip address] 'sudo /bin/bash -c "export PATH="/root/.nvm/versions/node/v8.11.2/bin:/usr/lib/node:/usr/local/bin/node:/usr/bin/node:$PATH" && forever start /var/www/html/[...]/index.js"'
        # Log the restart time
        rebootDate=`TZ=":[your time zone]" date '+%Y-%m-%d %H:%M:%S'`
        echo "$rebootDate" >> "/home/ec2-user/serverMonitoring/devRestarts.txt"
    fi
Yes. There are consulting shops that will do this, as will many of the monitoring tools listed in the thread (though these tools will not fix the problem for you). Broadly speaking, there is a cost associated with this, as well as the cost associated with your downtime. If the cost of your downtime (reputational risk, SLA credits, etc) outweighs the cost of hiring someone to cut your MTTR to 5 minutes (assuming you can playbook out all of the relevant scenarios) + provides some value in stress reduction, then you should do this. If you've been doing this a while, you can math it out. In what experience I've had though, an outside person is unlikely to be able to fix an "unknown unknown", they just won't know your environment as well as you will.
All that said, one hour of service interruption a year is still better than most.
The idea that your server does not perform regular health checks or spin itself back up when it fails just seems weird to me now. I like being spoiled.
Ultimately you can engineer your systems, even if they are quite complex, to be manageable by a single person. It's not one thing though. It's years of experience and gut feel. It's also totally distinct from technology.
Some things that come to mind:
- use queues for background tasks that may need to be retried. If things go down and you have liberal retry policies, things should recover. (A small sketch of this pattern follows the list.)
- use boring databases. Just stay away from mongo and use something like rds which is proven and reliable.
- be careful in your code about what an error is. Log only things at the error level you need to look at.
- test driven development. Saves a ton of time.
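For the queue point above, here's roughly what the liberal-retry pattern looks like with SQS via boto3 (the queue URL and handle() are placeholders): a failed message is simply not deleted, so it reappears after the visibility timeout and gets retried.

    import boto3

    sqs = boto3.client("sqs")   # assumes AWS credentials are configured
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"   # placeholder

    def handle(body):
        ...  # your actual job logic

    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)    # long polling
        for msg in resp.get("Messages", []):
            try:
                handle(msg["Body"])
            except Exception:
                # Don't delete: the message becomes visible again after the
                # visibility timeout and gets retried (add a dead-letter queue
                # so poison messages don't loop forever).
                continue
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])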
It's all a matter of defining requirements, then solutions and the tradeoffs of those solutions, and then implementing it with best practices in mind (automation, testing, monitoring, backups, etc.).
Hit me up if you want to discuss it over a pint! :)
That's what we provide at Clever Cloud BTW https://www.clever-cloud.com/
I feel like we over-engineer that part. Sure, there are plenty of services where you don't want any downtime and it makes sense to over-engineer (like any monitoring service), but for many SaaS products, the worst that will happen is a few emails.
Maybe write a simple SLA, something with an 8-hour response time for these kinds of outages. If some clients require more, then sell them a better SLA at a higher cost. That should let you invest in better response times for sure.
No cloud machines, no hosted cloud services for production beyond DNS.
* 3 machines in separate data centers (equivalent of AWS AZs) for >= 30 EUR/month each. ECC RAM.
* These machines are /very/ reliable. Uptimes of > 300 days are common; reboots happen only for the relevant kernel updates.
* Triple-redundancy Postgres synchronous replication with automatic failover (using Stolon), CephFS as distributed file system. I claim this is the only state you need for most businesses at the beginning. Anything that's not state is easy to make redundant.
* Failure of 1 node can be tolerated, failure of 2 nodes means I go read-only.
* Almost all server code is in Haskell. 0 crash bugs in 4 years.
* DNS based failover using multi-A-response Route53 health checks. If a machine stops serving HTTP, it gets removed from DNS within 10 seconds.
* External monitoring: StatusCake that triggers Slack (vibrates my phone), and after short delay PagerDuty if something is down from the perspective of site visitors.
* Internal monitoring: Consul health checks with consul-alerts that monitor every internal service (each of the 3 Postgres, CephFS, web servers) and ping on Slack if one is down. This is to notice when the system falls into 2-redundancy which is not visible to site visitors.
* I regularly test that both forms of monitoring work and send alerts.
* Everything is configured declaratively with NixOS and deployed with NixOps. Config changes and rollbacks deploy within 5 seconds.
* In case of total disaster at Hetzner, the entire production infrastructure can be deployed to AWS within 15 minutes, using the same NixOps setup but with a different backend. All state is backed up regularly into 2 other countries.
* DB, CephFS and web servers are plain processes supervised by systemd. No Docker or other containers, which allows for easier debugging using strace etc. All systemd services are overridden to restart without systemd's default restart limit, to come back reliably after network failures or out-of-memory situations.
* No proprietary software or hosted services that I cannot debug.
* I set up PagerDuty on Android to override any phone silencing. If it triggers at night, I have to wake up. This motivated me to bring the system to zero alerts very quickly. In the beginning it was tough but I think it paid off given that now I get alerts only every couple months at worst.
* I investigate any downtime or surprising behaviour until a reason is found. "Tire kicking" restarts that magically fix things are not accepted. In the beginning that takes time but after a while you end up with very reliable systems without surprises.
Result: Zero observable downtime in recent years that wasn't caused by me deploying wrong configurations.
The total cost of this can be around 100 EUR/month, or 400 EUR/month if you want really beefy servers that have all of fast SSDs, large HDDs, and GPUs.
There are a few ways I'd like to improve this setup in the future, but it's enough for the current needs.
I still take my laptop everywhere to be safe, but didn't have to make use of that for a while.
You enable them to do it for you by creating HA infrastructure. Start by creating an autoscaling group that enforces a certain number of working application endpoints. You probably need an alb too. An app endpoint that fails healthcheck causes the asg to spin up another instance and auto-register with the alb. (You can snapshot your configured and working app endpoint as the base image).
It's just extra redundancy in case something like cloudwatch (which you should use -- with ELBs) also goes down.
It's not perfect, but it's (1) cheap (2) easy (3) quick (the mythical trifecta). It misses some of the issues due to high loads (but still technically available), but works perfectly when things actually crash (like queue workers deciding to turn off).
Docker, elastic beanstalk, SNS, and the hidden world of AWS instance performance are all a PITA. Oh yea, certs...
I'd welcome help as well.
I also had a lot of free AWS credits, so I migrated to AWS. I didn't want to write all my terraform templates from scratch, so I spent a lot of time looking for something that already existed, and I found Convox [2].
Convox provides an open source PaaS [3] that you can install into your own AWS account, and it works amazingly well. They use a lot of AWS services instead of re-inventing the wheel (CloudFormation, ECS, Fargate, EC2, S3). It also helps you provision any resources (S3 buckets, RDS, ElastiCache), and everything is set up with production-ready defaults.
I've been able to achieve 100% uptime for over 12 months, and I barely need to think about my infrastructure. There have even been a few failed deployments where I needed to manually go into CloudFormation and roll something back (which were totally my fault), but ECS keeps the old version running without any downtime. Convox is also rolling out support for EKS, so I'm planning to switch from ECS to Kubernetes in the near future (and Convox should make that completely painless, since they handle everything behind the scenes).
It also warns me if the CPU load goes up over 80%.
For the first two years of going live I had this hardwired to my Pebble via real-time mail, but now I know my platform is robust, so I can choose to worry about other things.
(though this is essentially a single-line comment, it's earnest, not intended to be sarcastic)
For monitoring, I am using Stackdriver which has easy-to-use health check.
It’s important for critical services, yet if you lose your 2FA device, like a phone, you will be locked out for a while. Like many things, it will happen at a bad time.
Also, send errors through chat platforms like Telegram so you're notified of any errors and can monitor the servers.
These are both reactive, but at least you'll know if things break.
Other options include using another service that offers 24/7 uptime. Obviously you pay more for that.
This approach is easy to implement and scale
https://www.mnxsolutions.com/services/linux-server-managemen...
I’d be happy to chat with anyone, even if to provide some feedback or a quick audit to help you avoid the next outage.
- nick at mnxsolutions com
Funny enough I was just talking to someone who passed all his AWS certifications and was looking for some AWS work.
So given that, just do the right things to prevent things going down and get to a reasonable level of comfort.
I recently shut down the infrastructure for my (failed) startup. Some parts of that had been up and running for close to four years. We had some incidents over the years of course but nothing that impacted our business.
Simple things you can do:
- CI & CD + deployment automation. This is an investment but having a reliable CI & CD pipeline means your deployments are automated and predictable. Easier if you do it from day 1.
- Have good tests. Sounds obvious but you can't do CD without good tests. Writing good tests is a good skill to have. Many startups just wing it here and if you don't get the funding to rewrite your software it may kill your startup.
- Have redundancy. I.e. two app servers instead of 1. Use availability zones. Have a sane DB that can survive a master outage.
- Have backups (verified ones) and a well tested procedure & plan for restoring those.
- Pick your favorite cloud provider and go for hosted solutions for infrastructure that you need rather than saving a few pennies hosting shit yourself on some cheap rack server. I.e. use Amazon RDS or equivalent and don't reinvent the wheels of configuring, deploying, monitoring, operating, and backing that up. Your time (even if you had some, which you don't) is worth more than the cost of several years of using that even if you only spend a few days on this. There's more to this stuff than apt-get install whatever and walking away.
- Make conservative/boring choices for infrastructure. I.e. use postgresql instead of some relatively obscure nosql thingy. They both might work. Postgresql is a lot less likely to not work and when that happens it's probably because of something you did. If you take risks with some parts, make a point of not taking risks with other parts. I.e. balance the risks.
- When stuff goes wrong, learn from it and don't let it happen again.
- Manage expectations for your users and customers. Don't promise them anything you can't deliver. Like 5 nines. When shit goes wrong be honest and open about it.
- Have a battle plan for when the worst happens. What do you do if some hacker gets into your system or your data-center gets taken out by a comet or some other freak accident? Who do you call? What do you do? How would you find out? Hope for the best but definitely plan for the worst. When your servers are down, improvising is likely to cause more problems.
Otherwise I'd suggest religiously documenting your outage root causes and contemplating hard what could've avoided that outcome.
Then lastly for monitoring on the cheap:
Sentry.io - alerts.
Opsgenie - on-call management.
Heroku+new relic - heartbeat & performance.
tl;dr: Keep your stack small and nimble and try to learn from past outages.
- add health check mechanisms
- if health check is broken => restart service
- if restarting the service doesn't help after X retries => redeploy previous state (if any available)
Try to use Kubernetes or Docker Swarm if possible, combined with Terraform