HACKER Q&A
📣 wierdstuff

Do I need a DR plan in AWS?


I work for a very large organization and we have many systems that live in a single AWS region, us-east-1. While we use multi-AZ RDS instances and EC2 instances across multiple AZs, we don't have regular backups to another AWS region. If us-east-1 were obliterated, we would lose a significant number of mission-critical systems. My questions:

1. Is it safe to rely on AWS for DR across multiple AZs, or must we plan for a whole region being obliterated?
2. Do you recommend any tools or practices that would help us adopt a good-enough DR strategy? It would be acceptable if getting the business back online took a week or a few.


  👤 davismwfl Accepted Answer ✓
Yes. The answer is always yes: you need a DR plan if you are running mission-critical systems. The level and detail of the plan just vary with the need.

The simplest DR plan for AWS would be to back up your data and images in reliable ways. For example, put the data in S3 (in a bucket outside your primary region, since buckets are regional) so that one region being out wouldn't stop you from spinning up images in another region and getting back online. Keep up-to-date AMIs in multiple regions, etc. If you can be down a few days, this is the easiest approach and is totally reasonable. Just remember that as your data grows, this approach can start getting unrealistic.
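
As a rough illustration of the "keep up-to-date AMIs in multiple regions" step, here is a minimal Python sketch. It assumes boto3's EC2 client `copy_image` call, issued against the destination region. The function name `copy_ami_to_dr_region`, the AMI ID and the region names are hypothetical placeholders, not anything from the original setup. The client is passed in as an argument so the logic can be tried without AWS credentials.

```python
from typing import Any


def copy_ami_to_dr_region(ec2_dr_client: Any, source_ami_id: str,
                          source_region: str, name: str) -> str:
    """Copy an AMI into the region that ec2_dr_client is bound to.

    ec2_dr_client is assumed to be a boto3 EC2 client created for the DR
    region, e.g. boto3.client("ec2", region_name="us-west-2").
    """
    response = ec2_dr_client.copy_image(
        Name=name,
        SourceImageId=source_ami_id,
        SourceRegion=source_region,
    )
    # copy_image returns the ID of the new AMI in the destination region.
    return response["ImageId"]
```

In practice something like this would run on a schedule after each image build, e.g. `copy_ami_to_dr_region(boto3.client("ec2", region_name="us-west-2"), "ami-0123456789abcdef0", "us-east-1", "web-dr-copy")`, with the AMI ID and regions swapped for your own.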

As the acceptable time from failure to operational shrinks, you need a more and more sophisticated plan, from simple hot standbys in other regions to complete multi-region duplication.

It really doesn't have to be complex at first. Start with a simple backup/recovery strategy and then add to it as your requirements demand.


👤 ggm
I often answer questions like this by posing two other questions:

1) What notional dollar value can you place on the loss of service and information inherent in this (in this case, AWS) deployment?

2) What % of that loss would you spend to limit exposure to that risk?

There is a third question:

Does your contract with Amazon include any penalties on AWS for service or data loss beyond "free credits", and does it explicitly deny you the right to sue?
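
The first two questions boil down to simple arithmetic, sketched below in Python. The function name `dr_budget` and the `annual_probability` parameter are assumptions added for illustration; the questions themselves only mention a loss value and a percentage of it.

```python
def dr_budget(loss_value: float, spend_fraction: float,
              annual_probability: float = 1.0) -> float:
    """Rough DR budget: (loss value) x (chance per year) x (share you would spend).

    annual_probability is an added assumption; leave it at 1.0 to apply the
    percentage directly to the notional loss.
    """
    if not 0.0 <= spend_fraction <= 1.0:
        raise ValueError("spend_fraction must be between 0 and 1")
    if not 0.0 <= annual_probability <= 1.0:
        raise ValueError("annual_probability must be between 0 and 1")
    return loss_value * annual_probability * spend_fraction
```

For example, with a hypothetical $4,000,000 loss and a willingness to spend 25% of it, `dr_budget(4_000_000, 0.25)` gives a ceiling of $1,000,000 for the DR effort.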


👤 QuinnyPig
It’s important to keep your résumé up to date. Should a region be obliterated, you’ll make 10x more overnight by changing jobs to work somewhere that has WAY bigger problems than your current company does.

👤 yuppie_scum
Infrastructure as Code to define compute resources. Persistent data backup to global S3.

Ultimately, if us-east-1 goes down the whole web will be fucked anyway, and your customers will be pretty forgiving about waiting the brief amount of time for it to come back (SLA notwithstanding).

Chances are your MTTR for moving regions will be longer than the wait for the outage to be resolved.


👤 tony-allan
DR is there to mitigate business risk. For a hobby project, a routine copy of your app and its data is fine.

If the app is essential to your business, then an active/active plan behind multiple load balancers might be the way to go.

The bigger the business impact, the higher the cost and the more challenging it is to set up, maintain and test.


👤 cloudking
I think if you want a really reliable DR plan, you should also consider a multi-cloud strategy. Being able to spin up a backup version of your service on another cloud (e.g. GCP or Azure) is going to be more reliable than placing all your plans within AWS.

👤 tarun_anand
I agree with the post on using multi-cloud. We recently did that and went multi-cloud and multi-region.

The primary has everything, but DR has just a hot standby of the database.


👤 rajacombinator
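Yes. Hire a consultant to advise your company. Key terms are RPO (recovery point objective, how much data you can afford to lose) and RTO (recovery time objective, how long you can afford to be down).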
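
To make RPO and RTO concrete, here is a minimal Python sketch that checks a backup plan against both targets. The `DrPlan` class, the `meets_objectives` function and all the numbers are hypothetical; it assumes worst-case data loss is roughly one full backup interval and that the restore time has actually been measured in a drill.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrPlan:
    backup_interval: timedelta  # how often data is copied out of the primary region
    restore_time: timedelta     # measured time to rebuild in the DR region


def meets_objectives(plan: DrPlan, rpo: timedelta, rto: timedelta) -> dict:
    """Compare a plan against recovery point and recovery time objectives."""
    return {
        # Worst case, everything since the last backup is lost.
        "rpo_ok": plan.backup_interval <= rpo,
        # The rebuild has to finish inside the tolerated downtime.
        "rto_ok": plan.restore_time <= rto,
    }


# Hypothetical: nightly cross-region backups and a three-day rebuild,
# checked against a 24-hour RPO and the one-week RTO from the question.
plan = DrPlan(backup_interval=timedelta(hours=24), restore_time=timedelta(days=3))
result = meets_objectives(plan, rpo=timedelta(hours=24), rto=timedelta(days=7))
```

With these numbers both checks pass. Tightening the RTO to a few hours would fail the second check, which is the point where hot standbys start to earn their cost.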