1. Is it safe to rely on AWS for DR across multiple AZs, or must we plan for a whole region being obliterated?
2. Do you recommend any tools or practices that would help us adopt a good-enough DR strategy? It would be acceptable if getting the business back online took a week or a few.
The simplest DR plan for AWS is to back up your data and images reliably. For example, put the data in S3 (in a bucket outside your primary region, or replicated to one) so that one region being out won't stop you from spinning up images in another region and getting back online. Keep up-to-date AMIs in multiple regions, etc. If you can be down a few days, this is the easiest approach and is totally reasonable. Just remember that as your data grows, this can start getting unrealistic.
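As a rough sketch of the "keep up-to-date AMIs in multiple regions" step, something like the following could work with boto3. The helper names, the region list, and the AMI ID in the usage note are made up for illustration; `copy_image` is the EC2 client call that copies an AMI into the region the client is bound to.

```python
def build_clients(regions):
    """Create one EC2 client per region (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the copy logic below can be tested without it
    return {region: boto3.client("ec2", region_name=region) for region in regions}


def copy_ami_to_regions(source_image_id, source_region, name, clients_by_region):
    """Copy one AMI into every other region; return {region: new_image_id}.

    clients_by_region maps a region name to an EC2 client bound to that region.
    """
    copied = {}
    for region, ec2 in clients_by_region.items():
        if region == source_region:
            continue  # the image already lives here
        response = ec2.copy_image(
            Name=name,
            SourceImageId=source_image_id,
            SourceRegion=source_region,
        )
        copied[region] = response["ImageId"]
    return copied
```

Usage would look roughly like `copy_ami_to_regions("ami-0123456789abcdef0", "us-east-1", "app-server", build_clients(["us-east-1", "us-west-2"]))`, run on whatever schedule you rebuild images. The copy is asynchronous, so the new image takes a while to become available in the destination region.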
As the acceptable time from failure back to operational shrinks, you need a more and more sophisticated plan: anything from simple hot standbys in other regions to complete multi-region duplication.
It really doesn't have to be complex at first. Start with a simple backup/recovery strategy and then add to it as your requirements demand.
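To make the "shorter recovery time means a fancier plan" idea concrete, here is a minimal sketch that maps an acceptable recovery time to one of the approaches above. The cut-off values are purely hypothetical; pick your own based on what downtime actually costs you.

```python
from datetime import timedelta

# Hypothetical thresholds: (minimum acceptable downtime, cheapest plan that fits).
TIERS = [
    (timedelta(days=3), "backup and restore (data in S3, AMIs in other regions)"),
    (timedelta(hours=1), "hot standby in another region"),
]


def pick_strategy(acceptable_downtime):
    """Return the simplest plan whose recovery time fits the acceptable downtime."""
    for minimum_downtime, strategy in TIERS:
        if acceptable_downtime >= minimum_downtime:
            return strategy
    # Anything tighter than the last tier needs full duplication.
    return "complete multi-region duplication"


print(pick_strategy(timedelta(weeks=1)))
print(pick_strategy(timedelta(hours=4)))
print(pick_strategy(timedelta(minutes=5)))
```

With "a week or a few" as the target, this lands on plain backup and restore, which matches the advice to start simple.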
1) What notional dollar value can you place on the loss of service and information if this deployment (on AWS, in this case) goes down?
2) What % of that loss would you spend to limit your exposure to that risk?
There is a third question:
Does your contract with Amazon include any penalties on AWS for service or data loss beyond "free credits", and does it explicitly deny you the right to sue?
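The first two questions boil down to simple arithmetic. A minimal sketch, with all figures invented for illustration (the function name and the per-day loss model are assumptions, not something from your contract or from AWS):

```python
def dr_budget_ceiling(loss_per_day, days_down, spend_percent):
    """Return (notional loss, most you would spend to limit that loss)."""
    if not 0 <= spend_percent <= 100:
        raise ValueError("spend_percent must be between 0 and 100")
    notional_loss = loss_per_day * days_down
    return notional_loss, notional_loss * spend_percent / 100


# Hypothetical numbers: $20k lost per day, a week offline, willing to spend 5%.
loss, budget = dr_budget_ceiling(loss_per_day=20_000, days_down=7, spend_percent=5)
print(f"notional loss: ${loss:,.0f}, DR budget ceiling: ${budget:,.0f}")
```

If the cheapest plan that meets your recovery target costs more than that ceiling, either the target or the percentage needs revisiting.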
Ultimately, if us-east-1 goes down the whole web will be fucked anyway, and your customers will be pretty forgiving about waiting the brief amount of time it takes to come back (SLA notwithstanding).
Chances are your MTTR for moving regions will be longer than the time it takes for the outage to be resolved.
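Put as a quick check, with made-up numbers (the function name and both estimates are hypothetical; you would plug in your own rehearsed failover time and your best guess at the outage length):

```python
def should_move_regions(expected_outage_hours, region_move_mttr_hours):
    """Moving only pays off if it gets you back sooner than waiting would."""
    return region_move_mttr_hours < expected_outage_hours


# Hypothetical: outage expected to last 6 hours, a region move takes you 2 days.
print(should_move_regions(expected_outage_hours=6, region_move_mttr_hours=48))
```

The catch is that you rarely know the outage length up front, so the MTTR side is the number worth measuring in a drill.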
If the app is essential to your business, then an active/active plan behind multiple load balancers might be the way to go.
The bigger the business impact, the higher the cost and the more challenging it is to set up, maintain and test.
Primary has everything, but DR has just a hot standby of the database.