HACKER Q&A
📣 mstaoru

Checklist for highly available bare metal deploy?


We're a small startup in Shanghai, China, currently in stealth mode. Our core algorithm is fairly compute-intensive and better suited to CPU.

Cloud here is not cheap (comparatively), e.g. a measly on-demand 1 core 1G RAM instance could run up to $23/month + traffic + storage. Prepaid is ~$10.

However, we're able to buy (and own) bare metal: 2 x 2893v3 (24 cores), 256G ECC RAM, and 5 x 300G 15K drives behind a battery-backed hardware RAID controller, for about $1500. Another $60/month covers 1U colocation with unlimited power and an unmetered 10M link (bandwidth is crazy expensive in China; a dedicated 100M link can cost up to an additional $1500/month). I'd say 10 of these should be enough at our scale.

The load is a pretty typical web & database flow, with some heavy NumPy/SciPy spikes from time to time.

Let's assume that the engineer labor cost is zero.

Could HN recommend some learning resources for minimizing pain with this setup?


  👤 zzzcpan Accepted Answer ✓
Probably the biggest thing is to separate edge servers from backend servers, running DNS and reverse HTTP(S) proxies (a.k.a. load balancers) on the edge servers. That lets you get high availability on the internet-facing layer and everything below it. Use DNS routing and DNS failover to steer clients to specific edge servers and away from unavailable ones. The more edge servers you have, from more independent hosting providers, the higher availability you can get; they can be cheap VPSes.

Failover between backends can then be handled on the edge servers, at the reverse-proxy layer. Backend servers won't even need their own public IP addresses: they can tunnel to the edge servers and be located anywhere. You do need to handle connectivity problems, though. Start with a fail-fast approach, with a very low connect timeout and low timeouts in general, so you can switch between backends quickly and seamlessly.

I don't know of any off-the-shelf solution that handles connectivity problems well, so at some point you will have to write your own tunneling proxy. Until then you can survive on nginx: it can mark backends as failed and avoid sending requests to them for a configurable amount of time (see the sketch below).

Start with at least 3 edge nodes from different hosting providers and 2 locations for the backends; a single backend location can work if you can get 2 independent ISPs there.
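Roughly, the two layers could look something like this. Everything below is a sketch; the names, addresses, and timeout values are made up for illustration, not a drop-in config.

DNS failover is basically several A records with a low TTL, so a dead edge node can be pulled out of rotation quickly (203.0.113.x addresses are placeholders):

    example.com.  60  IN  A  203.0.113.10   ; edge node 1, provider A
    example.com.  60  IN  A  203.0.113.20   ; edge node 2, provider B
    example.com.  60  IN  A  203.0.113.30   ; edge node 3, provider C

On each edge node, an nginx reverse proxy along these lines marks a backend as failed after repeated errors and skips it for a while, with low timeouts so it fails fast and retries another backend:

    # hypothetical edge-node nginx config; backend addresses are placeholders
    upstream app_backends {
        server 10.0.0.11:8080 max_fails=2 fail_timeout=10s;
        server 10.0.0.12:8080 max_fails=2 fail_timeout=10s;
    }

    server {
        listen 80;                       # TLS termination would also go here
        location / {
            proxy_pass http://app_backends;
            proxy_connect_timeout 1s;    # fail fast on an unreachable backend
            proxy_read_timeout 5s;
            proxy_next_upstream error timeout http_502 http_503;
        }
    }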

👤 quaquaqua1
Before sending anything to the cloud, I test it on a few different machines I have sitting in my office.

Then I sandbox it on some free tier cloud for a sanity check.

If you do send it to production in the cloud, be ready to pull the plug if it goes haywire and starts running up a crazy bill.

In China, electricity, parts, and even land can be cheap compared to the USA, right? So I think that explains the lack of cloud providers.

Good luck!