Do you know a good MySQL replication *failover* tutorial?

Question

Fellow HNers,I&rsquo;m finding lots of tutorials on setting up MySQL replication (the Digital Ocean ones are particularly great), but I&rsquo;m struggling to find good failover tutorials.So for example you&rsquo;ve got basic Source/Replica replication set up, but what&rsquo;s the best practise to fail over to said replica when the source dies?I have a rough plan that I&rsquo;ve worked out but I&rsquo;d love to see some meaty tutorials that cover this topic and demonstrate it in practise.Any links (or google schooling for my admittedly old brain) would be appreciated.Thanks!

eklitzke · Accepted Answer

I don't have any links but the way I would recommend doing this in production is a replication topology like A -> B -> {C, D, E, ...}. In this notation A is the master database, B is a replica of A, and {C, D, E, ...} are all replicas of B.
A and B need to have identical hardware specs (but the replicas can have a different spec if that makes sense). You won't issue any queries to B in normal operation, only possibly during failover. This is important because it lets you know that when a failover happens B will actually be able to handle the load that A was previously handling.
If A fails then you just change your app configs/proxies/whatever to use B as the master instead of A. Depending on how you have this set up you may not even need to run any commands on B, you can have it set up to just accept writes, so all you need to do for failover in this scenario is update your app/routing configs to use B as the master instead of A.
If B fails then things are still quite easy: you just need to update all the replicas {C, D, E, ...} to use A as their master. In this scenario you're not changing anything in your configs/proxies, you're just running some commands on the replicas to change which host they replicate from. This is always safe because A is by definition always ahead of B, so there are no replication race conditions. There's a small risk here that the additional load on A (because it's now replicating to N downstream hosts instead of just one) will cause problems, but usually replication overhead is low so this won't be an issue if you just make sure you've provisioned things so that A always has a bit of excess bandwidth/CPU capacity in case of a failover (having some excess capacity is just good common sense anyway).
If you have a LOT of database hosts (say, 10+) then in this design pattern at some point you're going to have a problem where there are so many replicas replicating from B that it will struggle to keep up with the bandwidth/load from replication. If you get to this point you can have some more complicated tiered fanout architecture where you have replicas of replicas.
Definitely make sure you practice failover from time to time (probably once when you set things up, and then once a year or so after that). I would recommend using some kind of staging cluster or docker/vm environment to make sure you're confident in running all the commands and have playbooks before you implement this in production.