Recovering MySQL Master-Master Replication

MySQL Master-Master replication is a common practice and is implemented by having the auto-increment on primary keys increase by n, where n is the number of master servers. For example (in my.cnf):

auto-increment-increment = 2
auto-increment-offset = 1
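
For a two-master pair, each server needs a distinct offset so the keys they generate interleave rather than collide. A minimal sketch of the relevant my.cnf lines on each side:

# master 1 generates keys 1, 3, 5, ...
auto-increment-increment = 2
auto-increment-offset = 1

# master 2 generates keys 2, 4, 6, ...
auto-increment-increment = 2
auto-increment-offset = 2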

This article is not about implementing this, but rather about recovering when it fails. A word of caution: this form of master-master replication is little more than a useful hack that tends to work. It is typically used to implement a hot stand-by master server along with a VRRP-like protocol on the database IP. If you implement this with a high volume of writes, or with the expectation of writing to both masters without the application knowing about the setup, you can expect a world of pain!

It’s also essential that you use Nagios (or another tool) to monitor the replication on all masters so you know when an issue crops up.
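
If you're rolling your own check rather than using a Nagios plugin, something along these lines (a sketch; the host name and monitor credentials are assumptions) inspects the replica threads and the lag on each master:

$ mysql -h master1 -u monitor -pmonitorpass -e 'SHOW REPLICA STATUS\G' \
| grep -E 'Replica_IO_Running|Replica_SQL_Running|Seconds_Behind_Source'

Both _Running fields should read Yes and Seconds_Behind_Source should stay small; alert on anything else.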

So, let’s assume we have two master servers and one has failed. We’ll call these the Good Server (GS) and the Bad Server (BS). It may be the case that replication has failed on both and then you’ll have the nightmare of deciding which to choose as the GS!

1) You will need the BS to not process any queries from here on. This may already be the case in an FHRP (e.g. VRRP) environment; but if not, use some combination of stopping services, firewall rules, etc. to block access to the BS. It is essential that the BS does not process any queries besides our own during this process.
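
One way to do this at the network layer is a sketch like the following, assuming iptables; 192.0.2.10 and 192.0.2.20 are hypothetical addresses standing in for your admin workstation and the GS (which still needs to reach the BS for step 8 onwards):

# on the BS: allow MySQL only from the admin host and the GS, drop everyone else
iptables -I INPUT 1 -p tcp --dport 3306 -s 192.0.2.10 -j ACCEPT
iptables -I INPUT 2 -p tcp --dport 3306 -s 192.0.2.20 -j ACCEPT
iptables -I INPUT 3 -p tcp --dport 3306 -j DROP

Remember to remove these rules in step 11 when the BS goes back into production.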

2) On the BS, execute STOP REPLICA to prevent it replicating from the GS during the process.

3) On the GS, execute:

STOP REPLICA; (to stop it taking replication updates from the BS)
FLUSH TABLES WITH READ LOCK; (to stop it accepting writes for a moment)
SHOW MASTER STATUS; (and record the File and Position from the output)
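
The output will look something like this (the file name and position here are purely illustrative); the File and Position values are what you will feed into CHANGE REPLICATION SOURCE TO in step 6:

mysql> SHOW MASTER STATUS;
+------------------+----------+--------------+------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+------------------+----------+--------------+------------------+
| mysql-bin.000042 | 1337     |              |                  |
+------------------+----------+--------------+------------------+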

4) Switch to the BS and import all the data from the GS via something like:

$ mysqldump -h GS -u root -psoopersecret --all-databases --quick \
--lock-all-tables | mysql -h BS -u root -psoopersecret;

Note that I am assuming that you are replicating all databases here. Change as appropriate if not.
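
If you only replicate specific databases, the equivalent would be something like the following (app1 and app2 are hypothetical database names):

$ mysqldump -h GS -u root -psoopersecret --databases app1 app2 --quick \
--lock-all-tables | mysql -h BS -u root -psoopersecret;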

5) You can now switch back to the GS and execute UNLOCK TABLES to allow it to process queries again.

6) On the BS, set the master status with the information you recorded from the GS via:

CHANGE REPLICATION SOURCE TO SOURCE_LOG_FILE='mysql-bin.xxxxxx', SOURCE_LOG_POS=yy;

7) Then, again on the BS, execute START REPLICA. The BS should now be replicating from the GS again and you can verify this via SHOW REPLICA STATUS.
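
In the SHOW REPLICA STATUS output (easiest to read with the \G terminator), the fields to check look something like this (other fields trimmed):

mysql> SHOW REPLICA STATUS\G
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 0

Anything other than Yes/Yes means replication is not healthy; Last_IO_Error and Last_SQL_Error will tell you why.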

8) We now need to have the GS replicate from the BS again. On the BS, execute SHOW MASTER STATUS and record the information. Remember that we have stopped the execution of queries on the BS in step 1 above. This is essential.

9) On the GS, using the information just gathered from the BS, execute: 

CHANGE REPLICATION SOURCE TO SOURCE_LOG_FILE='mysql-bin.xxxxxx', SOURCE_LOG_POS=yy;

10) Then, on the GS, execute START REPLICA. You should now have two-way replication again and you can verify this via SHOW REPLICA STATUS on the GS.

11) If necessary, undo anything from step 1 above to put the BS back into production.

There is a --master-data switch for mysqldump which would remove the requirement to lock the GS above, but in our practical experience there are various failure modes for the BS, and the --master-data method does not work for them all.
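
For reference, that variant looks something like this (a sketch): with --master-data=1 the dump embeds a CHANGE MASTER TO statement carrying the binlog coordinates, so the import sets the replication position itself. Note that --single-transaction, which avoids the global read lock, is only safe when all the tables are InnoDB:

$ mysqldump -h GS -u root -psoopersecret --all-databases --quick \
--master-data=1 --single-transaction | mysql -h BS -u root -psoopersecret;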

New Replication Features in MySQL 5.6

I’ve just been reading an article on replication in MySQL 5.6 and there are quite a few cool new features that will vastly improve MySQL replication environments. Some of these include:

  • Optimised row based replication (documentation here) – replication used to mean executing the exact same queries on the slave server(s) as on the master, which, as you can imagine, was incredibly inefficient. With row based replication, the resulting changed row of an INSERT, UPDATE or DELETE operation is replicated instead, saving significant resources on slave servers. This was a big win. It has now been further optimised so that only the changed columns (plus a primary key) need to be replicated – see the sketch after this list.
  • Introduction of Global Transaction Identifiers (GTIDs), allowing the source and sequence of a replication event to be globally unique within a replication cluster. This, together with some cool new utilities (mysqlfailover and mysqlrpladmin), creates a lot of native options for failover during maintenance or failure (see more about GTIDs and the utilities).
  • Time delayed replication, allowing a slave to purposefully remain behind the master by any given delay. This may be a life saver for that table you accidentally dropped! A sketch of the syntax follows this list.
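
As a taste of the last two, here are minimal sketches in the older SLAVE/MASTER syntax that MySQL 5.6 used (the one-hour delay is just an example value):

# my.cnf on the master: row based replication, logging only changed columns plus a key
binlog-format = ROW
binlog-row-image = minimal

-- on a slave: hold the SQL thread one hour behind the master
STOP SLAVE SQL_THREAD;
CHANGE MASTER TO MASTER_DELAY = 3600;
START SLAVE SQL_THREAD;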

There’s a lot more too and you can read about it all here.