MySQL Admin Cookbook LITE: Replication and Indexing

Skipping problematic queries

There are occasions when something goes wrong and a problem prevents one or more slave servers from updating. There can be several reasons for this, but most often some sort of discrepancy between the master's and the slave's data sets will cause a statement to fail on the slave even though it was executed properly on the master (otherwise it would not have made it to the binlog).

This is where the basic assumption of master and slave being equal becomes a little too simple. It can lead to a potentially long series of statements executing on the slave, but against a different set of data than the master has. Depending on how long this goes undetected, the master and slave can silently drift further and further out of sync, until eventually a statement cannot be executed successfully on the slave, for example because a foreign key constraint fails there.

Fortunately, not every problem stems from such a serious error, which can often only be repaired by resetting the affected slaves to a known good state.

Often a slave stops replicating because a record to be inserted is already present, resulting in a key uniqueness violation. This is especially likely when (accidentally or on purpose) you are working on both the master and the slaves, modifying data on both sides, perhaps even in an attempt to fix a replication problem.

In this recipe, we will show you how to skip one or more problematic queries—meaning instructions replicated from the master that will not execute correctly on the slave machine.

Getting ready

We will demonstrate the skipping of problematic queries in a contrived error scenario. To try this for yourself, you will need two MySQL servers set up as master and slave that are currently in sync. As an example, we will use the sakila sample database to demonstrate a record INSERT that fails on the slave because the record was previously inserted there manually by accident.
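
Before deliberately provoking the error, it is a good idea to confirm that replication is actually running cleanly, so you are not chasing a pre-existing problem. A quick check on the slave is enough; this is just the standard status command:

    slave> SHOW SLAVE STATUS \G

Both Slave_IO_Running and Slave_SQL_Running should report Yes, and Seconds_Behind_Master should be 0.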

How to do it...

  1. Connect to the master using a MySQL client. Make sakila the default database.
  2. With a second client, connect to the slave. Make sakila the default schema here as well.
  3. On the slave, enter the following command to insert a new category:
    slave> INSERT INTO category (name) VALUES ('Inserted On Slave');
    
    In this case, the category_id column was automatically set because it is defined as auto-incrementing. At this point, the master and the slave are already out of sync because this record does not exist on the master.

  4. On the master, insert a new record as well:
    master> INSERT INTO category (name) VALUES ('Inserted On Master');
    
    You can see that the master also picked 17 as the category_id. It has been written to the binlog and has by now probably been replicated to the slave.

  5. Have a look at the replication status on the slave:
    slave> SHOW SLAVE STATUS \G
    *************************** 1. row ***************************
    ...
    Slave_IO_Running: Yes
    Slave_SQL_Running: No
    ...
    Seconds_Behind_Master: NULL
    

    You can see that the replicated insert has failed. No more statements replicated from the master will be executed (Slave_SQL_Running: No).

  6. Repair the damage by making sure the slave records are identical:
    slave> UPDATE category SET name='Inserted On Master' WHERE category_id=17;
    

    Now the data on master and slave is identical again; a simple query to verify this is shown right after these steps.

  7. Tell the slave that you want to skip the (one) INSERT statement that came from the master and that cannot be executed:
    slave> SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1;
    
  8. Start the slave SQL thread again and check the replication status:
    slave> START SLAVE;
    slave> SHOW SLAVE STATUS \G
    *************************** 1. row ***************************
    ...
    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
    ...
    Seconds_Behind_Master: 0
    

You can see that the replication is up and running again.
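
One simple way to double-check that the fix from step 6 really left both servers with the same row is to run the same query on the master and on the slave; the category_id of 17 is of course specific to this example:

    master> SELECT category_id, name FROM category WHERE category_id=17;
    slave>  SELECT category_id, name FROM category WHERE category_id=17;

Both servers should return the row with category_id 17 and the name 'Inserted On Master'.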

How it works...

When the slave data was out of sync in such a way that a statement from the master would fail (in this case because of a duplicate identity column value), the slave server stopped executing any further SQL statements, even though in the background they were still being read from the master and stored in the relay logs. This is what the Slave_IO_Running (still Yes) and Slave_SQL_Running (No) columns in the output of SHOW SLAVE STATUS indicate.
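
You can observe this in the same SHOW SLAVE STATUS output: while the SQL thread is stopped, the I/O thread keeps downloading events, so the read position continues to advance while the execute position stays put. The values below are placeholders, only the two fields matter:

    slave> SHOW SLAVE STATUS \G
    ...
    Read_Master_Log_Pos: <keeps growing while the I/O thread reads>
    ...
    Exec_Master_Log_Pos: <stays put while the SQL thread is stopped>
    ...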

MySQL does this to give you a chance to look at the problem and determine whether you can repair the situation somehow. In this very simple example, the fix is easy: we bring the slave's data back in sync with the master by modifying the record in question to match the contents that were sent from the master, and then skip the INSERT replicated from the master using the SQL_SLAVE_SKIP_COUNTER global variable. This makes the slave skip exactly one statement from the relay logs the next time it is told to start. In our case, this is the problematic INSERT.
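
The counter is not limited to a single statement. If several consecutive replicated statements cannot succeed on the slave and you have already repaired the data accordingly, a higher value can be used, for example:

    slave> SET GLOBAL SQL_SLAVE_SKIP_COUNTER=3;
    slave> START SLAVE;

Note that the counter strictly speaking counts events in the relay log rather than SQL statements, and it only takes effect once the slave SQL thread is started, so consult the MySQL manual before skipping more complex transactions this way.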

After that the replication is back in sync, as master and slave are now based on identical data sets again, allowing the following statements to be replicated normally.

There's more...

Another solution in this particular case could have been to delete the record on the slave and then restart the replication with START SLAVE. As the INSERT from the master has not been executed yet, replication would continue as if nothing had happened.
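
In this example, that alternative would have looked roughly as follows; again, the category_id of 17 is specific to our scenario:

    slave> DELETE FROM category WHERE category_id=17;
    slave> START SLAVE;

The pending INSERT from the relay log is then executed normally, and the record ends up on the slave exactly as the master wrote it, without any need for SQL_SLAVE_SKIP_COUNTER.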

However, under more realistic circumstances, you might not have a chance to delete the row on the slave, for example due to foreign key constraints. We were only able to remove it because we took care of the problem immediately and could be sure that, in the meantime, no program had written to the slave and created new references to that record.

Depending on your application architecture and how quickly you noticed the problem, some process might already have written data to slave-side tables that are designed for this purpose, referencing the now present category_id 17 and effectively preventing you from deleting it.

While in this simple case we can be sure that the replication setup is now back to normal, you will often not be able to tell for certain at which point in time a replication problem originated. INSERT statements with duplicate keys immediately cause an error to become apparent. UPDATE or DELETE statements, however, will often execute successfully on the slave while having different effects there than on the master if the two were already out of sync.
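
As a purely hypothetical illustration of such silent drift: suppose the slave still contained the extra 'Inserted On Slave' row that the master never had. A statement like the following would then execute without any error on both servers, yet affect one row on the master and two rows on the slave, and neither side would report a problem:

    master> UPDATE category SET name=CONCAT(name, ' (renamed)') WHERE name LIKE 'Inserted On%';

The discrepancy simply persists unnoticed, which is exactly what makes these cases so hard to track down later.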

Note

Problems like this can corrupt the data on your slaves silently for extended periods of time. By the time you find out, it is often too late to recover without resorting to setting up the slave afresh.

When in doubt, we recommend first using mk-table-checksum as described in the Checking if servers are in sync recipe in this chapter, or, more generally, setting up the slave from a well-known good state to be completely sure!
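
As a very rough sketch of such a check with mk-table-checksum; the host names are placeholders and the available options differ between Maatkit versions, so check the tool's own documentation before relying on it:

    $ mk-table-checksum h=master-host h=slave-host

The tool computes per-table checksums on the listed servers so that their contents can be compared; the details are covered in the Checking if servers are in sync recipe.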