High Availability MySQL Cookbook

By: Alexander Davies
Overview of this book

High availability is something that all websites hope to achieve, especially those that are linked to big companies. MySQL, an open source relational database management system (RDBMS), can be made highly available to protect from corruption, hardware failure, software crashes, and user error. Running a MySQL setup is quite simple; things start getting complex when you consider the best way to provide redundancy. There are a large number of techniques available to add redundancy and high availability to MySQL, but most are both poorly understood and poorly documented.

This book provides recipes showing how to design, implement, and manage a MySQL Cluster, and how to achieve high availability using MySQL replication, block-level replication, shared storage, and the open source Global File System (GFS). It covers all the major techniques available for increasing the availability of your MySQL databases, and demonstrates how to design, implement, troubleshoot, and manage a highly available MySQL setup using any one of several techniques, which are shown in different recipes. It is based on MySQL Cluster 7.0, MySQL 5.0.77 (for non-clustered recipes), and CentOS / Red Hat Enterprise Linux 5.3.

The book starts by introducing MySQL Cluster as a technology and explaining how to set up a simple cluster. It will help you master the options available for backing up and restoring data in a MySQL Cluster. By following the practical examples in this book, you will learn how to manage the MySQL Cluster, and we will also discuss some troubleshooting aspects. We then look at achieving high availability for MySQL databases with MySQL replication, block-level replication, shared storage (a SAN or NAS), and DRBD. Finally, you will learn the principles of performance tuning and how to tune a MySQL database for optimal performance.

Recovering from a cluster shutdown


This recipe covers the procedure to follow after a cluster shutdown. We discuss the split-brain problem and also explain how to start a cluster without all of its storage nodes.

How to do it…

This section covers the procedure to follow—both for a partial failure (some nodes fail, but the cluster remains operational) and a complete failure (all nodes fail!).

  • Partial cluster failure

    In the event of a single node failing, you will notice the following:

    • If the node that fails is a management node, no immediate problem occurs, but other nodes cannot restart, and activities requiring a management node (online backups, centralized logging) will not take place

    • If the node that fails is a storage node, then assuming one node remains in each nodegroup, there is no immediate effect (although a small number of in-flight transactions may be rolled back)

    • If the node that fails is a SQL node, any clients connected to that SQL node will either have to use another SQL node or will fail, but there is no effect on the cluster itself

    To recover from a partial shutdown, carry out the restart procedure in the previous recipe (Restarting a MySQL Cluster without downtime); however, it may not be necessary to kill the existing process. A sketch of the full startup sequence appears after this list.

    Note

    If you do find a zombied (crashed) process remaining, you should first kill that process and then restart the node.

  • Complete cluster failure

    The following failures can cause a total cluster shutdown:

    • A catastrophic software bug that causes multiple cluster nodes to fail

    • Every node in the cluster losing power (an entire facility failing, for example), or a split-brain condition (discussed shortly)

    • Malicious or mistaken users gaining access to the management node or any storage node

    A split-brain problem refers to a cluster being split into parts by a communication failure between nodes. If we have four nodes that are split into two groups of two nodes each (through the failure of a switch, perhaps), there is no way for either pair to tell whether the other pair is still working. In this case, the only safe thing to do is to shut down both pairs, even though each pair could have all the data required to carry on working.

    Imagine what would happen if your four-storage-node, two-nodegroup cluster continued working as two separate clusters, and you then had to attempt to reconcile two completely different databases!

    MySQL Cluster gets around this with the concept of an arbitrator. Put simply, while all nodes can still communicate, the cluster nodes elect a single node to act as the arbitrator.

    In the event of nodes losing contact with each other, each group of nodes that can still communicate asks the following questions:

    • Do we (the nodes we can now talk to) have enough nodes to remain viable (at least one storage node from each nodegroup)?

    • Can we see the previously agreed arbitrator?

    Unless the answer to both questions is yes, the nodes will shut down, with an error similar to the following appearing in the log:

    Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and cannot form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.

    Note

    The arbitrator is typically the management node, but it can also be a SQL node; you can specify ArbitrationRank=1 in the config.ini file to make a node a high-priority candidate to become the cluster arbitrator (an example configuration appears after this list).

    Recovery from a full cluster shutdown is conceptually simple: we need to start all storage nodes. It is likely that the storage nodes will have killed themselves, or been killed by whatever caused the outage, so the procedure is identical to the rolling cluster restart minus the step of killing the existing processes. In other words, start the management node (ndb_mgmd), start all storage nodes (ndbd), and start or restart all SQL nodes (mysqld).
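As a quick reference, the full startup sequence looks like the following. This is a minimal sketch: the config.ini path, the hostnames in the prompts, and the mysqld service name are assumptions that will vary with your installation.

# On the management node, start the management daemon:
[root@mgm ~]# ndb_mgmd --config-file=/usr/local/mysql-cluster/config.ini

# On each storage node, start the data node daemon:
[root@node1 ~]# ndbd

# On each SQL node, start (or restart) the MySQL server:
[root@sql1 ~]# service mysqld start

# Back on the management node, check that everything has started:
[root@mgm ~]# ndb_mgm -e "ALL STATUS"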
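As mentioned in the note above, arbitration priorities are set per node in config.ini. The following fragment is an illustrative sketch (the SQL node's hostname is an assumption): nodes with ArbitrationRank=1 are preferred as arbitrator, nodes with rank 2 are used only if no rank 1 node is available, and nodes with rank 0 are never used.

[ndb_mgmd]
HostName=10.0.0.5
# High priority: the preferred arbitrator
ArbitrationRank=1

[mysqld]
HostName=10.0.0.6
# Low priority: used only if no rank 1 node is available
ArbitrationRank=2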

How it works…

During a full-cluster startup, the storage nodes will start and will have to use the local copies of data that they stored to disk (it is likely that there will be some data loss after a total cluster shutdown). By default, a running MySQL Cluster commits a Local Checkpoint (LCP), a copy of all the local data held on disk, every time 4 MB of changes have been made to the cluster since the previous LCP.

A MySQL Cluster will also take a global checkpoint (all transactions that have occurred since the last LCP) to disk every two seconds. When starting from a full shutdown, a storage node will apply all local transactions up to the last LCP, and then apply the transactions recorded up to the most recent global checkpoint (taken at most two seconds before the shutdown) to get data that is as up to date as possible. Because global checkpoints are made consistently across the cluster, this allows for consistent recovery of all nodes.
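Both checkpoint intervals are controlled in the [ndbd default] section of config.ini. The following fragment simply makes the defaults described above explicit, as an illustrative sketch rather than tuning advice:

[ndbd default]
# Take a global checkpoint every 2000 milliseconds (the default)
TimeBetweenGlobalCheckpoints=2000
# Start a local checkpoint after roughly 4 MB of writes: the value is the
# base-2 logarithm of the number of 4-byte words (2^20 x 4 bytes = 4 MB)
TimeBetweenLocalCheckpoints=20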

There's more…

In the case of a total cluster shutdown, it may happen that a storage node is damaged and cannot be repaired quickly. It is possible to start your cluster with only one storage node per nodegroup. To do this, pass the --nowait-nodes=<NODES> option to ndbd, where <NODES> is a comma-separated list of the node IDs not to wait for. For example, take the following cluster:

[ndbd(NDB)] 4 node(s)
id=3 @10.0.0.1  (mysql-5.1.34 ndb-7.0.6, Nodegroup: 0)
id=4 @10.0.0.2  (mysql-5.1.34 ndb-7.0.6, Nodegroup: 0, Master)
id=5 @10.0.0.3  (mysql-5.1.34 ndb-7.0.6, Nodegroup: 1)
id=6 @10.0.0.4  (mysql-5.1.34 ndb-7.0.6, Nodegroup: 1)

We could start with any combination of one node from nodegroup 0 (node 3 or node 4) and one node from nodegroup 1 (node 5 or node 6). In this example, we will start without node3:

This example assumes that your cluster is already shut down.

Run the following command on the nodes that you want to start:

[root@node4 ~]# ndbd --nowait-nodes=3
2009-07-09 23:32:02 [ndbd] INFO     -- Configuration fetched from '10.0.0.5:1186', generation: 1

The cluster should start without node3:

ndb_mgm> ALL STATUS
Node 3: not connected
Node 4: started (mysql-5.1.34 ndb-7.0.6)
Node 5: started (mysql-5.1.34 ndb-7.0.6)
Node 6: started (mysql-5.1.34 ndb-7.0.6)

Clearly, at this point, the cluster has a single point of failure (node 4 is now the only storage node in nodegroup 0), so node3 should be repaired and started as quickly as possible.
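To bring node3 back once its hardware is repaired, start its data node process again. If the node's on-disk data was lost along with the hardware, the --initial flag (sketched below) wipes the node's local filesystem and rebuilds it over the network from the surviving node in its nodegroup, which can take some time for large datasets.

[root@node3 ~]# ndbd --initial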