Hadoop and HBase are designed to handle the failover of their slave nodes automatically. Because a large cluster may contain many nodes, the hardware failure of a server or the shutdown of a slave node is considered normal in the cluster.
For the master nodes, HBase itself has no SPOF. HBase uses ZooKeeper as its central coordination service. A ZooKeeper ensemble is typically clustered with three or more servers; as long as more than half of the servers in the cluster are online, ZooKeeper can provide its service normally.
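The majority requirement can be made concrete with a little arithmetic. The following sketch (plain shell, no ZooKeeper needed) prints the quorum size and the number of tolerable failures for a few common ensemble sizes:

```shell
# Quorum arithmetic for a ZooKeeper ensemble: the service stays available
# while a strict majority (floor(n/2) + 1) of its n servers are online,
# so an ensemble of n servers tolerates floor((n-1)/2) failures.
for n in 3 5 7; do
  echo "$n servers: quorum=$(( n / 2 + 1 )), tolerated failures=$(( (n - 1) / 2 ))"
done
```

This is also why ensembles are usually deployed with an odd number of servers: adding a fourth server to a three-server ensemble raises the quorum size without increasing the number of failures it can tolerate.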
HBase saves its active master node, root region server location, and other important runtime data in ZooKeeper. Therefore, we can simply start two or more HMaster daemons on separate servers; the one started first becomes the active master server of the HBase cluster.
However, the NameNode of HDFS is the SPOF of the cluster. The NameNode keeps the entire HDFS filesystem image in its local memory. If the NameNode is down, HDFS can no longer function, and consequently HBase is down too. As you may have noticed, HDFS has a Secondary NameNode. Note that the Secondary NameNode is not a standby for the NameNode; it only provides a checkpoint function for the NameNode. So, the challenge of building a highly available cluster is to make the NameNode highly available.
In this recipe, we will describe the setup of two highly available master nodes, which will use Heartbeat to monitor each other. Heartbeat is a widely used HA solution that provides communication and membership services for a Linux cluster. Heartbeat needs to be combined with a Cluster Resource Manager (CRM) to start and stop services in the cluster. Pacemaker is the preferred cluster resource manager for Heartbeat. We will set up a Virtual IP (VIP) address using Heartbeat and Pacemaker, and then associate it with the active master node. Because EC2 does not support static IP addresses, we cannot demonstrate this on EC2, but we will discuss an alternative way of using an Elastic IP (EIP) address to achieve the same purpose.
We will focus on setting up NameNode and HBase; you can simply use a similar method to set up two JobTracker nodes as well.
You should already have HDFS and HBase installed. We will set up a standby master node (master2), so you need another server ready to use. Make sure all the dependencies have been configured properly. Sync your Hadoop and HBase root directories from the active master (master1) to the standby master.
We will need NFS in this recipe as well. Set up your NFS server, and mount the same NFS directory from both master1 and master2. Make sure the hadoop user has write permission to the NFS directory. Create a directory on NFS to store Hadoop's metadata. We assume the directory is /mnt/nfs/hadoop/dfs/name.
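As a concrete sketch of the NFS preparation, the commands below create the metadata directory and confirm write access. The default path here is a hypothetical stand-in so the snippet can be tried anywhere; on a real cluster it would be the NFS mount point /mnt/nfs/hadoop/dfs/name, with the check run as the hadoop user.

```shell
# Create the shared metadata directory and verify it is writable.
# NFS_DIR is a stand-in path for illustration; on a real cluster this
# would be /mnt/nfs/hadoop/dfs/name on the NFS mount.
NFS_DIR=${NFS_DIR:-/tmp/nfs-demo/hadoop/dfs/name}
mkdir -p "$NFS_DIR"
chmod 755 "$NFS_DIR"
if touch "$NFS_DIR/.write-test" 2>/dev/null; then
  rm -f "$NFS_DIR/.write-test"
  echo "$NFS_DIR is writable"
else
  echo "$NFS_DIR is NOT writable" >&2
fi
```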
We will set up a VIP for the two masters, and assume you have the following IP addresses and DNS mappings:
master1: its IP address is 10.174.14.11
master2: its IP address is 10.174.14.12
master: its IP address is 10.174.14.10; this is the VIP that will be set up later
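On a small cluster, one simple way to provide this DNS mapping is an /etc/hosts entry on every node. The following fragment mirrors the addresses above; it is shown only for illustration, and a real DNS server is preferable if you have one:

```
10.174.14.10    master
10.174.14.11    master1
10.174.14.12    master2
```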
The following instructions describe how to set up two highly available master nodes.
First, we will install Heartbeat and Pacemaker, and make some basic configurations:
1. Install Heartbeat and Pacemaker on master1 and master2:
root# apt-get install heartbeat cluster-glue cluster-agents pacemaker
2. To configure Heartbeat, make the following changes on both master1 and master2:
root# vi /etc/ha.d/ha.cf
# enable pacemaker, without stonith
crm yes
# log where ?
logfacility local0
# warning of soon be dead
warntime 10
# declare a host (the other node) dead after:
deadtime 20
# dead time on boot (could take some time until net is up)
initdead 120
# time between heartbeats
keepalive 2
# the nodes
node master1
node master2
# heartbeats, over dedicated replication interface!
ucast eth0 master1 # ignored by master1 (owner of ip)
ucast eth0 master2 # ignored by master2 (owner of ip)
# ping the name server to assure we are online
ping ns
3. Create an authkeys file. Execute the following script as the root user on master1 and master2:
root# ( echo -ne "auth 1\n1 sha1 "; \
dd if=/dev/urandom bs=512 count=1 | openssl md5 ) \
> /etc/ha.d/authkeys
root# chmod 0600 /etc/ha.d/authkeys
Pacemaker depends on resource agents to manage the cluster. A resource agent is an executable that manages a cluster resource. In our case, the VIP address and the HDFS NameNode service are the cluster resources we want to manage using Pacemaker. Pacemaker ships with an IPaddr resource agent to manage the VIP, so we only need to create our own namenode resource agent:
1. Add environment variables to the .bashrc file of the root user on master1 and master2. Don't forget to apply the changes:
root# vi /root/.bashrc
export JAVA_HOME=/usr/local/jdk1.6
export HADOOP_HOME=/usr/local/hadoop/current
export OCF_ROOT=/usr/lib/ocf
Invoke the following command to apply the previous changes:
root# source /root/.bashrc
2. Create a standard Open Clustering Framework (OCF) resource agent file called namenode, with the following content. The namenode resource agent starts by including the standard OCF functions, as follows:
root# vi namenode
#!/bin/sh
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/resource.d/heartbeat}
. ${OCF_FUNCTIONS_DIR}/.ocf-shellfuncs

usage() {
  echo "Usage: $0 {start|stop|status|monitor|meta-data|validate-all}"
}
3. Add a meta_data() function, as shown in the following code. The meta_data() function dumps the resource agent metadata to standard output. Every resource agent must have a set of XML metadata describing its own purpose and supported parameters:
root# vi namenode
meta_data() {
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="namenode">
<version>0.1</version>
<longdesc lang="en">
This is a resource agent for NameNode. It manages HDFS namenode daemon.
</longdesc>
<shortdesc lang="en">Manage namenode daemon.</shortdesc>
<parameters></parameters>
<actions>
<action name="start" timeout="120" />
<action name="stop" timeout="120" />
<action name="status" depth="0" timeout="120" interval="120" />
<action name="monitor" depth="0" timeout="120" interval="120" />
<action name="meta-data" timeout="10" />
<action name="validate-all" timeout="5" />
</actions>
</resource-agent>
END
}
4. Add a namenode_start() function. This function is used by Pacemaker to actually start the NameNode daemon on the server. In the namenode_start() function, we first check whether NameNode is already started on the server; if it is not started, we invoke hadoop-daemon.sh as the hadoop user to start it:
root# vi namenode
namenode_start() {
  # if namenode is already started on this server, bail out early
  namenode_status
  if [ $? -eq 0 ]; then
    ocf_log info "namenode is already running on this server, skip"
    return $OCF_SUCCESS
  fi
  # start namenode on this server
  ocf_log info "Starting namenode daemon..."
  su - hadoop -c "${HADOOP_HOME}/bin/hadoop-daemon.sh start namenode"
  if [ $? -ne 0 ]; then
    ocf_log err "Can not start namenode daemon."
    return $OCF_ERR_GENERIC
  fi
  sleep 1
  return $OCF_SUCCESS
}
5. Add a namenode_stop() function. This function is used by Pacemaker to actually stop the NameNode daemon on the server. In the namenode_stop() function, we first check whether NameNode is already stopped on the server; if it is running, we invoke hadoop-daemon.sh as the hadoop user to stop it:
root# vi namenode
namenode_stop() {
  # if namenode is not started on this server, bail out early
  namenode_status
  if [ $? -ne 0 ]; then
    ocf_log info "namenode is not running on this server, skip"
    return $OCF_SUCCESS
  fi
  # stop namenode on this server
  ocf_log info "Stopping namenode daemon..."
  su - hadoop -c "${HADOOP_HOME}/bin/hadoop-daemon.sh stop namenode"
  if [ $? -ne 0 ]; then
    ocf_log err "Can not stop namenode daemon."
    return $OCF_ERR_GENERIC
  fi
  sleep 1
  return $OCF_SUCCESS
}
6. Add a namenode_status() function. This function is used by Pacemaker to monitor the status of the NameNode daemon on the server. In the namenode_status() function, we use the jps command to show all running Java processes owned by the hadoop user, and grep for the name of the NameNode daemon to see whether it has started:
root# vi namenode
namenode_status() {
  ocf_log info "monitor namenode"
  su - hadoop -c "${JAVA_HOME}/bin/jps" | egrep -q "NameNode"
  rc=$?
  # grep will return true if namenode is running on this machine
  if [ $rc -eq 0 ]; then
    ocf_log info "Namenode is running"
    return $OCF_SUCCESS
  else
    ocf_log info "Namenode is not running"
    return $OCF_NOT_RUNNING
  fi
}
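The status-check pattern above can be tried in isolation. In this self-contained sketch, fake_jps is a hypothetical stand-in for the real su - hadoop -c "${JAVA_HOME}/bin/jps" call, so the logic runs without a Hadoop installation:

```shell
# fake_jps simulates the jps output the resource agent inspects; on a
# real master this would be: su - hadoop -c "${JAVA_HOME}/bin/jps"
fake_jps() {
  printf '12345 NameNode\n23456 Jps\n'
}

# Same check the agent performs: grep exits 0 when NameNode is listed.
if fake_jps | egrep -q "NameNode"; then
  echo "Namenode is running"
else
  echo "Namenode is not running"
fi
```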
7. Add a namenode_validateAll() function to make sure the environment variables are set properly before we run other functions:
root# vi namenode
namenode_validateAll() {
  if [ -z "$JAVA_HOME" ]; then
    ocf_log err "JAVA_HOME not set."
    exit $OCF_ERR_INSTALLED
  fi
  if [ -z "$HADOOP_HOME" ]; then
    ocf_log err "HADOOP_HOME not set."
    exit $OCF_ERR_INSTALLED
  fi
  # Any subject is OK
  return $OCF_SUCCESS
}
8. Add the following main routine. Here, we will simply call the previous functions to implement the required standard OCF resource agent actions:
root# vi namenode
# See how we were called.
if [ $# -ne 1 ]; then
  usage
  exit $OCF_ERR_GENERIC
fi

namenode_validateAll

case $1 in
  meta-data)
    meta_data
    exit $OCF_SUCCESS;;
  usage)
    usage
    exit $OCF_SUCCESS;;
  *);;
esac

case $1 in
  status|monitor) namenode_status;;
  start) namenode_start;;
  stop) namenode_stop;;
  validate-all);;
  *)
    usage
    exit $OCF_ERR_UNIMPLEMENTED;;
esac

exit $?
9. Change the namenode file permission and test it on master1 and master2:
root# chmod 0755 namenode
root# ocf-tester -v -n namenode-test /full/path/of/namenode
10. Make sure all the tests pass before proceeding to the next step, or the HA cluster will behave unexpectedly.
11. Install the namenode resource agent under the hac provider on master1 and master2:
root# mkdir ${OCF_ROOT}/resource.d/hac
root# cp namenode ${OCF_ROOT}/resource.d/hac
root# chmod 0755 ${OCF_ROOT}/resource.d/hac/namenode
We are now ready to configure a highly available NameNode using Heartbeat and Pacemaker. We will set up a VIP address and configure Hadoop and HBase to use this VIP address as their master node. NameNode will be started on the active master, where the VIP is assigned. If the active master crashes, Heartbeat and Pacemaker will detect it, assign the VIP address to the standby master node, and then start NameNode there.
1. Start Heartbeat on master1 and master2:
root# /etc/init.d/heartbeat start
2. Change the default crm configuration. All resource-related commands need to be executed only once, from either master1 or master2:
root# crm configure property stonith-enabled=false
root# crm configure property default-resource-stickiness=1
3. Add a VIP resource using our VIP address:
root# crm configure primitive VIP ocf:heartbeat:IPaddr params ip="10.174.14.10" op monitor interval="10s"
4. Make the following changes to configure Hadoop to use our VIP address. Sync to all masters, clients, and slaves after you've made the changes:
hadoop$ vi $HADOOP_HOME/conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://master:8020</value>
</property>
5. Make the following changes to configure HBase to use our VIP address. Sync to all masters, clients, and slaves after you've made the changes:
hadoop$ vi $HBASE_HOME/conf/hbase-site.xml
<property>
<name>hbase.rootdir</name>
<value>hdfs://master:8020/hbase</value>
</property>
6. To configure Hadoop to write its metadata to a local disk and NFS, make the following changes and sync to all masters, clients, and slaves:
hadoop$ vi $HADOOP_HOME/conf/hdfs-site.xml
<property>
<name>dfs.name.dir</name>
<value>/usr/local/hadoop/var/dfs/name,/mnt/nfs/hadoop/dfs/name</value>
</property>
7. Add the namenode resource agent we created earlier to Pacemaker. We will use NAMENODE as its resource name:
root# crm configure primitive NAMENODE ocf:hac:namenode \
op monitor interval="120s" timeout="120s" \
op start timeout="120s" \
op stop timeout="120s" \
meta resource-stickiness="1"
8. Configure the VIP resource and the NAMENODE resource as a resource group:
root# crm configure group VIP-AND-NAMENODE VIP NAMENODE
9. Configure the colocation of the VIP resource and the NAMENODE resource:
root# crm configure colocation VIP-WITH-NAMENODE inf: VIP NAMENODE
10. Configure the resource order of the VIP resource and the NAMENODE resource:
root# crm configure order IP-BEFORE-NAMENODE inf: VIP NAMENODE
11. Verify the previous Heartbeat and resource configurations using the crm_mon command. If everything is configured properly, you should see output like the following:
root@master1 hac$ crm_mon -1r
============
Last updated: Tue Nov 22 22:39:11 2011
Stack: Heartbeat
Current DC: master2 (7fd92a93-e071-4fcb-993f-9a84e6c7846f) - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 1 expected votes
1 Resources configured.
============

Online: [ master1 master2 ]

Full list of resources:

Resource Group: VIP-AND-NAMENODE
    VIP (ocf::heartbeat:IPaddr): Started master1
    NAMENODE (ocf::hac:namenode): Started master1
12. Make sure that the VIP and NAMENODE resources are started on the same server.
13. Now stop Heartbeat on master1; VIP-AND-NAMENODE should be started on master2 after several seconds.
14. Restart Heartbeat on master1; VIP-AND-NAMENODE should remain started on master2. Resources should NOT fail back to master1.
We have confirmed that our HA configuration works as expected, so we can start HDFS and HBase now. Note that NameNode has already been started by Pacemaker, so we need only start DataNode here:
1. If everything works well, we can start DataNode now:
hadoop@master$ for i in 1 2 3
do
  ssh slave$i "$HADOOP_HOME/bin/hadoop-daemon.sh start datanode"
  sleep 1
done
2. Start your HBase cluster from master, which is the active master server where the VIP address is associated:
hadoop@master$ $HBASE_HOME/bin/start-hbase.sh
3. Start a standby HMaster from the standby master server, master2 in this case:
hadoop@master2$ $HBASE_HOME/bin/hbase-daemon.sh start master
The previous steps finally leave us with a cluster structure like the following diagram:
At first, we installed Heartbeat and Pacemaker on the two masters and then configured Heartbeat to enable Pacemaker.
In step 2 of the Create and install a NameNode resource agent section, we created the namenode script, which is implemented as a standard OCF resource agent. The most important function of the namenode script is namenode_status, which monitors the status of the NameNode daemon. Here, we use the jps command to show all running Java processes owned by the hadoop user, and grep for the name of the NameNode daemon to see if it has started. The namenode resource agent is used by Pacemaker to start, stop, and monitor the NameNode daemon. In the namenode script, as you can see in the namenode_start and namenode_stop functions, we actually start/stop NameNode by using hadoop-daemon.sh, which is used to start/stop a Hadoop daemon on a single server. You can find the full code in the source shipped with this book.
We started Heartbeat after our namenode resource agent was tested and installed. Then, we made some changes to the default crm configuration. The default-resource-stickiness=1 setting is very important, as it turns off the automatic failback of a resource.
We added a VIP resource to Pacemaker and configured Hadoop and HBase to use it in steps 3 to 5 of the Configure highly available NameNode section. By using the VIP in their configuration, Hadoop and HBase can switch to communicating with the standby master if the active one goes down.
In step 6 of the same section, we configured Hadoop (HDFS NameNode) to write its metadata to both the local disk and NFS. If the active master goes down, NameNode will be started on the standby master. Because both masters mount the same NFS directory, NameNode started on the standby master can apply the latest metadata from NFS, and restore HDFS to the state it was in before the original active master went down.
In steps 7 to 10, we added the NAMENODE resource using the namenode resource agent we created in step 2 of the Create and install a NameNode resource agent section; we then set up the VIP and NAMENODE resources as a group (step 8), made sure they always run on the same server (step 9), and enforced the right start-up order (step 10). We did this because we don't want VIP running on master1 while NameNode is running on master2.
Because Pacemaker starts NameNode for us via the namenode resource agent, we need to start the DataNodes separately, which is what we did in step 1 of the Start DataNode, HBase cluster, and backup HBase master section.
After starting HBase normally, we started our standby HBase master (HMaster) on the standby master server. If you check your HBase master log, you will find output like the following, which shows that it is running as a standby HMaster:
2011-11-21 23:38:55,168 INFO org.apache.hadoop.hbase.master.ActiveMasterManager: Another master is the active master, ip-10-174-14-15.us-west-1.compute.internal:60000; waiting to become the next active master
Finally, we got NameNode and HMaster running on two servers with an active-standby configuration. The single point of failure of the cluster was avoided.
However, this leaves us with a lot of work to do in production. You need to test your HA cluster in all the rare failure cases, such as a server power outage, an unplugged network cable, a shutdown of the network switch, or anything else you can think of.
On the other hand, the SPOF of the cluster may not be as critical as you think. In our experience, almost all cluster downtime is caused by operator mistakes or software upgrades. It is better to keep your cluster simple.
It is more complex to set up a highly available HBase cluster on Amazon EC2, because EC2 does not support static IP addresses, and so we can't use a VIP on EC2. An alternative is to use an Elastic IP (EIP) address. An Elastic IP address plays the role of a static IP address on EC2, but it is associated with your account rather than with a particular instance. We can use Heartbeat to associate the EIP with the standby master automatically if the active one goes down. Then, we configure Hadoop and HBase to use the instance's public DNS name associated with the EIP to find the active master. Also, in the namenode resource agent, we have to start/stop not only NameNode, but also all the DataNodes. This is because the IP address of the active master has changed, but a DataNode cannot find the new active master unless it is restarted.
We will skip the details because they are beyond the scope of this book. We created an elastic-ip resource agent to achieve this purpose; you can find it in the source shipped with this book.