A distributed HBase depends on a running ZooKeeper cluster. All HBase cluster nodes and clients need to be able to access the ZooKeeper ensemble.
This recipe describes how to set up a ZooKeeper cluster. We will only set up a standalone ZooKeeper node for our HBase cluster, but in production it is recommended that you run a ZooKeeper ensemble of at least three nodes. Also, make sure to run an odd number of nodes.
We will cover the setting up of a clustered ZooKeeper in the There's more... section of this recipe.
First, make sure Java is installed on your ZooKeeper server.
We will use the hadoop user as the owner of all ZooKeeper daemons and files. All the ZooKeeper files and data will be stored under /usr/local/ZooKeeper; you need to create this directory in advance. Our ZooKeeper will be set up on master1 too.
We will set up one ZooKeeper client on client1. So, the Java installation, hadoop user, and directory should be prepared on client1 as well.
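The prerequisites above can be checked with a small helper before you continue. This is only a sketch: check_zk_prereqs is a hypothetical function name, not part of ZooKeeper, and it assumes that the java binary is expected on the PATH of the current user.

```shell
# check_zk_prereqs DIR
# Hypothetical helper (not shipped with ZooKeeper): reports whether a java
# binary is on PATH and whether DIR exists and is writable by the current user.
# Returns 0 only when both checks pass.
check_zk_prereqs() {
  zk_dir="$1"
  zk_status=0
  if command -v java >/dev/null 2>&1; then
    echo "java: found"
  else
    echo "java: missing"
    zk_status=1
  fi
  if [ -d "$zk_dir" ] && [ -w "$zk_dir" ]; then
    echo "dir: ok"
  else
    echo "dir: missing or not writable"
    zk_status=1
  fi
  return "$zk_status"
}

# Example (run as the hadoop user on master1 and again on client1):
#   check_zk_prereqs /usr/local/ZooKeeper
```

Running the same check on both master1 and client1 catches a missing directory or Java installation before any ZooKeeper files are copied.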
To set up a standalone ZooKeeper installation, follow these instructions:
1. Get the latest stable ZooKeeper release from ZooKeeper's official site, http://zookeeper.apache.org/releases.html#download.
2. Download the tarball and decompress it to our root directory for ZooKeeper. We will set a ZK_HOME environment variable to make the setup easier. As of this writing, ZooKeeper 3.4.3 is the latest stable version:
hadoop@master1$ ln -s zookeeper-3.4.3 current
hadoop@master1$ export ZK_HOME=/usr/local/ZooKeeper/current
3. Create directories for ZooKeeper to store its snapshot and transaction log:
hadoop@master1$ mkdir -p /usr/local/ZooKeeper/var/data
hadoop@master1$ mkdir -p /usr/local/ZooKeeper/var/datalog
4. Create the $ZK_HOME/conf/java.env file and put the Java settings there:
hadoop@master1$ vi $ZK_HOME/conf/java.env
JAVA_HOME=/usr/local/jdk1.6
export PATH=$JAVA_HOME/bin:$PATH
5. Copy the sample ZooKeeper setting file, and make the following changes to set where ZooKeeper should store its data:
hadoop@master1$ cp $ZK_HOME/conf/zoo_sample.cfg $ZK_HOME/conf/zoo.cfg
hadoop@master1$ vi $ZK_HOME/conf/zoo.cfg
dataDir=/usr/local/ZooKeeper/var/data
dataLogDir=/usr/local/ZooKeeper/var/datalog
6. Sync all files under /usr/local/ZooKeeper from the master node to the client. Don't sync ${dataDir} and ${dataLogDir} after this initial installation.
7. Start ZooKeeper from the master node by executing this command:
hadoop@master1$ $ZK_HOME/bin/zkServer.sh start
8. Connect to the running ZooKeeper, and execute some commands to verify the installation:
hadoop@client1$ $ZK_HOME/bin/zkCli.sh -server master1:2181
[zk: master1:2181(CONNECTED) 0] ls /
[zookeeper]
[zk: master1:2181(CONNECTED) 1] quit
9. Stop ZooKeeper from the master node by executing the following command:
hadoop@master1$ $ZK_HOME/bin/zkServer.sh stop
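Step 6 does not show a command for the sync. One possible way to do it is with rsync, as sketched below. This is an assumption rather than the recipe's own method: sync_zk_install is a hypothetical helper, rsync must be installed on both machines, and the data and transaction log directories are assumed to live under var/ as configured in zoo.cfg in step 5.

```shell
# sync_zk_install SRC DEST
# Hypothetical helper: copies the ZooKeeper installation tree from SRC to DEST
# with rsync, skipping the top-level var/ directory (assumed to hold dataDir
# and dataLogDir). DEST may be a local path or a remote host:path target.
sync_zk_install() {
  rsync -a --exclude '/var/' "$1/" "$2/"
}

# Example (from master1, assuming SSH access to client1 as the hadoop user):
#   sync_zk_install /usr/local/ZooKeeper client1:/usr/local/ZooKeeper
```

Because the exclude pattern is anchored at the root of the transfer, only the top-level var/ directory is skipped; the current symlink and the configuration files under conf/ are copied as they are.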
In this recipe, we set up a basic standalone ZooKeeper instance. As you can see, the setup is very simple; all you need to do is tell ZooKeeper where to find Java and where to save its data.
In step 4, we created a file named java.env and placed the Java settings in this file. You must use this filename, because ZooKeeper, by default, gets its Java settings from this file.
ZooKeeper's settings file is called zoo.cfg. You can copy the settings from the sample file shipped with ZooKeeper. The default settings are fine for a basic installation. As ZooKeeper always plays a central role in a cluster system, it should be set up properly to gain the best performance.
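As an illustration of how little a standalone zoo.cfg needs, the following sketch writes a minimal file from scratch instead of editing the copied sample. The helper name write_standalone_cfg is hypothetical; the tickTime and clientPort values are the ones found in zoo_sample.cfg, and the two directories are whatever paths you pass in.

```shell
# write_standalone_cfg FILE DATA_DIR DATALOG_DIR
# Hypothetical helper: writes a minimal standalone zoo.cfg to FILE.
write_standalone_cfg() {
  cat > "$1" <<EOF
# Basic time unit in milliseconds (same value as zoo_sample.cfg).
tickTime=2000
# Port that clients such as zkCli.sh and HBase connect to.
clientPort=2181
# Where ZooKeeper stores its snapshots.
dataDir=$2
# Where ZooKeeper stores its transaction log.
dataLogDir=$3
EOF
}

# Example, using the paths from this recipe:
#   write_standalone_cfg "$ZK_HOME/conf/zoo.cfg" \
#     /usr/local/ZooKeeper/var/data /usr/local/ZooKeeper/var/datalog
```

Keeping dataLogDir separate from dataDir lets you place the transaction log on its own disk later, which is one of the first things to consider when tuning ZooKeeper.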
To connect to a running ZooKeeper ensemble, use its command-line tool, and specify the ZooKeeper server and port you want to connect to. The default client port is 2181. You don't need to specify it if you are using the default port setting.
Each piece of data in ZooKeeper is called a znode. Znodes are organized in a hierarchy, like a filesystem. ZooKeeper provides commands to access or update znodes from its command-line tool; type help for more information.
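To get a feel for the filesystem-like hierarchy, you can create, read, update, and delete a test znode. The sketch below wraps zkCli.sh in a hypothetical zk_cmd function and assumes that your zkCli.sh accepts a single command after the connection options; if it does not, type the same commands at the interactive prompt instead. The /demo path and its data are made up for illustration.

```shell
# zk_cmd ARGS...
# Hypothetical wrapper: runs one zkCli.sh command against $ZK_SERVER.
# Assumes ZK_HOME points at the ZooKeeper installation and ZK_SERVER is
# a host:port pair such as master1:2181.
zk_cmd() {
  "$ZK_HOME/bin/zkCli.sh" -server "$ZK_SERVER" "$@"
}

# Example session (the /demo znode is only an illustration):
#   ZK_SERVER=master1:2181
#   zk_cmd create /demo hello    # create a znode holding "hello"
#   zk_cmd get /demo             # read its data and metadata
#   zk_cmd set /demo world       # overwrite the data
#   zk_cmd ls /                  # list the children of the root
#   zk_cmd delete /demo          # remove the znode again
```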
As HBase relies on ZooKeeper as its coordination service, the ZooKeeper service must be extremely reliable. In production, you must run a ZooKeeper cluster of at least three nodes. Also, make sure to run an odd number of nodes.
The procedure to set up a clustered ZooKeeper is basically the same as shown in this recipe. First, follow the previous steps to set up each cluster node. Then, add the following settings to each node's zoo.cfg, so that every node knows about every other node in the ensemble:
hadoop@node{1,2,3}$ vi $ZK_HOME/conf/zoo.cfg
server.1=node1:2888:3888
server.2=node2:2888:3888
server.3=node3:2888:3888
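Each entry has the form server.N=host:2888:3888, where N is the node ID, the first port is used by followers to connect to the leader, and the second port is used for leader election. For larger ensembles, the lines can be generated rather than typed. The following is a minimal sketch; zk_server_lines is a hypothetical helper, and it assumes that node IDs are assigned in the order in which the hosts are listed.

```shell
# zk_server_lines HOST...
# Hypothetical helper: prints one server.N=HOST:2888:3888 line per host,
# numbering the hosts from 1 in the order given.
zk_server_lines() {
  zk_id=1
  for zk_host in "$@"; do
    echo "server.${zk_id}=${zk_host}:2888:3888"
    zk_id=$((zk_id + 1))
  done
}

# Example: append the ensemble definition to zoo.cfg on every node.
#   zk_server_lines node1 node2 node3 >> "$ZK_HOME/conf/zoo.cfg"
```

Use the same host order on every node, so that each server.N line refers to the same machine across the ensemble.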
Also, you need to put a myid file under ${dataDir}. The myid file consists of a single line containing only the node ID. So myid of node1 would contain the text 1 and nothing else.
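Creating the myid file is a one-line job, but it is easy to write the wrong ID on the wrong node. The following sketch adds a basic sanity check; write_myid is a hypothetical helper, and the range of 1 to 255 reflects the limit on server IDs described in the ZooKeeper administrator's guide.

```shell
# write_myid DATA_DIR ID
# Hypothetical helper: writes ID as the only line of DATA_DIR/myid.
# Rejects anything that is not a whole number between 1 and 255.
write_myid() {
  case "$2" in
    ''|*[!0-9]*)
      echo "myid must be a number between 1 and 255" >&2
      return 1
      ;;
  esac
  if [ "$2" -lt 1 ] || [ "$2" -gt 255 ]; then
    echo "myid must be a number between 1 and 255" >&2
    return 1
  fi
  echo "$2" > "$1/myid"
}

# Example, using the dataDir from this recipe:
#   on node1: write_myid /usr/local/ZooKeeper/var/data 1
#   on node2: write_myid /usr/local/ZooKeeper/var/data 2
#   on node3: write_myid /usr/local/ZooKeeper/var/data 3
```

The ID written on each node must match the N of that node's server.N line in zoo.cfg.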
Note
Clocks on all ZooKeeper nodes must be synchronized. You can use Network Time Protocol (NTP) to keep the clocks synchronized.
Start ZooKeeper on each node of your cluster. Then, you can connect to the cluster from your client by using the following command:
$ zkCli.sh -server node1,node2,node3
ZooKeeper will function as long as more than half of the nodes in the ZooKeeper cluster are alive. This means that a three-node cluster can tolerate the failure of only one server.
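The majority rule also explains the advice to run an odd number of nodes. The following sketch works through the arithmetic; the function names zk_quorum and zk_tolerated_failures are made up for this example.

```shell
# zk_quorum N: smallest number of live nodes that is more than half of N.
zk_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# zk_tolerated_failures N: how many of N nodes can fail while a majority
# of the ensemble is still alive.
zk_tolerated_failures() {
  echo $(( ($1 - 1) / 2 ))
}

for n in 3 4 5; do
  echo "$n nodes: quorum $(zk_quorum "$n"), tolerates $(zk_tolerated_failures "$n") failure(s)"
done
# Prints:
#   3 nodes: quorum 2, tolerates 1 failure(s)
#   4 nodes: quorum 3, tolerates 1 failure(s)
#   5 nodes: quorum 3, tolerates 2 failure(s)
```

A four-node cluster tolerates no more failures than a three-node cluster, so the extra node adds cost without adding fault tolerance.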