A fully distributed HBase runs on top of HDFS. As a fully distributed HBase cluster installation, its master daemon (HMaster) typically runs on the same server as the master node of HDFS (NameNode), while its slave daemon (HRegionServer) runs on the same server as the slave node of HDFS, which is called DataNode.
Hadoop MapReduce is not required by HBase. MapReduce daemons do not need to be started. We will cover the setup of MapReduce in this recipe too, in case you like to run MapReduce on HBase. For a small Hadoop cluster, we usually have a master daemon of MapReduce (JobTracker) run on the NameNode server, and slave daemons of MapReduce (TaskTracker) run on the DataNode servers.
This recipe describes the setup of Hadoop. We will have one master node (master1
) run NameNode and JobTracker on it. We will set up three slave nodes (slave1
to slave3)
, which will run DataNode and TaskTracker on them, respectively.
You will need four small EC2 instances, which can be obtained by using the following command:
$ec2-run-instances ami-77287b32 -t m1.small -n 4 -k your-key-pair
All these instances must be set up properly, as described in the previous recipe, Getting ready on Amazon EC2. Besides the NTP and DNS setups, Java installation is required by all servers too.
We will use the hadoop
user as the owner of all Hadoop daemons and files. All Hadoop files and data will be stored under /usr/local/hadoop
. Add the hadoop
user and create a /usr/local/hadoop
directory on all the servers, in advance.
We will set up one Hadoop client node as well. We will use client1
, which we set up in the previous recipe. Therefore, the Java installation, hadoop
user, and directory should be prepared on client1
too.
Here are the steps to set up a fully distributed Hadoop cluster:
1. In order to SSH log in to all nodes of the cluster, generate the
hadoop
user's public key on the master node:hadoop@master1$ ssh-keygen -t rsa -N ""
This command will create a public key for the
hadoop
user on the master node, at~/.ssh/id_rsa.pub.
2. On all slave and client nodes, add the
hadoop
user's public key to allow SSH login from the master node:hadoop@slave1$ mkdir ~/.ssh hadoop@slave1$ chmod 700 ~/.ssh hadoop@slave1$ cat >> ~/.ssh/authorized_keys
3. Copy the
hadoop
user's public key you generated in the previous step, and paste to~/.ssh/authorized_keys
. Then, change its permission as following:hadoop@slave1$ chmod 600 ~/.ssh/authorized_keys
4. Get the latest, stable, HBase-supported Hadoop release from Hadoop's official site, http://www.apache.org/dyn/closer.cgi/hadoop/common/. While this chapter was being written, the latest HBase-supported, stable Hadoop release was 1.0.2. Download the tarball and decompress it to our
root
directory for Hadoop, then add a symbolic link, and an environment variable:hadoop@master1$ ln -s hadoop-1.0.2 current hadoop@master1$ export HADOOP_HOME=/usr/local/hadoop/current
5. Create the following directories on the master node:
hadoop@master1$ mkdir -p /usr/local/hadoop/var/dfs/name hadoop@master1$ mkdir -p /usr/local/hadoop/var/dfs/data hadoop@master1$ mkdir -p /usr/local/hadoop/var/dfs/namesecondary
6. You can skip the following steps if you don't use MapReduce:
hadoop@master1$ mkdir -p /usr/local/hadoop/var/mapred
7. Set up
JAVA_HOME
in Hadoop's environment setting file (hadoop-env.sh):hadoop@master1$ vi $HADOOP_HOME/conf/hadoop-env.sh export JAVA_HOME=/usr/local/jdk1.6
8. Add the
hadoop.tmp.dir
property tocore-site.xml:
hadoop@master1$ vi $HADOOP_HOME/conf/core-site.xml <property> <name>hadoop.tmp.dir</name> <value>/usr/local/hadoop/var</value> </property>
9. Add the
fs.default.name
property tocore-site.xml:
hadoop@master1$ vi $HADOOP_HOME/conf/core-site.xml <property> <name>fs.default.name</name> <value>hdfs://master1:8020</value> </property>
10. If you need MapReduce, add the
mapred.job.tracker
property tomapred-site.xml:
hadoop@master1$ vi $HADOOP_HOME/conf/mapred-site.xml <property> <name>mapred.job.tracker</name> <value>master1:8021</value> </property>
11. Add a slave server list to the
slaves
file:hadoop@master1$ vi $HADOOP_HOME/conf/slaves slave1 slave2 slave3
12. Sync all Hadoop files from the master node, to client and slave nodes. Don't sync
${hadoop.tmp.dir}
after the initial installation:hadoop@master1$ rsync -avz /usr/local/hadoop/ client1:/usr/local/hadoop/ hadoop@master1$ for i in 1 2 3 do rsync -avz /usr/local/hadoop/ slave$i:/usr/local/hadoop/ sleep 1 done
13. You need to format NameNode before starting Hadoop. Do it only for the initial installation:
hadoop@master1$ $HADOOP_HOME/bin/hadoop namenode -format
14. Start HDFS from the
master
node:hadoop@master1$ $HADOOP_HOME/bin/start-dfs.sh
15. You can access your HDFS by typing the following command:
hadoop@master1$ $HADOOP_HOME/bin/hadoop fs -ls /
You can also view your HDFS admin page from the browser. Make sure the
50070
port is opened. The HDFS admin page can be viewed athttp://master1:50070/dfshealth.jsp:
16. Start MapReduce from the master node, if needed:
hadoop@master1$ $HADOOP_HOME/bin/start-mapred.sh
17. To stop HDFS, execute the following command from the master node:
hadoop@master1$ $HADOOP_HOME/bin/stop-dfs.sh
18. To stop MapReduce, execute the following command from the master node:
hadoop@master1$ $HADOOP_HOME/bin/stop-mapred.sh
To start/stop the daemon on remote slaves from the master node, a passwordless SSH login of the hadoop
user is required. We did this in step 1.
HBase must run on a special HDFS that supports a durable sync
implementation. If HBase runs on an HDFS that has no durable sync
implementation, it may lose data if its slave servers go down. Hadoop versions later than 0.20.205, including Hadoop 1.0.2 which we have chosen, support this feature.
HDFS and MapReduce use local filesystems to store their data. We created directories required by Hadoop in step 3, and set up the path to the Hadoop's configuration file in step 5.
In steps 9 to 11, we set up Hadoop so it could find HDFS, JobTracker, and slave servers. Before starting Hadoop, all Hadoop directories and settings need to be synced with the slave servers. The first time you start Hadoop (HDFS), you need to format NameNode. Note that you should only do this at the initial installation.
At this point, you can start/stop Hadoop using its start/stop script. Here we started/stopped HDFS and MapReduce separately, in case you don't require MapReduce. You can also use $HADOOP_HOME/bin/start-all.sh
and stop-all.sh
to start/stop HDFS and MapReduce using one command.