The following figure shows a typical Hadoop deployment. A Hadoop deployment consists of a single name node, multiple data nodes, a single job tracker, and multiple task trackers. Let us look at each type of node.
The name node and data nodes provide the HDFS filesystem, where the data nodes hold the actual data and the name node holds information about which file is on which data node. A user who wants to read a file first talks to the name node to find where the file is located, and then talks to the data nodes to access the file.
Similarly, the job tracker keeps track of MapReduce jobs and schedules the individual map and reduce tasks on the task trackers. Users submit jobs to the job tracker, which runs them on the task trackers. It is, however, possible to run all of these servers on a single node or across multiple nodes.
This recipe explains how to set up your own Hadoop cluster. For the setup, we need to configure the job tracker and the task trackers and then list the task trackers in the HADOOP_HOME/conf/slaves file of the job tracker node. When we start the job tracker, it will start the task tracker nodes. Let us look at the deployment in detail:
You need at least one Linux or Mac OS X machine for the setup. You may follow this recipe using either a single machine or multiple machines. If you are using multiple machines, you should choose one machine as the master node and the others as slave nodes. You will run the HDFS name node and the job tracker on the master node. If you are using a single machine, use it as both the master node and the slave node.
Install Java in all machines that will be used to set up Hadoop.
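If you are not sure whether a suitable JDK is already installed on a machine, a quick check (assuming the java binary is on the PATH) is:
>java -version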
In each machine, create a directory for Hadoop data, which we will call HADOOP_DATA_DIR. Then create three subdirectories: HADOOP_DATA_DIR/data, HADOOP_DATA_DIR/local, and HADOOP_DATA_DIR/name.
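A minimal sketch of this step, assuming a bash shell; replace HADOOP_DATA_DIR with the actual path you chose (for example, /opt/hadoop-data):
>mkdir -p HADOOP_DATA_DIR/data HADOOP_DATA_DIR/local HADOOP_DATA_DIR/name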
Set up SSH keys to enable passwordless SSH from the master node to the slave nodes. Check that you can SSH to localhost and to all other nodes without a passphrase by running the following command:
>ssh localhost (or ssh <IP address>)
If the preceding command returns an error or asks for a password, create SSH keys by executing the following command:
>ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Then copy the ~/.ssh/id_dsa.pub file to all the nodes in the cluster. Add the SSH key to the ~/.ssh/authorized_keys file in each node by running the following command:
>cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
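One way to get the key onto a remote slave node is scp followed by an append on that node. This is a sketch that assumes the same user account exists on every machine and uses 209.126.198.71, one of the example slave IPs from this recipe; you will be prompted for the slave's password this one time:
>scp ~/.ssh/id_dsa.pub 209.126.198.71:/tmp/master_key.pub
>ssh 209.126.198.71 'mkdir -p ~/.ssh && cat /tmp/master_key.pub >> ~/.ssh/authorized_keys'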
Then you can log in with the following command:
>ssh localhost
Unzip the Hadoop distribution to the same location on all machines using the following command:
>tar -zxvf hadoop-1.0.0.tar.gz
In all machines, edit the HADOOP_HOME/conf/hadoop-env.sh file by uncommenting the JAVA_HOME line and pointing it to your local Java installation. For example, if Java is in /opt/jdk1.6, change the line to export JAVA_HOME=/opt/jdk1.6.
Place the IP address of the node used as the master (for running the job tracker and name node) in HADOOP_HOME/conf/masters on a single line. If you are doing a single-node deployment, leave the current value of localhost as it is. For example:
209.126.198.72
Place the IP addresses of all slave nodes in the HADOOP_HOME/conf/slaves file, each on a separate line. For example:
209.126.198.72
209.126.198.71
Inside each node's HADOOP_HOME/conf directory, add the following to the core-site.xml, hdfs-site.xml, and mapred-site.xml files. Before adding the configurations, replace MASTER_NODE with the IP address of the master node and HADOOP_DATA_DIR with the directory you created earlier.
Add the URL of the name node to HADOOP_HOME/conf/core-site.xml as follows:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://MASTER_NODE:9000/</value>
  </property>
</configuration>
Add the locations to store metadata (names) and data within HADOOP_HOME/conf/hdfs-site.xml as follows:
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>HADOOP_DATA_DIR/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>HADOOP_DATA_DIR/data</value>
  </property>
</configuration>
Add the job tracker location to HADOOP_HOME/conf/mapred-site.xml; the Hadoop client will use this job tracker when submitting jobs. The MapReduce local directory is the location used by Hadoop to store temporary files. The final property sets the maximum number of map tasks per node; set it to the number of CPU cores on the machine.
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>MASTER_NODE:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>HADOOP_DATA_DIR/local</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
</configuration>
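If you are unsure how many cores a machine has, either of the following commands will tell you (nproc on most Linux distributions, sysctl on Mac OS X):
>nproc
>sysctl -n hw.ncpu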
Format a new HDFS filesystem by running the following command from the Hadoop name node (the master node):
>bin/hadoop namenode -format
...
/Users/srinath/playground/hadoop-book/hadoop-temp/dfs/name has been successfully formatted.
In the master node, change the directory to HADOOP_HOME and run the following commands:
>bin/start-dfs.sh
>bin/start-mapred.sh
Verify the installation by listing the running Java processes with the ps aux | grep java command. The master node will list the name node, data node, job tracker, and task tracker processes, and the slaves will each have a data node and a task tracker.
Browse the web-based monitoring pages for the name node and job tracker: NameNode at http://MASTER_NODE:50070/ and JobTracker at http://MASTER_NODE:50030/.
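As an alternative check, if the JDK's jps tool is on your PATH, it lists the Hadoop daemons by their class names (NameNode, DataNode, JobTracker, TaskTracker):
>jps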
You can find the log files in ${HADOOP_HOME}/logs.
Make sure the HDFS setup is OK by listing the files using the HDFS command line:
>bin/hadoop dfs -ls /
Found 2 items
drwxr-xr-x   - srinath supergroup          0 2012-04-09 08:47 /Users
drwxr-xr-x   - srinath supergroup          0 2012-04-09 08:47 /tmp
Download the Amazon dataset from http://snap.stanford.edu/data/bigdata/amazon/amazon-meta.txt.gz and unzip it. We call the directory that holds it DATA_DIR. The dataset will be about 1 gigabyte; if you want your executions to finish faster, you can use only a subset of the dataset, as sketched below.
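One simple way to carve out a smaller subset, assuming a bash shell (the 200,000-line cut-off is an arbitrary example, and cutting mid-record does not matter for word counting). If you use the subset, upload it instead of the full file in the later steps:
>head -n 200000 DATA_DIR/amazon-meta.txt > DATA_DIR/amazon-meta-small.txt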
Copy the hadoop-microbook.jar file from SAMPLE_DIR to HADOOP_HOME.
If you have not already done so, upload the Amazon dataset to the HDFS filesystem using the following commands:
>bin/hadoop dfs -mkdir /data/
>bin/hadoop dfs -mkdir /data/amazon-dataset
>bin/hadoop dfs -put <DATA_DIR>/amazon-meta.txt /data/amazon-dataset/
>bin/hadoop dfs -ls /data/amazon-dataset
Run the MapReduce job through the following command from HADOOP_HOME:
$ bin/hadoop jar hadoop-microbook.jar microbook.wordcount.WordCount /data/amazon-dataset /data/wordcount-output
You can find the results of the MapReduce job in the output directory. Use the following command to list the contents:
$ bin/hadoop dfs -ls /data/wordcount-output
As described in the introduction to the chapter, a Hadoop installation consists of HDFS nodes, a job tracker, and worker nodes. When we start the name node, it finds the slaves through the HADOOP_HOME/conf/slaves file and uses SSH to start the data nodes on the remote servers. Similarly, when we start the job tracker, it finds the slaves through the HADOOP_HOME/conf/slaves file and starts the task trackers.
When we run the MapReduce job, the client finds the job tracker from the configuration and submits the job to it. The client waits for the execution to finish, receiving the job's standard output and printing it to the console as long as the job is running.
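If you want to watch a job from another terminal while it runs, one option is the job tracker's command-line interface; the <job-id> placeholder comes from the listing:
>bin/hadoop job -list
>bin/hadoop job -status <job-id>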