Instant MapReduce Patterns - Hadoop Essentials How-to

By: Liyanapathirannahelage H Perera

Overview of this book

MapReduce is a technology that enables users to process large datasets, and Hadoop is an implementation of MapReduce. More and more data is becoming available, and hidden within it are insights that might hold the key to success or failure. MapReduce gives us a way to write code that processes and analyzes this data.

Instant MapReduce Patterns – Hadoop Essentials How-to is a concise introduction to Hadoop and programming with MapReduce. It aims to get you started and give you an overall feel for programming with Hadoop, so that you have a well-grounded foundation to understand and solve your MapReduce problems as needed.

The book starts with the configuration of Hadoop before moving on to writing simple examples and discussing MapReduce programming patterns. We begin by installing Hadoop and writing a word count program. After that, we deal with seven styles of MapReduce programs: analytics, set operations, cross correlation, search, graph, joins, and clustering. For each case, you will learn the pattern and create a representative example program. The book also provides additional pointers to further enhance your Hadoop skills.

Installing Hadoop in a distributed setup and running a word count application (Simple)


A typical Hadoop deployment consists of a single name node, multiple data nodes, a single job tracker, and multiple task trackers. Let us look at each type of node.

The name node and data nodes provide the HDFS filesystem, where the data nodes hold the actual data and the name node holds information about which file resides on which data nodes. A user who wants to read a file first talks to the name node to find where the file is located, and then talks to the data nodes to access it.

Similarly, the job tracker keeps track of MapReduce jobs and schedules the individual map and reduce tasks on the task trackers. Users submit jobs to the job tracker, which runs them on the task trackers. It is possible to run all of these types of servers either on a single node or spread across multiple nodes.

This recipe explains how to set up your own Hadoop cluster. For the setup, we need to configure the job tracker and the task trackers, and then list the task tracker nodes in the HADOOP_HOME/conf/slaves file on the job tracker node. When we start the job tracker, it will start the task tracker nodes. Let us look at the deployment in detail:

Getting ready

  1. You need at least one Linux or Mac OS X machine for the setup. You may follow this recipe using either a single machine or multiple machines. If you are using multiple machines, choose one machine as the master node and the others as slave nodes. You will run the HDFS name node and the job tracker on the master node. If you are using a single machine, use it as both the master node and the slave node.

  2. Install Java on all machines that will be used to set up Hadoop.
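
     For example, you can verify the installation on each machine with the following command (this assumes the java binary is already on your PATH; adjust the command if your JDK lives elsewhere):

    >java -version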

How to do it...

  1. In each machine, create a directory for Hadoop data, which we will call HADOOP_DATA_DIR. Then create three subdirectories: HADOOP_DATA_DIR/data, HADOOP_DATA_DIR/local, and HADOOP_DATA_DIR/name.
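
     For example, assuming you pick /opt/hadoop-data as HADOOP_DATA_DIR (the path is only an example; any writable location works), the directories can be created as follows:

    >export HADOOP_DATA_DIR=/opt/hadoop-data
    >mkdir -p $HADOOP_DATA_DIR/data $HADOOP_DATA_DIR/local $HADOOP_DATA_DIR/name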

  2. Set up SSH keys to enable passwordless SSH from the master node to the slave nodes. Check that you can SSH to the localhost and to all other nodes without a passphrase by running the following command:

    >ssh localhost (or ssh <IP address>)
    
  3. If the preceding command returns an error or asks for a password, create SSH keys by executing the following command:

    >ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
    
  4. Then copy the ~/.ssh/id_dsa.pub file to all the nodes in the cluster. Add the SSH key to the ~/.ssh/authorized_keys file in each node by running the following command:

    >cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
    
  5. Then you can log in with the following command:

    >ssh localhost
    
  6. Unzip the Hadoop distribution at the same location in all machines using the following command:

    >tar -zxvf hadoop-1.0.0.tar.gz
    
  7. In all machines, edit the HADOOP_HOME/conf/hadoop-env.sh file by uncommenting the line that sets JAVA_HOME and pointing it to your local Java installation. For example, if Java is in /opt/jdk1.6, change the line to export JAVA_HOME=/opt/jdk1.6.
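
     After the edit, the relevant line in hadoop-env.sh should look like the following (the /opt/jdk1.6 path is only an example; use your own Java installation directory):

    export JAVA_HOME=/opt/jdk1.6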

  8. Place the IP address of the node used as the master (for running the job tracker and name node) in HADOOP_HOME/conf/masters on a single line. If you are doing a single-node deployment, leave the current value of localhost as it is.

    209.126.198.72
    
  9. Place the IP addresses of all slave nodes in the HADOOP_HOME/conf/slaves file, each on a separate line.

    209.126.198.72
    209.126.198.71
    
  10. Inside each node's HADOOP_HOME/conf directory, add the following to the core-site.xml, hdfs-site.xml, and mapred-site.xml files. Before adding the configurations, replace MASTER_NODE with the IP of the master node and HADOOP_DATA_DIR with the directory you created in step 1.

  11. Add the URL of the name node to HADOOP_HOME/conf/core-site.xml as follows:

    <configuration>
    <property>
    <name>fs.default.name</name>
    <value>hdfs://MASTER_NODE:9000/</value>
    </property>
    </configuration>
  12. Add locations to store metadata (names) and data within HADOOP_HOME/conf/hdfs-site.xml as follows:

    <configuration>
    <property>
    <name>dfs.name.dir</name>
    <value>HADOOP_DATA_DIR/name</value>
    </property>
    <property>
    <name>dfs.data.dir</name>
    <value>HADOOP_DATA_DIR/data</value>
    </property>
    </configuration>
  13. The MapReduce local directory is the location Hadoop uses to store temporary files. Also add the job tracker location to HADOOP_HOME/conf/mapred-site.xml; the Hadoop client will use this job tracker when submitting jobs. The final property sets the maximum number of map tasks per node; set this to the number of CPU cores on the machine.

    <configuration>
    <property>
    <name>mapred.job.tracker</name>
    <value>MASTER_NODE:9001</value>
    </property>
    <property>
    <name>mapred.local.dir</name>
    <value>HADOOP_DATA_DIR/local</value>
    </property>
    <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
    </property>
    </configuration>
  14. Format a new HDFS filesystem by running the following command from the Hadoop name node (aka master node).

    >bin/hadoop namenode -format
    ...
    /Users/srinath/playground/hadoop-book/hadoop-temp/dfs/name has been successfully formatted.
    
  15. In the master node, change the directory to HADOOP_HOME and run the following commands:

    >bin/start-dfs.sh 
    >bin/start-mapred.sh 
    
  16. Verify the installation by listing the Java processes through the ps -ef | grep java command. The master node will list four processes: the name node, data node, job tracker, and task tracker, while each slave will have a data node and a task tracker.
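
     Alternatively, if the JDK's jps tool is on your PATH, it gives a cleaner view of the running Hadoop daemons (you may also see a SecondaryNameNode process). The process IDs below are only placeholders; on a single-node setup the output should look similar to the following:

    >jps
    4850 NameNode
    4973 DataNode
    5103 JobTracker
    5230 TaskTracker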

  17. Browse the web-based monitoring pages for the name node and the job tracker: NameNode at http://MASTER_NODE:50070/ and JobTracker at http://MASTER_NODE:50030/.

  18. You can find the log files in ${HADOOP_HOME}/logs.
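
     For example, a quick way to scan all the logs for problems (a simple sketch; adjust the pattern to the daemon you are interested in) is:

    >grep -i error ${HADOOP_HOME}/logs/*.log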

  19. Make sure the HDFS setup is OK by listing the files using the HDFS command line:

    bin/hadoop dfs -ls /
    Found 2 items
    drwxr-xr-x   - srinath supergroup    0 2012-04-09 08:47 /Users
    drwxr-xr-x   - srinath supergroup    0 2012-04-09 08:47 /tmp
    
  20. Download the Amazon dataset from http://snap.stanford.edu/data/bigdata/amazon/amazon-meta.txt.gz and unzip it. We will call the directory where it is stored DATA_DIR. The dataset is about 1 gigabyte; if you want your executions to finish faster, you can use only a subset of the dataset.
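
     If you decide to work with a subset, one simple way to create one from the unzipped file (a sketch; the line count is arbitrary, and truncating the last record mid-way is harmless for a word count) is:

    >head -n 1000000 amazon-meta.txt > amazon-meta-subset.txt

     If you use the subset, upload amazon-meta-subset.txt in place of the full file in the later steps.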

  21. Copy the hadoop-microbook.jar file from SAMPLE_DIR to HADOOP_HOME.
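
     For example, using the same placeholder notation as in the later steps:

    >cp <SAMPLE_DIR>/hadoop-microbook.jar <HADOOP_HOME>/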

  22. If you have not already done so, upload the Amazon dataset to the HDFS filesystem using the following commands:

    >bin/hadoop dfs -mkdir /data/
    >bin/hadoop dfs -mkdir /data/amazon-dataset
    >bin/hadoop dfs -put <DATA_DIR>/amazon-meta.txt /data/amazon-dataset/
    >bin/hadoop dfs -ls /data/amazon-dataset
    
  23. Run the MapReduce job through the following command from HADOOP_HOME:

    $ bin/hadoop jar hadoop-microbook.jar microbook.wordcount.WordCount /data/amazon-dataset /data/wordcount-output
    
  24. You can find the results of the MapReduce job in the output directory. Use the following command to list the contents:

    $ bin/hadoop dfs -ls /data/wordcount-output
    

How it works...

As described in the introduction to the chapter, a Hadoop installation consists of HDFS nodes, a job tracker, and worker nodes. When we start the name node, it finds the slaves through the HADOOP_HOME/conf/slaves file and uses SSH to start the data nodes on the remote servers. Similarly, when we start the job tracker, it finds the slaves through the HADOOP_HOME/conf/slaves file and starts the task trackers.

When we run the MapReduce job, the client finds the job tracker from the configuration and submits the job to it. The client waits for the execution to finish, receiving the job's standard output and printing it to the console for as long as the job is running.