
Setting up HDFS


HDFS is a block-structured distributed filesystem designed to store petabytes of data reliably on top of clusters of commodity hardware. HDFS supports storing massive amounts of data and provides high-throughput access to that data. HDFS stores file data across multiple nodes with redundancy to ensure fault tolerance and high aggregate bandwidth.

HDFS is the default distributed filesystem used by Hadoop MapReduce computations. Hadoop supports data-locality-aware processing of data stored in HDFS. The HDFS architecture consists mainly of a centralized NameNode that handles the filesystem metadata and DataNodes that store the actual data blocks. HDFS data blocks are coarse grained, and HDFS performs best with large streaming reads.
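
As an illustration of this block-level layout, once the cluster set up in this recipe is running, the fsck utility can be used to see how a file is split into replicated blocks across the DataNodes (the file path below is only an example):

    $ $HADOOP_HOME/bin/hdfs fsck /user/test/data.txt -files -blocks -locations

The output lists each block of the file together with the DataNodes that hold its replicas.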

To set up HDFS, we first need to configure a NameNode and DataNodes, and then specify the DataNodes in the slaves file. When we start the NameNode, the startup script will start the DataNodes.

Tip

Installing HDFS directly using Hadoop release artifacts, as mentioned in this recipe, is recommended only for development, testing, and advanced use cases. For regular production clusters, we recommend using a packaged Hadoop distribution, as mentioned in the Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution recipe. Packaged Hadoop distributions make it much easier to install, configure, maintain, and update the components of the Hadoop ecosystem.

Getting ready

You can follow this recipe using either a single machine or multiple machines. If you are using multiple machines, you should choose one machine as the master node, where you will run the HDFS NameNode. If you are using a single machine, use it as both the NameNode and the DataNode.

  1. Install JDK 1.6 or above (Oracle JDK 1.7 is preferred) on all machines that will be used to set up the HDFS cluster. Set the JAVA_HOME environment variable to point to the Java installation (see the example after this list).

  2. Download Hadoop by following the Setting up Hadoop v2 on your local machine recipe.
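
As a minimal example for step 1, JAVA_HOME can be set in the shell profile of each machine. The JDK path shown here is only an assumption; adjust it to match your installation:

    $ export JAVA_HOME=/usr/lib/jvm/java-7-oracle
    $ echo "export JAVA_HOME=/usr/lib/jvm/java-7-oracle" >> ~/.bashrc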

How to do it...

Now let's set up HDFS in distributed mode:

  1. Set up password-less SSH from the master node, which will be running the NameNode, to the DataNodes. Check that you can log in to localhost and to all other nodes using SSH without a passphrase by running one of the following commands:

    $ ssh localhost
    $ ssh <IPaddress>
    

    Tip

    Configuring password-less SSH

    If the command in step 1 returns an error or asks for a password, create SSH keys by executing the following command (you may have to manually enable SSH beforehand depending on your OS):

    $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
    

    Copy the ~/.ssh/id_dsa.pub file to all the nodes in the cluster. Then add the SSH key to the ~/.ssh/authorized_keys file in each node by running the following commands (if the authorized_keys file does not exist, first run the following command; otherwise, skip to the cat command):

    $ touch ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys
    

    Now with permissions set, add your key to the ~/.ssh/authorized_keys file:

    $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
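
    For the remote nodes, the ssh-copy-id utility, where available, appends the public key to the remote authorized_keys file in a single step (the username and hostname below are placeholders):

    $ ssh-copy-id -i ~/.ssh/id_dsa.pub user@datanode1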
    

    Then you should be able to execute the following command successfully, without providing a password:

    $ ssh localhost
    
  2. In each server, create a directory for storing HDFS data. Let's call that directory {HADOOP_DATA_DIR}. Create two subdirectories inside the data directory as {HADOOP_DATA_DIR}/data and {HADOOP_DATA_DIR}/name. Change the directory permissions to 755 by running the following command for each directory:

    $ chmod -R 755 {HADOOP_DATA_DIR}
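
    For example, assuming /hadoop_data is used as {HADOOP_DATA_DIR} (an arbitrary path chosen only for illustration), the directories can be created and the permissions applied as follows:

    $ mkdir -p /hadoop_data/data /hadoop_data/name
    $ chmod -R 755 /hadoop_data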
    
  3. In the NameNode, add the IP addresses of all the slave nodes, each on a separate line, to the {HADOOP_HOME}/etc/hadoop/slaves file. When we start the NameNode, it will use this slaves file to start the DataNodes.
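
    For example, a slaves file for a cluster with three DataNodes might look like the following (the IP addresses are placeholders):

    10.0.0.11
    10.0.0.12
    10.0.0.13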

  4. Add the following configurations to {HADOOP_HOME}/etc/hadoop/core-site.xml. Before adding the configurations, replace the {NAMENODE} strings with the IP of the master node:

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://{NAMENODE}:9000/</value>
      </property>
    </configuration>
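
    Once core-site.xml is in place, the effective value can be checked with the getconf utility, which simply reads the configuration and does not require the cluster to be running:

    $ $HADOOP_HOME/bin/hdfs getconf -confKey fs.defaultFS
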
  5. Add the following configurations to the {HADOOP_HOME}/etc/hadoop/hdfs-site.xml file. Before adding the configurations, replace {HADOOP_DATA_DIR} with the directory you created in step 2. Replicate the core-site.xml and hdfs-site.xml files we modified in steps 4 and 5 to all the nodes.

    <configuration>
      <property>
        <name>dfs.namenode.name.dir</name>
        <!-- Path to store namespace and transaction logs -->
        <value>{HADOOP_DATA_DIR}/name</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <!-- Path to store data blocks in datanode -->
        <value>{HADOOP_DATA_DIR}/data</value>
      </property>
    </configuration>
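
    To replicate the modified configuration files to the other nodes, a command along the following lines can be used, assuming Hadoop is installed at the same path on all nodes (the username and hostname are placeholders):

    $ scp $HADOOP_HOME/etc/hadoop/core-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml user@datanode1:$HADOOP_HOME/etc/hadoop/
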
  6. From the NameNode, run the following command to format a new filesystem:

    $ $HADOOP_HOME/bin/hdfs namenode -format
    

    You will see the following line in the output after the successful completion of the previous command:

    …
    13/04/09 08:44:51 INFO common.Storage: Storage directory /…/dfs/name has been successfully formatted.
    ….
  7. Start the HDFS using the following command:

    $ $HADOOP_HOME/sbin/start-dfs.sh
    

    This command will first start a NameNode on the master node. Then it will start the DataNode services on the machines mentioned in the slaves file. Finally, it will start the secondary NameNode.
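
    As a quick sanity check, the jps utility that ships with the JDK can be run on each node: the master node should list NameNode (and typically SecondaryNameNode) processes, while each slave node should list a DataNode process. The process IDs in the following illustrative output will differ on your system:

    $ jps
    4825 NameNode
    5046 SecondaryNameNode
    5173 Jps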

  8. HDFS comes with a monitoring web console to verify the installation and to monitor the HDFS cluster. It also lets users explore the contents of the HDFS filesystem. The HDFS monitoring console can be accessed from http://{NAMENODE}:50070/. Visit the monitoring console and verify whether you can see the HDFS startup page. Here, replace {NAMENODE} with the IP address of the node running the HDFS NameNode.

  9. Alternatively, you can use the following command to get a report about the HDFS status:

    $ $HADOOP_HOME/bin/hdfs dfsadmin -report
    
  10. Finally, shut down the HDFS cluster using the following command:

    $ $HADOOP_HOME/sbin/stop-dfs.sh
    

See also

  • In the HDFS command-line file operations recipe, we will explore how to use HDFS to store and manage files.

  • The HDFS setup is only a part of the Hadoop installation. The Setting up Hadoop YARN in a distributed cluster environment using Hadoop v2 recipe describes how to set up the rest of Hadoop.

  • The Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution recipe explores how to use a packaged Hadoop distribution to install the Hadoop ecosystem in your cluster.