Setting up HDFS

HDFS is a block-structured distributed filesystem designed to store petabytes of data reliably on clusters of commodity hardware. HDFS supports storing massive amounts of data and provides high-throughput access to that data. It stores file data across multiple nodes with redundancy to ensure fault tolerance and high aggregate bandwidth.

HDFS is the default distributed filesystem used by Hadoop MapReduce computations. Hadoop supports data-locality-aware processing of data stored in HDFS. The HDFS architecture consists mainly of a centralized NameNode that handles the filesystem metadata and DataNodes that store the actual data blocks. HDFS data blocks are coarse-grained (128 MB by default in Hadoop v2) and perform best with large streaming reads.
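
As a concrete check, once the cluster built in this recipe is running, you can query the configured block size; the Hadoop v2 default is 128 MB (134217728 bytes), unless it has been overridden:

    $ $HADOOP_HOME/bin/hdfs getconf -confKey dfs.blocksize
    134217728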

To set up HDFS, we first need to configure a NameNode and DataNodes, and then specify the DataNodes in the slaves file. When we start HDFS, the startup script uses the slaves file to start the DataNodes.

Tip

Installing HDFS directly from the Hadoop release artifacts, as described in this recipe, is recommended only for development, testing, and advanced use cases. For regular production clusters, we recommend using a packaged Hadoop distribution, as described in the Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution recipe. Packaged Hadoop distributions make it much easier to install, configure, maintain, and update the components of the Hadoop ecosystem.

Getting ready

You can follow this recipe using either a single machine or multiple machines. If you are using multiple machines, choose one machine as the master node, where you will run the HDFS NameNode. If you are using a single machine, use it as both the NameNode and the DataNode.

  1. Install JDK 1.6 or above (Oracle JDK 1.7 is preferred) on all machines that will be used to set up the HDFS cluster. Set the JAVA_HOME environment variable to point to the Java installation, as shown in the sketch after this list.
  2. Download Hadoop by following the Setting up Hadoop v2 on your local machine recipe.
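
The following is a minimal sketch for step 1, assuming the JDK is installed under /usr/lib/jvm/java-7-oracle (a hypothetical path; adjust it for your machines). Add the export lines to your shell profile to make them persistent:

    $ export JAVA_HOME=/usr/lib/jvm/java-7-oracle
    $ export PATH=$JAVA_HOME/bin:$PATH
    $ java -version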

How to do it...

Now let's set up HDFS in the distributed mode:

  1. Set up password-less SSH from the master node, which will be running the NameNode, to the DataNodes. Check that you can log in to localhost and to all other nodes using SSH without a passphrase by running the following commands (repeat the second command for each node):
    $ ssh localhost
    $ ssh <IPaddress>
    

    Tip

    Configuring password-less SSH

    If the command in step 1 returns an error or asks for a password, create SSH keys by executing the following command (you may have to manually enable SSH beforehand depending on your OS):

    $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
    

    Copy the ~/.ssh/id_dsa.pub file to all the nodes in the cluster. Then add the SSH key to the ~/.ssh/authorized_keys file in each node by running the following commands (if the authorized_keys file does not exist, first run the following command to create it with the correct permissions; otherwise, skip to the cat command):

    $ touch ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys
    

    Now with permissions set, add your key to the ~/.ssh/authorized_keys file:

    $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
    

    Then you should be able to execute the following command successfully, without providing a password:

    $ ssh localhost
    
  2. On each server, create a directory for storing HDFS data. Let's call that directory {HADOOP_DATA_DIR}. Create two subdirectories inside it, {HADOOP_DATA_DIR}/data and {HADOOP_DATA_DIR}/name. Change the directory permissions to 755 by running the following command:
    $ chmod -R 755 {HADOOP_DATA_DIR}
    
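    For example, a minimal sketch for this step, using /opt/hadoop-data as a hypothetical {HADOOP_DATA_DIR}:

    $ mkdir -p /opt/hadoop-data/data /opt/hadoop-data/name
    $ chmod -R 755 /opt/hadoop-data
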
  3. On the NameNode, add the IP addresses of all the slave nodes, each on a separate line, to the {HADOOP_HOME}/etc/hadoop/slaves file. When we start HDFS, the startup script will use this slaves file to start the DataNodes.
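    For example, a slaves file for a three-DataNode cluster could look like the following (hypothetical IP addresses; use your own):

    10.0.0.2
    10.0.0.3
    10.0.0.4
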
  4. Add the following configurations to {HADOOP_HOME}/etc/hadoop/core-site.xml. Before adding the configurations, replace the {NAMENODE} strings with the IP of the master node:
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://{NAMENODE}:9000/</value>
      </property>
    </configuration>
  5. Add the following configuration to the {HADOOP_HOME}/etc/hadoop/hdfs-site.xml file. Before adding the configuration, replace {HADOOP_DATA_DIR} with the directory you created in step 2. Then replicate the core-site.xml and hdfs-site.xml files we modified in steps 4 and 5 to all the nodes, as shown in the sketch after the configuration:
    <configuration>
      <property>
        <name>dfs.namenode.name.dir</name>
        <!-- Path to store namespace and transaction logs -->
        <value>{HADOOP_DATA_DIR}/name</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <!-- Path to store data blocks in datanode -->
        <value>{HADOOP_DATA_DIR}/data</value>
      </property>
    </configuration>
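
    The following is a minimal sketch of that replication, assuming {HADOOP_HOME} resolves to the same path on every node and reusing the hypothetical DataNode IP addresses from step 3:

    $ for node in 10.0.0.2 10.0.0.3 10.0.0.4; do
        scp {HADOOP_HOME}/etc/hadoop/core-site.xml \
            {HADOOP_HOME}/etc/hadoop/hdfs-site.xml \
            $node:{HADOOP_HOME}/etc/hadoop/
      done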
  6. From the NameNode, run the following command to format a new filesystem:
    $ $HADOOP_HOME/bin/hdfs namenode -format
    

    You will see the following line in the output after the successful completion of the previous command:

    …
    13/04/09 08:44:51 INFO common.Storage: Storage directory /…/dfs/name has been successfully formatted.
    ….
  7. Start the HDFS using the following command:
    $ $HADOOP_HOME/sbin/start-dfs.sh
    

    This command will first start a NameNode on the master node. Then it will start the DataNode services on the machines listed in the slaves file. Finally, it will start the secondary NameNode.

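    You can verify that the daemons started by running the jps command (shipped with the JDK) on each node; the process IDs shown below are illustrative and will differ. On the master node, the listing should include NameNode and, shortly after, SecondaryNameNode; on each slave node, it should include DataNode:

    $ jps
    14690 NameNode
    14865 SecondaryNameNode
    14951 Jps
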
  8. HDFS comes with a monitoring web console that you can use to verify the installation and to monitor the HDFS cluster. It also lets users explore the contents of the HDFS filesystem. The console can be accessed at http://{NAMENODE}:50070/, where {NAMENODE} is the IP address of the node running the HDFS NameNode. Visit the console and verify that you can see the HDFS startup page.
  9. Alternatively, you can use the following command to get a report about the HDFS status:
    $ $HADOOP_HOME/bin/hdfs dfsadmin -report
    
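    As a quick smoke test, you can also create a directory in HDFS and list it; these commands are covered in detail in the HDFS command-line file operations recipe:

    $ $HADOOP_HOME/bin/hdfs dfs -mkdir /test
    $ $HADOOP_HOME/bin/hdfs dfs -ls /
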
  10. Finally, shut down the HDFS cluster using the following command:
    $ $HADOOP_HOME/sbin/stop-dfs.sh
    

See also

  • In the HDFS command-line file operations recipe, we will explore how to use HDFS to store and manage files.
  • The HDFS setup is only a part of the Hadoop installation. The Setting up Hadoop YARN in a distributed cluster environment using Hadoop v2 recipe describes how to set up the rest of Hadoop.
  • The Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution recipe explores how to use a packaged Hadoop distribution to install the Hadoop ecosystem in your cluster.