Hadoop Real-World Solutions Cookbook - Second Edition

By: Tanmay Deshpande

Overview of this book

Big data is now a core requirement: most organizations produce huge amounts of data every day. With the arrival of Hadoop-like tools, it has become easier to solve big data problems efficiently and at minimal cost. Grasping machine learning techniques will help you greatly in building predictive models and using this data to make the right decisions for your organization. Hadoop Real-World Solutions Cookbook gives readers insights into learning and mastering big data via recipes. The book not only clarifies most of the big data tools on the market but also provides best practices for using them. The recipes are based on the latest versions of Apache Hadoop 2.X, YARN, Hive, Pig, Sqoop, Flume, Apache Spark, Mahout, and many more ecosystem tools. This real-world-solutions cookbook is packed with handy recipes you can apply to your own everyday issues. Each chapter provides in-depth recipes that can be referenced easily. The book covers the latest technologies, such as YARN and Apache Spark, in detail, and on completing it you will be able to consider yourself a big data expert. This guide is an invaluable tutorial if you are planning to implement a big data warehouse for your business.

Installing a single-node Hadoop Cluster


In this recipe, we are going to learn how to install a single-node Hadoop cluster, which can be used for development and testing.

Getting ready

To install Hadoop, you need a machine with a UNIX-like operating system installed on it. You can choose any well-known distribution, such as Red Hat, CentOS, Ubuntu, Fedora, or Amazon Linux (the last applies if you are using Amazon Web Services instances).

Here, we will be using the Ubuntu distribution for demonstration purposes.

How to do it...

Let's start installing Hadoop:

  1. First of all, you need to download the required installers from the Internet. Here, we need to download Java and Hadoop installers. The following are the links to do this:

    For the Java download, choose the latest version of the available JDK from http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html.

    You can also use OpenJDK instead of the Oracle JDK.

    For the Hadoop 2.7 download, go to

    http://www.eu.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz.
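
    If your machine has direct Internet access and wget is installed, you can also fetch the Hadoop archive from the command line, for example:

    wget http://www.eu.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz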

  2. We will first install Java. Here, I am using /usr/local as the installation directory and the root user for all installations. You can choose a directory of your choice.

    Extract the downloaded tar.gz file (the file name will vary depending on the JDK version you downloaded) like this:

    tar -xzf java-7-oracle.tar.gz
    

    Rename the extracted folder to the shorter name java instead of java-7-oracle; this makes the folder name easier to remember and type.
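
    Assuming the archive was unpacked in /usr/local and created a folder named java-7-oracle, the rename could look like this:

    cd /usr/local
    mv java-7-oracle java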

    Alternatively, you can install Java using the apt-get package manager if your machine is connected to the Internet:

    sudo apt-get update
    sudo apt-get install openjdk-7-jdk
    
  3. Similarly, we will extract and configure Hadoop. We will also rename the extracted folder for easier access. Here, we will extract Hadoop to the path /usr/local:

    tar -xzf hadoop-2.7.0.tar.gz
    mv hadoop-2.7.0 hadoop
    
  4. Next, in order to use Java and Hadoop from any folder, we would need to add these paths to the ~/.bashrc file. The contents of the file get executed every time a user logs in:

    cd ~
    vi .bashrc
    

    Once the file is open, append the following environment variable settings to it. These variables are used by Java and Hadoop at runtime:

    export JAVA_HOME=/usr/local/java
    export PATH=$PATH:$JAVA_HOME/bin
    export HADOOP_INSTALL=/usr/local/hadoop
    export PATH=$PATH:$HADOOP_INSTALL/bin
    export PATH=$PATH:$HADOOP_INSTALL/sbin
    export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
    export HADOOP_COMMON_HOME=$HADOOP_INSTALL
    export HADOOP_HDFS_HOME=$HADOOP_INSTALL
    export YARN_HOME=$HADOOP_INSTALL
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
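
    If you would rather not close and reopen the terminal right away, you can also reload the file in the current session:

    source ~/.bashrc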
  5. In order to verify that the installation is correct, close the terminal and restart it. Then, check whether the Java and Hadoop versions can be seen:

    $ java -version
    java version "1.7.0_45"
    Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
    Java HotSpot(TM) Server VM (build 24.45-b08, mixed mode)
    
    $ hadoop version
    Hadoop 2.7.0
    Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r d4c8d4d4d203c934e8074b31289a28724c0842cf
    Compiled by jenkins on 2015-04-10T18:40Z
    Compiled with protoc 2.5.0
    From source with checksum a9e90912c37a35c3195d23951fd18f

    This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-2.7.0.jar.

  6. Now that Hadoop and Java are installed and verified, we need to install ssh (Secure Shell) if it's not already available by default. If you are connected to the Internet, execute the following commands. SSH is used to secure data transfers between nodes:

    sudo apt-get install openssh-client
    sudo apt-get install openssh-server
    
  7. Once the ssh installation is done, we need to configure ssh to allow passwordless access to remote hosts. Note that even though we are installing Hadoop on a single node, we need to perform this ssh configuration in order to securely access the localhost.

    First of all, we need to generate public and private keys by executing the following command:

    ssh-keygen -t rsa -P ""
    

    This will generate the private and public keys, by default in the $HOME/.ssh folder. In order to provide passwordless access, we need to append the public key to the authorized_keys file:

    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
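
    On some systems, ssh requires the key files to have restrictive permissions before it will accept them; if the connection test below still prompts for a password, you may need to run something like this:

    chmod 700 $HOME/.ssh
    chmod 600 $HOME/.ssh/authorized_keys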
    

    Let's check whether the ssh configuration is okay. To test it, connect to the localhost like this:

    ssh localhost
    

    This will prompt you to confirm whether to add this connection to the known_hosts file. Type yes, and you should be connected over ssh without being prompted for a password.

  8. Once the ssh configuration is done and verified, we need to configure Hadoop. The Hadoop configuration begins with adding various configuration parameters to the following default files:

    • hadoop-env.sh: This is where we need to perform the Java environment variable configuration.

    • core-site.xml: This is where we need to perform NameNode-related configurations.

    • yarn-site.xml: This is where we need to perform configurations related to Yet Another Resource Negotiator (YARN).

    • mapred-site.xml: This is where we need to set the MapReduce processing engine as YARN.

    • hdfs-site.xml: This is where we need to perform configurations related to Hadoop Distributed File System (HDFS).

    These configuration files can be found in the /usr/local/hadoop/etc/hadoop folder. If you install Hadoop as the root user, you will have access to edit these files, but if not, you will first need to get access to this folder before editing.
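
    One way to get that access (this assumes you want your current, non-root user to own the Hadoop installation directory; adjust the path if you installed Hadoop elsewhere) is to change the directory's ownership, for example:

    sudo chown -R $USER /usr/local/hadoop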

So, let's take a look at the configurations one by one.

  1. Configure hadoop-env.sh, and update the Java path like this:

    export JAVA_HOME=/usr/local/java

  2. Edit core-site.xml, and add the host and port on which you wish to run NameNode. Since this is a single-node installation, we will use localhost:

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000/</value>
      </property>
    </configuration>
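
    Note that fs.default.name is a deprecated alias in Hadoop 2.x; if you prefer, you can use the current property name instead, which has the same effect:

      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000/</value>
      </property>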
  3. Edit yarn-site.xml, and add the following properties to it:

    <configuration>
       <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
       </property>
       <property>
          <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
          <value>org.apache.hadoop.mapred.ShuffleHandler</value>
       </property>
    </configuration>

    The yarn.nodemanager.aux-services property tells NodeManager that an auxiliary service named mapreduce_shuffle is present and needs to be started. The second property tells NodeManager which class to use to implement that shuffle auxiliary service. This specific configuration is needed because MapReduce jobs involve the shuffling of key-value pairs.

  4. Next, edit mapred-site.xml to set the MapReduce processing framework to YARN. If mapred-site.xml does not exist yet, you can create it by copying the mapred-site.xml.template file that ships in the same folder:

    <configuration>
       <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
       </property>
    </configuration>
  5. Edit hdfs-site.xml to set the folder paths that can be used by NameNode and DataNode:

    <configuration>
       <property>
          <name>dfs.replication</name>
          <value>1</value>
       </property>
       <property>
          <name>dfs.namenode.name.dir</name>
          <value>file:/usr/local/store/hdfs/namenode</value>
       </property>
       <property>
          <name>dfs.datanode.data.dir</name>
          <value>file:/usr/local/store/hdfs/datanode</value>
       </property>
    </configuration>
  6. I am also setting the HDFS block replication factor to 1 as this is a single node cluster installation.

    We also need to make sure that we create the previously mentioned folders and change their ownership to suit the current user. You can choose folder paths of your own, as long as they match the configuration:

    sudo mkdir -p /usr/local/store/hdfs/namenode
    sudo mkdir -p /usr/local/store/hdfs/datanode
    sudo chown -R root:root /usr/local/store
    
  7. Now, it's time to format namenode so that it creates the required folder structure by default:

    hadoop namenode -format
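
    Note that in Hadoop 2.x the hdfs command is the preferred entry point for this operation; the older form shown above still works but prints a deprecation warning. The equivalent command is:

    hdfs namenode -format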
    
  8. The final step involves starting the Hadoop daemons. Here, we will execute two scripts: the first starts the HDFS daemons, and the second starts the YARN daemons:

    /usr/local/hadoop/sbin/start-dfs.sh
    

This will start the NameNode, secondary NameNode, and DataNode daemons. Next, start the YARN daemons:

/usr/local/hadoop/sbin/start-yarn.sh

This will start NodeManager and ResourceManager. You can execute the jps command to take a look at the running daemons:

$ jps
2184 DataNode
2765 NodeManager
2835 Jps
2403 SecondaryNameNode
2025 NameNode
2606 ResourceManager

We can also access the web portals for HDFS and YARN by accessing the following URLs:

  • For HDFS: http://<hostname>:50070/

  • For YARN: http://<hostname>:8088/

How it works...

Hadoop 2.0 was significantly reworked to address issues of scalability and high availability. In Hadoop 1.0, MapReduce was the only means of processing the data stored in HDFS; with the advent of YARN, MapReduce is now just one of several ways of processing data on Hadoop. The key difference between Hadoop 1.x and Hadoop 2.x is that, in 1.x, MapReduce handled both cluster resource management and data processing, whereas in 2.x, YARN takes over resource management and MapReduce runs as one application framework on top of it.

Now, let's try to understand how HDFS and YARN work.

Hadoop Distributed File System (HDFS)

HDFS is redundant, reliable storage for Hadoop. It consists of three important parts: NameNode, the secondary NameNode, and DataNodes. When a file needs to be processed on Hadoop, it first needs to be saved on HDFS. HDFS splits the file into data blocks of 64/128 MB and distributes them across the DataNodes. The blocks are replicated across DataNodes for reliability. NameNode stores the metadata about the blocks and their replicas. After a certain period of time, the metadata is checkpointed to the secondary NameNode; the default check interval is 60 seconds. We can modify this by setting the dfs.namenode.checkpoint.check.period property in hdfs-site.xml.
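
The property takes a value in seconds; for example, to make the default explicit, you could add the following to hdfs-site.xml:

<property>
  <name>dfs.namenode.checkpoint.check.period</name>
  <value>60</value>
</property>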

Yet Another Resource Negotiator (YARN)

YARN was developed to address scalability issues and to provide better management of jobs in Hadoop, and so far it has proved to be an effective solution. It is responsible for the management of the resources available in the cluster. It consists of two important components: ResourceManager (master) and NodeManager (worker). NodeManager provides a node-level view of the cluster, while ResourceManager provides a cluster-level view. When an application is submitted by an application client, the following things happen:

  • The application client talks to ResourceManager and provides details about the application.

  • ResourceManager makes a container request on behalf of the application to one of the worker nodes, and ApplicationMaster starts running within that container.

  • ApplicationMaster then makes subsequent requests for containers in which to execute tasks on other nodes.

  • These tasks run in their containers and report their progress back to ApplicationMaster. Once all the tasks are complete, the containers are deallocated and ApplicationMaster exits.

  • After this, the application client also exits.
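
Once the cluster is up, you can observe this from the command line; for example, the following YARN client command lists the applications that ResourceManager is currently tracking:

yarn application -list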

There's more...

Now that your single-node Hadoop cluster is up and running, you can try some HDFS file operations on it, such as creating a directory, copying a file from a local machine to HDFS, and so on. Here are some sample commands.

To list all the files in the HDFS root directory, take a look at this:

hadoop fs -ls /

To create a new directory, take a look at this:

hadoop fs -mkdir /input

To copy a file from the local machine to HDFS, take a look at this:

hadoop fs -copyFromLocal /usr/local/hadoop/LICENSE.txt /input

In order to access all the command options that are available, go to https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html.