Hadoop Real-World Solutions Cookbook - Second Edition

By: Tanmay Deshpande

Installing a multi-node Hadoop cluster


Now that we are comfortable with a single-node Hadoop installation, it's time to set up a multi-node Hadoop cluster.

Getting ready

In the previous recipe, we used a single Ubuntu machine for installation; in this recipe, we will be using three Ubuntu machines. If you are trying out Hadoop on your own and don't have three machines available for this recipe, I would suggest getting three AWS EC2 Ubuntu machines. I am using t2.small EC2 instances. For more information on this, go to https://aws.amazon.com/ec2/.

Apart from this, I've also performed the following configuration steps on all the EC2 instances I am using:

  1. Create an AWS security group that allows traffic to the EC2 instances, and add all three instances to this security group.

  2. Change the hostname of each EC2 instance to its public hostname, like this:

    sudo hostname ec2-52-10-22-65.us-west-2.compute.amazonaws.com
    
  3. Disable the firewall on each EC2 Ubuntu instance:

    sudo ufw disable
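
Once these steps are done, you can quickly confirm that they took effect on each instance; the hostname command should print the public hostname you set, and ufw should report that it is inactive:

    # Print the current hostname; it should match the public EC2 hostname
    hostname
    # Confirm the firewall is disabled (expected output: Status: inactive)
    sudo ufw status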
    

How to do it...

There are a lot of similarities between single-node and multi-node Hadoop installations, so instead of repeating the steps, I would suggest that you refer to earlier recipes as and when they're mentioned. So, let's start installing a multi-node Hadoop cluster:

  1. Install Java and Hadoop, as discussed in the previous recipe, on the master and slave nodes. Refer to steps 1-5 in the previous recipe.
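
    If you want to quickly confirm that both installations succeeded on every node, checking the reported versions is usually enough (this assumes Hadoop is installed under /usr/local/hadoop, as in the previous recipe):

    # Both commands should print version information without errors
    java -version
    /usr/local/hadoop/bin/hadoop version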

  2. AWS EC2 Ubuntu instances come with ssh preinstalled, so there's no need to install it again. To configure passwordless logins, we need to perform the following steps.

    First, copy the PEM key with which you launched the EC2 instances to the master node. Next, execute the following set of commands, which add the key to the master's ssh agent so that it can be used for passwordless logins to the slave machines:

    eval `ssh-agent -s`
    chmod 644 $HOME/.ssh/authorized_keys
    chmod 400 <my-pem-key>.pem
    ssh-add <my-pem-key>.pem
    

    But if you are not using AWS EC2, then you need to generate an ssh key on the master and copy it to the slave machines. Here are sample commands to do this:

    ssh-keygen -t rsa -P ""
    ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@slave1
    ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@slave2
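
    Whichever approach you use, it is worth verifying passwordless ssh from the master before moving on; for example, assuming the slaves are reachable as slave1 and slave2:

    # Each command should print the slave's hostname without prompting for a password
    ssh ubuntu@slave1 hostname
    ssh ubuntu@slave2 hostname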
    
  3. Next, we need to perform the Hadoop configuration. Most of the configuration files are the same as they were for the single-node installation, and these configurations are the same for all the nodes in the cluster. Refer to step 8 from the previous recipe for hadoop-env.sh, mapred-site.xml, and hdfs-site.xml. For core-site.xml and yarn-site.xml, we need to add some more properties, as shown here:

    Edit core-site.xml and add the host and port on which you wish to run the NameNode. As this is a multi-node Hadoop cluster installation, we will add the master's hostname instead of localhost:

    <configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://<master's-host-name>:9000/</value>
    </property>
    </configuration>
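
    If you want to double-check the value Hadoop resolves for the default filesystem once this file is in place, the hdfs getconf utility can be used (fs.defaultFS is the newer name that the deprecated fs.default.name key maps to, and this assumes the Hadoop bin directory is on your PATH):

    hdfs getconf -confKey fs.defaultFS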

    Edit yarn-site.xml and add the following properties. As this is a multi-node installation, we also need to provide the address of the machine where ResourceManager is running:

    <configuration>
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
        </property>
        <property>
          <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
          <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
            <name>yarn.resourcemanager.hostname</name>
            <value><master's-host-name></value>
        </property>
    </configuration>

    In the case of hdfs-site.xml, in the previous recipe we set the replication factor to 1; as this is a multi-node cluster, we set it to 3 here. Don't forget to create the storage folders configured in hdfs-site.xml (a sample configuration is shown below).

    These configurations need to be made on all the machines of the cluster.
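
    For reference, a minimal hdfs-site.xml for this cluster might look like the following; the storage paths are only examples, so use whichever directories you configured in the previous recipe:

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>3</value>
        </property>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>/usr/local/hadoop/hdfs/namenode</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>/usr/local/hadoop/hdfs/datanode</value>
        </property>
    </configuration>

    Create the corresponding folders on every node, for example:

    mkdir -p /usr/local/hadoop/hdfs/namenode /usr/local/hadoop/hdfs/datanode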

  4. Now that we are done with the configurations, execute the namenode format command on the master node so that it creates the required subfolder structure:

    hadoop namenode -format
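
    Note that hadoop namenode -format is the older form of this command; on Hadoop 2.X it still works but is reported as deprecated, and the equivalent hdfs command is:

    hdfs namenode -format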
    
  5. Now, we need to start specific services on specific nodes in order to start the cluster.

    On the master node, execute the following:

    /usr/local/hadoop/sbin/hadoop-daemon.sh start namenode
    /usr/local/hadoop/sbin/hadoop-daemon.sh start secondarynamenode
    /usr/local/hadoop/sbin/yarn-daemon.sh start resourcemanager
    

    On all slave nodes, execute the following:

    /usr/local/hadoop/sbin/hadoop-daemon.sh start datanode
    /usr/local/hadoop/sbin/yarn-daemon.sh start nodemanager
    

    If everything goes well, you should see the cluster running properly. You can also check out the web interfaces for the NameNode and the ResourceManager. For the NameNode, go to http://<master-ip-hostname>:50070/.

    For the ResourceManager, go to http://<master-ip-hostname>:8088/.
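
    You can also verify the cluster from the command line. Running jps (which ships with the JDK) on each node lists the Java daemons: on the master, you should see NameNode, SecondaryNameNode, and ResourceManager, and on each slave, DataNode and NodeManager. A NameNode report additionally shows whether all of your DataNodes have registered as live nodes:

    jps
    hdfs dfsadmin -report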

How it works...

Refer to the How it works section from the previous recipe.