Hadoop MapReduce v2 Cookbook - Second Edition

Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution


The Hadoop YARN ecosystem now contains many useful components providing a wide range of data processing, storage, and querying capabilities for the data stored in HDFS. However, manually installing and configuring all of these components to work together correctly using individual release artifacts is quite a challenging task. Monitoring and maintaining such a cluster and its many Hadoop components poses further challenges.

Fortunately, several commercial software vendors provide well-integrated, packaged Hadoop distributions that make it much easier to provision and maintain a Hadoop YARN ecosystem in our clusters. These distributions often come with easy GUI-based installers that guide you through the whole installation process and allow you to select and install the components that you require in your Hadoop cluster. They also provide tools to easily monitor the cluster and to perform maintenance operations. For regular production clusters, we recommend using a packaged Hadoop distribution from one of the well-known vendors to make your Hadoop journey much easier. Some of these commercial Hadoop distributions (or editions of the distribution) have licenses that allow us to use them free of charge with optional paid support agreements.

Hortonworks Data Platform (HDP) is one such well-known Hadoop YARN distribution that is available free of charge. All the components of HDP are available as free and open source software. You can download HDP from http://hortonworks.com/hdp/downloads/. Refer to the installation guides available in the download page for instructions on the installation.

Cloudera CDH is another well-known Hadoop YARN distribution. The Express edition of CDH is available free of charge. Some components of the Cloudera distribution are proprietary and available only for paying clients. You can download Cloudera Express from http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-express.html. Refer to the installation guides available on the download page for instructions on the installation.

Hortonworks HDP, Cloudera CDH, and some of the other vendors provide fully configured quick start virtual machine images that you can download and run on your local machine using a virtualization software product. These virtual machines are an excellent resource to learn and try the different Hadoop components as well as for evaluation purposes before deciding on a Hadoop distribution for your cluster.

Apache Bigtop is an open source project that aims to provide packaging and integration/interoperability testing for the various Hadoop ecosystem components. Bigtop also provides a vendor-neutral packaged Hadoop distribution. While it is not as sophisticated as the commercial distributions, Bigtop is easier to install and maintain than using release binaries of each of the Hadoop components. In this recipe, we provide steps to use Apache Bigtop to install the Hadoop ecosystem on your local machine.

Any of the distributions mentioned earlier, including Bigtop, is suitable for following the recipes and executing the samples provided in this book. However, when possible, we recommend using Hortonworks HDP, Cloudera CDH, or other commercial Hadoop distributions.

Getting ready

This recipe provides instructions for the CentOS and Red Hat operating systems. Stop any Hadoop services that you started in the previous recipes.

How to do it...

The following steps will guide you through the installation process of a Hadoop cluster using Apache Bigtop for the CentOS and Red Hat operating systems. Please adapt the commands accordingly for other Linux-based operating systems.

  1. Install the Bigtop repository:

    $ sudo wget -O \
    /etc/yum.repos.d/bigtop.repo \
    http://www.apache.org/dist/bigtop/stable/repos/centos6/bigtop.repo
    
  2. Search for Hadoop:

    $ yum search hadoop
    
  3. Install Hadoop v2 using yum. This installs the Hadoop v2 components (MapReduce, HDFS, and YARN) together with the ZooKeeper dependency:

    $ sudo yum install hadoop\*
    
  4. Use your favorite editor to add the following line to the /etc/default/bigtop-utils file, pointing JAVA_HOME to a JDK 1.6 or later installation (Oracle JDK 1.7 or higher is preferred):

    export JAVA_HOME=/usr/java/default/

  5. Initialize and format the NameNode:

    $ sudo /etc/init.d/hadoop-hdfs-namenode init
    
  6. Start the Hadoop NameNode service:

    $ sudo service hadoop-hdfs-namenode start
    
  7. Start the Hadoop DataNode service:

    $ sudo service hadoop-hdfs-datanode start
    
  8. Run the following script to create the necessary directories in HDFS:

    $ sudo /usr/lib/hadoop/libexec/init-hdfs.sh
    
  9. Create your home directory in HDFS and apply the necessary permissions:

    $ sudo su -s /bin/bash hdfs \
    -c "/usr/bin/hdfs dfs -mkdir /user/${USER}"
    $ sudo su -s /bin/bash hdfs \
    -c "/usr/bin/hdfs dfs -chmod -R 755 /user/${USER}"
    $ sudo su -s /bin/bash hdfs \
    -c "/usr/bin/hdfs dfs -chown ${USER} /user/${USER}"
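
Note that because ${USER} appears inside double quotes, it is expanded by your own login shell before su switches to the hdfs user; the hdfs superuser therefore creates and assigns a directory named after you, which is exactly what we want. The following sketch (using a hypothetical user name in place of ${USER}) demonstrates this expansion order:

```shell
# DEMO_USER stands in for the ${USER} of your login shell (hypothetical value).
DEMO_USER="alice"

# Because the string is double quoted, ${DEMO_USER} is substituted now, by the
# current shell, before any su/sudo would hand the command to another user.
cmd="/usr/bin/hdfs dfs -mkdir /user/${DEMO_USER}"
echo "$cmd"   # -> /usr/bin/hdfs dfs -mkdir /user/alice
```

Had the command been single quoted instead, the variable would only be expanded by the shell that su starts for the hdfs user.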
    
  10. Start the YARN ResourceManager, the NodeManager, and the MapReduce JobHistory server:

    $ sudo service hadoop-yarn-resourcemanager start
    $ sudo service hadoop-yarn-nodemanager start
    $ sudo service hadoop-mapreduce-historyserver start
    
  11. Try the following commands to verify the installation:

    $ hadoop fs -ls /
    $ hadoop jar \
    /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    pi 10 1000
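
The pi example job estimates the value of π by sampling: each of the 10 map tasks generates 1,000 points in the unit square and counts how many fall inside the inscribed quarter circle. A rough local sketch of the underlying estimation (the Hadoop example actually uses a quasi-Monte Carlo variant, but the idea is the same) can be run with awk:

```shell
# Plain Monte Carlo sketch of the pi estimation: sample n random points in
# the unit square and count the fraction inside the quarter circle of
# radius 1; that fraction approximates pi/4.
pi_est=$(awk 'BEGIN {
  srand(7)                    # fixed seed for repeatability
  n = 100000; inside = 0
  for (i = 0; i < n; i++) {
    x = rand(); y = rand()
    if (x * x + y * y <= 1.0)
      inside++
  }
  printf "%.3f", 4.0 * inside / n
}')
echo "$pi_est"                # close to 3.14 for large n
```

The Hadoop job parallelizes exactly this kind of counting: each mapper reports its inside/outside counts, and a single reducer combines them into the final estimate.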
    
  12. You can also monitor the status of HDFS using the monitoring console available at http://<namenode_ip>:50070.
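
Besides the web console, the NameNode also exposes its metrics as JSON through a JMX endpoint (http://<namenode_ip>:50070/jmx), which is handy for scripted health checks. The snippet below parses a sample of that JSON with basic text tools to extract the NameNode state; the JSON is hardcoded here as an assumed example, since parsing it live requires a running cluster:

```shell
# Hypothetical sample of the JSON returned by the NameNode JMX endpoint; on a
# live cluster you would fetch it with:
#   curl -s 'http://localhost:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus'
sample='{"beans":[{"name":"Hadoop:service=NameNode,name=NameNodeStatus","State":"active"}]}'

# Extract the value of the "State" attribute: grep isolates the key/value
# pair, and cut takes the fourth double-quote-delimited field (the value).
state=$(printf '%s' "$sample" | grep -o '"State":"[^"]*"' | cut -d'"' -f4)
echo "$state"   # -> active
```

A check like this can be dropped into a cron job or monitoring script to alert when the NameNode is not in the expected state.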

  13. Install Hive, HBase, Mahout, and Pig using Bigtop as follows:

    $ sudo yum install hive\* hbase\* mahout\* pig\*
    

There's more...