Hadoop 2.x Administration Cookbook

By: Aman Singh

Overview of this book

Hadoop enables the distributed storage and processing of large datasets across clusters of computers. Learning how to administer Hadoop is crucial to exploiting its unique features. With this book, you will be able to overcome common problems encountered in Hadoop administration. The book begins by laying the foundation, showing you the steps needed to set up a Hadoop cluster and its various nodes. You will get a better understanding of how to maintain a Hadoop cluster, especially the HDFS layer, YARN, and MapReduce. Further on, you will explore durability and high availability of a Hadoop cluster. You'll get a better understanding of the schedulers in Hadoop and how to configure and use them for your tasks. You will also get hands-on experience with the backup and recovery options and the performance tuning aspects of Hadoop. Finally, you will get a better understanding of troubleshooting, diagnostics, and best practices in Hadoop administration. By the end of this book, you will have a proper understanding of working with Hadoop clusters and will also be able to secure and encrypt your Hadoop clusters and configure auditing for them.

Installing a single-node cluster - HDFS components


Usually the term cluster means a group of machines, but in this recipe, we will be installing various Hadoop daemons on a single node. The single machine will act as both the master and slave for the storage and processing layer.

Getting ready

You will need some information before stepping through this recipe.

Although Hadoop can be configured to run as the root user, it is good practice to run it as a non-privileged user. In this recipe, we are using a node named nn1.cluster1.com, preinstalled with CentOS 6.5.

Tip

Create a system user named hadoop and set a password for that user.

Install a JDK, which the Hadoop services will use. The minimum recommended JDK version is 1.7. OpenJDK can also be used.
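The version requirement above can be checked before going further. The following is a minimal sketch, not part of the recipe: java_version_ok is a hypothetical helper name, and it only tests the 1.7 minimum, not whether a much newer Java release is supported by Hadoop 2.7.3.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a Java version string meets the JDK 1.7 minimum.
# java_version_ok is a hypothetical helper, not part of Hadoop or the JDK.

java_version_ok() {
    # Accepts strings such as "1.7.0_45", "1.8.0_292" or "11.0.2".
    local version="$1" major minor
    major="${version%%.*}"
    major="${major%%[!0-9]*}"          # drop any non-numeric suffix
    [ -n "$major" ] || return 1
    if [ "$major" -eq 1 ]; then
        # Legacy scheme 1.<minor>.x: the second field is the real release.
        minor="${version#1.}"
        minor="${minor%%.*}"
        minor="${minor%%[!0-9]*}"
        [ -n "$minor" ] || return 1
        [ "$minor" -ge 7 ]
    else
        # Newer scheme: the first field is the release number.
        [ "$major" -ge 7 ]
    fi
}

if command -v java >/dev/null 2>&1; then
    # "java -version" prints to stderr; the version is the first quoted token.
    installed="$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')"
    if java_version_ok "$installed"; then
        echo "Java $installed meets the 1.7 minimum"
    else
        echo "Java $installed is older than 1.7 or could not be parsed"
    fi
else
    echo "java not found in PATH"
fi
```

If Java is not installed yet, the script only reports that fact, so it is safe to run before step 1.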

How to do it...

  1. Log in to the machine as the root user and install the JDK:

    # yum install jdk -y

    Alternatively, it can be installed from the RPM package:

    # rpm -ivh jdk-1.7u45.rpm
    
  2. Once Java is installed, set JAVA_HOME to the directory where Java is installed and export it as an environment variable, so that Hadoop can locate Java:

    # export JAVA_HOME=/usr/java/latest
    
  3. Now copy the tarball hadoop-2.7.3.tar.gz, which was built in the Build Hadoop section earlier in this chapter, to the home directory of the root user. To do this, log in to the node where Hadoop was built and execute the following command:

    # scp -r hadoop-2.7.3.tar.gz [email protected]:~/
    
  4. Create a directory named /opt/cluster to be used for Hadoop:

    # mkdir -p /opt/cluster
    
  5. Then extract hadoop-2.7.3.tar.gz into the directory created in the preceding step:

    # tar -xzvf hadoop-2.7.3.tar.gz -C /opt/cluster/
    
  6. Create a user named hadoop, if you haven't already, and set the password as hadoop:

    # useradd hadoop
    # echo hadoop | passwd --stdin hadoop
    
  7. Because the tarball was extracted by the root user, the directories and files under /opt/cluster are owned by root. Change the ownership to the hadoop user:

    # chown -R hadoop:hadoop /opt/cluster/
    
  8. Listing /opt/cluster at this point shows the extracted hadoop-2.7.3 directory.

  9. List /opt/cluster/hadoop-2.7.3 to see the directory structure inside it.

  10. The listing shows etc, bin, sbin, and other directories.

  11. The etc/hadoop directory contains the configuration files for the various Hadoop daemons. Some of the key files are core-site.xml, hdfs-site.xml, hadoop-env.sh, and mapred-site.xml, among others, which are explained in later sections.

  12. The directories bin and sbin contain the executables used to start and stop Hadoop daemons and to perform other operations such as filesystem listing, copying, deleting, and so on.

  13. To run a command such as /opt/cluster/hadoop-2.7.3/bin/hadoop, the complete path to the command needs to be specified. This is cumbersome, and can be avoided by setting the environment variable HADOOP_HOME and adding its bin and sbin directories to PATH.

  14. Similarly, other variables need to be set that point to the binaries and the configuration file locations.

  15. The environment file is set up system-wide so that any user can use the commands. Once the hadoopenv.sh file is in place, source it so that the variables defined in it are exported into the current shell.

  16. Change to the hadoop user using the command su - hadoop.

  17. Change to the /opt/cluster directory and create a symlink to the hadoop-2.7.3 directory.

  18. To verify that the preceding changes are in place, the user can execute the which hadoop or which java command, or execute the command hadoop directly without specifying the complete path.

  19. In addition to setting the environment as discussed, the user has to add the JAVA_HOME variable in the hadoop-env.sh file.

  20. The next thing is to set up the Namenode address, which specifies the host:port on which the Namenode will listen. This is done in the file core-site.xml.

  21. The key property here is fs.defaultFS, which holds that address.

  22. Next, configure the location where the Namenode will store its metadata. This can be any location, but a dedicated disk is recommended. It is configured in the file hdfs-site.xml.

  23. The next step is to format the Namenode. This will create an HDFS file system:

    $ hdfs namenode -format
    
  24. Similarly, add the property for the Datanode directory to hdfs-site.xml. Nothing needs to be changed in the core-site.xml file.

  25. Then the services need to be started for Namenode and Datanode:

    $ hadoop-daemon.sh start namenode
    $ hadoop-daemon.sh start datanode
    
  26. The command jps can be used to check for running daemons.
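Steps 13 to 18 describe an environment file and a symlink whose listings are not reproduced in the steps. The following is a minimal sketch of what they could look like. The file location /etc/profile.d/hadoopenv.sh, the symlink name hadoop, and the exact variable list are assumptions rather than the recipe's confirmed values. The sketch works inside a scratch directory (ROOT) so that it can be tried without touching the real system; on the node itself, ROOT would be empty and the file would be written as root.

```shell
#!/usr/bin/env bash
# Sketch of steps 13 to 18 inside a scratch directory.
# ROOT stands in for "/" so nothing on the real system is modified.
ROOT="$(mktemp -d)"
mkdir -p "$ROOT/etc/profile.d" \
         "$ROOT/opt/cluster/hadoop-2.7.3/bin" \
         "$ROOT/opt/cluster/hadoop-2.7.3/sbin"

# Steps 13-14: one system-wide file that defines HADOOP_HOME and friends.
# The path and the variable list are assumptions made for this sketch.
cat > "$ROOT/etc/profile.d/hadoopenv.sh" <<EOF
export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=$ROOT/opt/cluster/hadoop
export HADOOP_CONF_DIR=\$HADOOP_HOME/etc/hadoop
export PATH=\$PATH:\$JAVA_HOME/bin:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin
EOF

# Step 17: a version-independent symlink (the name "hadoop" is assumed).
( cd "$ROOT/opt/cluster" && ln -s hadoop-2.7.3 hadoop )

# Step 15: load the variables into the current shell.
. "$ROOT/etc/profile.d/hadoopenv.sh"

# Step 18: confirm the variables are set and PATH was extended.
echo "HADOOP_HOME=$HADOOP_HOME"
echo "HADOOP_CONF_DIR=$HADOOP_CONF_DIR"
case ":$PATH:" in
    *":$HADOOP_HOME/bin:"*) echo "PATH contains \$HADOOP_HOME/bin" ;;
    *)                      echo "PATH is missing \$HADOOP_HOME/bin" ;;
esac
```

Pointing HADOOP_HOME at a symlink rather than at hadoop-2.7.3 directly means that only the symlink has to change when a different Hadoop version is unpacked next to it.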

How it works...

The master Namenode stores metadata and the slave node Datanode stores the blocks. When the Namenode is formatted, it creates a data structure that contains fsimage, edits, and VERSION. These are very important for the functioning of the cluster.
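To see these files after formatting, the metadata directory can be inspected. The following is a rough sketch, assuming the files live in a current subdirectory of the configured metadata directory and that edit log segments may only appear once the Namenode has been started. check_name_dir is a hypothetical helper, and the directory passed to it is whatever was configured in hdfs-site.xml.

```shell
#!/usr/bin/env bash
# Sketch: report which Namenode metadata files exist under a directory.
# check_name_dir is a hypothetical helper, not a Hadoop command.

check_name_dir() {
    local current="$1/current" status=0

    if [ -f "$current/VERSION" ]; then
        echo "VERSION: present"
    else
        echo "VERSION: missing"
        status=1
    fi

    if ls "$current"/fsimage_* >/dev/null 2>&1; then
        echo "fsimage: present"
    else
        echo "fsimage: missing"
        status=1
    fi

    # A missing edit log is not treated as an error here, because it may
    # not exist until the Namenode has run (an assumption of this sketch).
    if ls "$current"/edits_* >/dev/null 2>&1; then
        echo "edits: present"
    else
        echo "edits: none yet"
    fi

    return "$status"
}

# Run the check only when a directory is supplied, for example:
#   NAME_DIR=/path/to/namenode/metadata ./check_name_dir.sh
if [ -n "${NAME_DIR:-}" ]; then
    check_name_dir "$NAME_DIR"
fi
```

A non-zero return value indicates that VERSION or the fsimage is missing, which usually means the format step has not been run against that directory.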

The parameters dfs.data.dir and dfs.datanode.data.dir serve the same purpose, but belong to different Hadoop versions. The older parameters are deprecated in favor of the newer ones, but they still work. Likewise, the parameter dfs.name.dir has been deprecated in favor of dfs.namenode.name.dir in Hadoop 2.x. Both versions are shown to make the point that parameter names change between releases, so always check the release notes for the Hadoop version in use.
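As an illustration, the configuration from steps 20 to 24 could be written with the newer property names as in the sketch below. The port 9000 and the directories /data/namenode and /data/datanode are assumptions chosen for the example, and CONF_DIR is a scratch directory standing in for the real etc/hadoop directory. find_deprecated is a hypothetical helper that flags the older names mentioned above, plus fs.default.name, the deprecated predecessor of fs.defaultFS.

```shell
#!/usr/bin/env bash
# Sketch: write core-site.xml and hdfs-site.xml using Hadoop 2.x property
# names, then scan for deprecated names. Values are illustrative only.
CONF_DIR="$(mktemp -d)"   # stand-in for the real etc/hadoop directory

cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nn1.cluster1.com:9000</value>  <!-- port is an assumption -->
  </property>
</configuration>
EOF

cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/namenode</value>  <!-- assumed path -->
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/datanode</value>  <!-- assumed path -->
  </property>
</configuration>
EOF

# Print any lines that still use a deprecated property name.
# Returns 0 if at least one deprecated name is found, 1 otherwise.
find_deprecated() {
    grep -E -n '<name>(dfs\.name\.dir|dfs\.data\.dir|fs\.default\.name)</name>' "$@"
}

if find_deprecated "$CONF_DIR"/*.xml; then
    echo "deprecated property names found"
else
    echo "no deprecated property names found"
fi
```

Running the same scan against configuration files carried over from an older cluster is a quick way to find properties that should be renamed.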

There's more...

Setting up ResourceManager and NodeManager

In the preceding recipe, we set up the storage layer (that is, HDFS for storing data), but what about the processing layer? The data on HDFS needs to be processed using MapReduce, Tez, Spark, or another tool before meaningful decisions can be made from it. To run MapReduce, Spark, or another processing framework, we need the ResourceManager and NodeManager daemons.
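As a preview, both daemons can be started with the yarn-daemon.sh script in Hadoop's sbin directory, in the same way that hadoop-daemon.sh starts the Namenode and Datanode. The following is a minimal sketch that assumes the YARN configuration is already in place. start_yarn_daemons and the DRY_RUN switch are hypothetical conveniences that let the commands be printed without being executed.

```shell
#!/usr/bin/env bash
# Sketch: start the YARN daemons on a single node.
# start_yarn_daemons and DRY_RUN are hypothetical, not part of Hadoop.

start_yarn_daemons() {
    local daemon
    for daemon in resourcemanager nodemanager; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Only show what would be run.
            echo "yarn-daemon.sh start $daemon"
        elif command -v yarn-daemon.sh >/dev/null 2>&1; then
            yarn-daemon.sh start "$daemon"
        else
            echo "yarn-daemon.sh not found in PATH; cannot start $daemon" >&2
            return 1
        fi
    done
}

# Print the commands instead of running them.
DRY_RUN=1 start_yarn_daemons
```

With DRY_RUN unset, the function calls the real script, after which jps should list ResourceManager and NodeManager alongside the HDFS daemons.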