Hadoop 2.x Administration Cookbook

By: Aman Singh

Overview of this book

Hadoop enables the distributed storage and processing of large datasets across clusters of computers. Learning how to administer Hadoop is crucial to exploiting its unique features. With this book, you will be able to overcome common problems encountered in Hadoop administration. The book begins by laying the foundation, showing you the steps needed to set up a Hadoop cluster and its various nodes. You will get a better understanding of how to maintain a Hadoop cluster, especially on the HDFS layer and with YARN and MapReduce. Further on, you will explore the durability and high availability of a Hadoop cluster. You'll get a better understanding of the schedulers in Hadoop and how to configure and use them for your tasks. You will also get hands-on experience with the backup and recovery options and the performance tuning aspects of Hadoop. Finally, you will get a better understanding of troubleshooting, diagnostics, and best practices in Hadoop administration. By the end of this book, you will have a proper understanding of working with Hadoop clusters and will be able to secure and encrypt them, as well as configure auditing for your Hadoop clusters.

Adding nodes to the cluster


Over a period of time, the data in our cluster will grow, and we will need to increase the capacity of the cluster by adding more nodes.

We can add Datanodes to the cluster in the same way that we first configured a Datanode and started the Datanode daemon on it. The important thing to keep in mind, however, is to control which nodes can be part of the cluster. It should not be possible for just anyone to start a Datanode daemon on their laptop and join the cluster, as that would be disastrous. By default, there is nothing preventing any node from becoming a Datanode: a user only has to untar the Hadoop package, point the file "core-site.xml" to the Namenode, and start the Datanode daemon.
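
As a minimal sketch, this is all a node needs in core-site.xml to attempt to join the cluster; the Namenode hostname and RPC port below are assumptions for illustration and must match your own setup:

    <!-- assumed Namenode host and RPC port; adjust to your cluster -->
    <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nn1.cluster1.com:9000</value>
    </property>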

Getting ready

For the following steps, we assume that the cluster is up and running, that its Datanodes are in a healthy state, and that we need to add a new Datanode to the cluster. We will log in to the Namenode and make the changes there.
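
Before making any changes, a quick way to verify that the cluster is healthy (an optional check, not one of the steps in this recipe) is to look at the Datanode report and run a filesystem check from the Namenode:

    $ hdfs dfsadmin -report
    $ hdfs fsck /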

How to do it...

  1. ssh to the Namenode and edit the file hdfs-site.xml to add the following property to it:

    <property>
    <name>dfs.hosts</name>
    <value>/home/hadoop/includes</value>
    <final>true</final>
    </property>
  2. Make sure the file includes is readable by the user hadoop (a consolidated sketch of the whole flow, including this step, follows this list).

  3. Restart the Namenode daemon for the property to take effect:

    $ hadoop-daemon.sh stop namenode
    $ hadoop-daemon.sh start namenode
    
  4. A restart of the Namenode is required only when a property in hdfs-site.xml is changed. Once the dfs.hosts property is in place, the Namenode can pick up changes to the contents of the includes file by simply refreshing the nodes, without a restart.

  5. Add the dn1.cluster1.com node to the file includes:

    $ cat includes
    dn1.cluster1.com
    
  6. The includes and excludes files can each contain a list of multiple nodes, one node per line.

  7. After adding the node to the file, we just need to reload the file by entering the following:

    $ hdfs dfsadmin -refreshNodes
    
  8. After some time, the node will be available in the cluster and can be seen in the Datanode report:

    $ hdfs dfsadmin -report
    

How it works...

The file /home/hadoop/includes contains a list of all the Datanodes that are allowed to join the cluster. If the includes file is blank, then all Datanodes are allowed to join the cluster. If both an includes and an excludes file are configured, the lists of nodes in them must be mutually exclusive. So, to decommission the node dn1.cluster1.com from the cluster, it must be removed from the includes file and added to the excludes file.
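
The following is a rough sketch of decommissioning dn1.cluster1.com, assuming that dfs.hosts.exclude has already been pointed at an excludes file in the same way that dfs.hosts was pointed at the includes file:

    # Remove the node from the includes file and add it to the excludes file (paths assumed as above)
    $ sed -i '/dn1.cluster1.com/d' /home/hadoop/includes
    $ echo "dn1.cluster1.com" >> /home/hadoop/excludes
    $ hdfs dfsadmin -refreshNodes
    $ hdfs dfsadmin -report   # the node typically shows as decommissioning, then decommissioned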

There's more...

In addition to controlling which nodes can join the cluster, as described here, production deployments typically also have firewall rules and separate VLANs for Hadoop clusters to keep the traffic and data isolated.
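
As a purely illustrative sketch (the subnet is an assumption, and 50010 and 50020 are the Hadoop 2.x default Datanode data-transfer and IPC ports), iptables rules on a Datanode could restrict that traffic to the cluster subnet:

    # Assumed cluster subnet 10.0.0.0/24; allow cluster hosts, drop everything else
    $ iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 50010 -j ACCEPT
    $ iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 50020 -j ACCEPT
    $ iptables -A INPUT -p tcp --dport 50010 -j DROP
    $ iptables -A INPUT -p tcp --dport 50020 -j DROP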