Installing a multi-node cluster


In the previous recipes, we looked at how to configure a single-node Hadoop cluster, also referred to as a pseudo-distributed cluster. In this recipe, we will set up a fully distributed cluster, with the daemons spread across dedicated nodes.

There will be one node for the Namenode, one for the ResourceManager, and four nodes for the Datanodes and NodeManagers. In production, the number of Datanodes can run into the thousands, but here we are restricted to four. The Datanode and NodeManager daemons coexist on the same nodes to exploit data locality: YARN can then schedule tasks on the nodes that already hold the HDFS blocks those tasks read.

Getting ready

Make sure that the six nodes you choose have the JDK installed and that name resolution works between them. This can be done by making entries in the /etc/hosts file on each node or by using DNS; a sample hosts file is shown after the node list below.

In this recipe, we are using the following nodes:

  • Namenode: nn1.cluster1.com

  • ResourceManager: jt1.cluster1.com

  • Datanodes and NodeManagers: dn[1-4].cluster1.com
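
For example, the /etc/hosts entries for this cluster could look like the following. The Datanode IPs match the ones used later in this recipe; the Namenode and ResourceManager IPs are assumptions and must be adapted to your network:

    # nn1/jt1 addresses are assumed; dn1-dn4 match the scp loop in this recipe
    192.168.1.70    nn1.cluster1.com    nn1
    192.168.1.71    jt1.cluster1.com    jt1
    192.168.1.72    dn1.cluster1.com    dn1
    192.168.1.73    dn2.cluster1.com    dn2
    192.168.1.74    dn3.cluster1.com    dn3
    192.168.1.75    dn4.cluster1.com    dn4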

How to do it...

  1. Make sure all the nodes have the hadoop user.

  2. Create the directory structure /opt/cluster on all the nodes.

  3. Make sure the ownership is correct for /opt/cluster.
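
Steps 1 to 3 can be carried out as root on each node; a minimal sketch (the hadoop:hadoop ownership assumes a hadoop group of the same name, which useradd creates by default on most distributions):

    # useradd hadoop
    # mkdir -p /opt/cluster
    # chown -R hadoop:hadoop /opt/cluster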

  4. Copy the /opt/cluster/hadoop-2.7.3 directory from nn1.cluster1.com to all the nodes in the cluster:

    $ for i in 192.168.1.{72..75}; do scp -r /opt/cluster/hadoop-2.7.3 $i:/opt/cluster/; done
    
  5. The preceding IPs belong to the four Datanodes; modify them to match your environment, and note that the ResourceManager node jt1 needs the Hadoop files as well. Also, to avoid being prompted for a password on each node, it is good to set up passphraseless SSH access between the nodes.
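
One way to set up passphraseless access, assuming the commands are run as the hadoop user on nn1 and the same four Datanode IPs:

    $ ssh-keygen -t rsa -P ""
    $ for i in 192.168.1.{72..75}; do ssh-copy-id $i; done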

  6. Change to the directory /opt/cluster and create a symbolic link on each node:

    $ ln -s hadoop-2.7.3 hadoop
    
  7. Make sure that the environment variables have been set up on all nodes:

    $ . /etc/profile.d/hadoopenv.sh
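
The script itself was created in the earlier recipes; a minimal sketch of what it is expected to export (the JAVA_HOME path is an assumption and must match your JDK install):

    export JAVA_HOME=/usr/java/latest
    export HADOOP_HOME=/opt/cluster/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin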
    
  8. On the Namenode, only the parameters specific to its role are needed.

  9. The file core-site.xml remains the same across all nodes in the cluster.
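
A minimal core-site.xml for this cluster might look like the following; port 9000 is a common choice for fs.defaultFS and is an assumption here:

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <!-- All nodes point at the Namenode; port 9000 is assumed -->
        <value>hdfs://nn1.cluster1.com:9000</value>
      </property>
    </configuration>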

  10. On the Namenode, the file hdfs-site.xml changes as follows:
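
A minimal sketch of the Namenode-side settings; the metadata path /opt/cluster/nn is chosen for illustration and must exist on nn1:

    <configuration>
      <property>
        <name>dfs.namenode.name.dir</name>
        <!-- Assumed directory for the Namenode metadata -->
        <value>/opt/cluster/nn</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>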

  11. On the Datanodes, the file hdfs-site.xml changes as follows:
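
A minimal sketch for the Datanodes; the block-storage path /opt/cluster/dn is again an illustrative assumption:

    <configuration>
      <property>
        <name>dfs.datanode.data.dir</name>
        <!-- Assumed directory where each Datanode stores blocks -->
        <value>/opt/cluster/dn</value>
      </property>
    </configuration>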

  12. On the Datanodes, the file yarn-site.xml changes as follows:
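
A minimal sketch for the NodeManager side, pointing the nodes at the ResourceManager and enabling the MapReduce shuffle service:

    <configuration>
      <property>
        <!-- Tells each NodeManager where the ResourceManager runs -->
        <name>yarn.resourcemanager.hostname</name>
        <value>jt1.cluster1.com</value>
      </property>
      <property>
        <!-- Auxiliary service required by MapReduce jobs -->
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
    </configuration>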

  13. On the node jt1, which is the ResourceManager, the file yarn-site.xml is as follows:
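
A minimal sketch for jt1; setting yarn.resourcemanager.hostname lets the ResourceManager derive its bind addresses from this one value:

    <configuration>
      <property>
        <!-- The ResourceManager's own hostname -->
        <name>yarn.resourcemanager.hostname</name>
        <value>jt1.cluster1.com</value>
      </property>
    </configuration>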

  14. To start Namenode on nn1.cluster1.com, enter the following:

    $ hadoop-daemon.sh start namenode
    
  15. To start Datanode and NodeManager on dn[1-4], enter the following:

    $ hadoop-daemon.sh start datanode
    $ yarn-daemon.sh start nodemanager
    
  16. To start ResourceManager on jt1.cluster1.com, enter the following:

    $ yarn-daemon.sh start resourcemanager
    
  17. On each node, execute the command jps to see the daemons running on them. Make sure you have the correct services running on each node.
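
For example, on nn1 the output should list the NameNode daemon (the process IDs shown here are arbitrary):

    $ jps
    2025 NameNode
    3417 Jps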

  18. Create a text file test.txt and copy it to HDFS using hadoop fs -put test.txt /. This confirms that HDFS is working fine.

  19. To verify that YARN has been set up correctly, run the pi estimation program from the bundled examples JAR:

    $ yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 3 3
    

How it works...

Steps 1 through 7 copy the already extracted and configured Hadoop files to other nodes in the cluster. From step 8 onwards, each node is configured according to the role it plays in the cluster.

The user should see four Datanodes reporting to the cluster, and should also be able to access the Namenode web UI on port 50070 and the ResourceManager web UI on port 8088; with the hostnames used in this recipe, these are http://nn1.cluster1.com:50070 and http://jt1.cluster1.com:8088.

To see the number of nodes talking to Namenode, enter the following:

$ hdfs dfsadmin -report
  Configured Capacity: 9124708352 (8.50 GB)
  Present Capacity: 5923942400 (5.52 GB)
  DFS Remaining: 5923938304 (5.52 GB)
  DFS Used: 4096 (4 KB)
  DFS Used%: 0.00%
  Live datanodes (4):

The same information can also be retrieved using the Namenode web UI.

Note

The user can configure a custom port for any service, but there should be a good reason to change the defaults.