
Building a Hadoop-based Big Data platform


Hadoop was first developed as a Big Data processing system in 2006 at Yahoo! The idea originated from Google's MapReduce, which Google first described in a published paper about its proprietary MapReduce implementation. In the past few years, Hadoop has become a widely used platform and runtime environment for the deployment of Big Data applications. In this recipe, we will outline steps to build a Hadoop-based Big Data platform.

Getting ready

Hadoop was designed to be parallel and resilient. It redefines the way that data is managed and processed by leveraging the power of computing resources composed of commodity hardware, and it can automatically recover from failures.

How to do it…

Use the following steps to build a Hadoop-based Big Data platform:

  1. Design, implement, and deploy data collection or aggregation subsystems. The subsystems should transfer data from different data sources to Hadoop-compatible data storage systems such as HDFS and HBase, as sketched in the example following these steps.

    The subsystems need to be designed based on the input properties of a Big Data problem, including volume, velocity, and variety.

  2. Design, implement, and deploy a Hadoop-based Big Data processing platform. The platform should consume the Big Data located on HDFS or HBase and produce the expected and valuable output.

  3. Design, implement, and deploy result delivery subsystems. The delivery subsystems should transform the analytical results from a Hadoop-compatible format to a proper format for end users. For example, we can design web applications to visualize the analytical results using charts, graphs, or other types of dynamic web applications.
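
As a minimal illustration of step 1, the following Java sketch copies a local log file into HDFS using the org.apache.hadoop.fs.FileSystem API. The file paths and the NameNode address are placeholders assumed for illustration, not values prescribed by this recipe; newer Hadoop releases use the fs.defaultFS property instead of fs.default.name.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsIngest {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; adjust to your cluster.
        conf.set("fs.default.name", "hdfs://master:54310");
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into an HDFS directory (both paths are placeholders).
        fs.copyFromLocalFile(new Path("/tmp/weblog-2013-01-01.log"),
                             new Path("/data/raw/weblogs/"));
        fs.close();
      }
    }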

How it works…

The architecture of a Hadoop-based Big Data system can be described with the following chart:

Although Hadoop borrows its idea from Google's MapReduce, it is more than MapReduce. A typical Hadoop-based Big Data platform includes the Hadoop Distributed File System (HDFS), the parallel computing framework (MapReduce), common utilities, a column-oriented data storage table (HBase), high-level data management systems (Pig and Hive), a Big Data analytics library (Mahout), a distributed coordination system (ZooKeeper), a workflow management module (Oozie), data transfer modules such as Sqoop, data aggregation modules such as Flume, and data serialization modules such as Avro.

HDFS is the default filesystem of Hadoop. It was designed as a distributed filesystem that provides high-throughput access to application data. Data on HDFS is stored as data blocks. The data blocks are replicated on several computing nodes and their checksums are computed. In case of a checksum error or system failure, erroneous or lost data blocks can be recovered from replicas located on other nodes.
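
A client can inspect or adjust these properties through the same FileSystem API. The snippet below is a sketch, assuming a file that already exists on HDFS (the path is hypothetical): setReplication changes the replication factor of a single file, and getFileChecksum retrieves the checksum HDFS maintains for it.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationAndChecksum {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical HDFS path used for illustration.
        Path file = new Path("/data/raw/weblogs/weblog-2013-01-01.log");

        // Raise the replication factor of this file to 3 replicas.
        fs.setReplication(file, (short) 3);

        // Retrieve the checksum HDFS maintains for the file's blocks.
        FileChecksum checksum = fs.getFileChecksum(file);
        System.out.println("Checksum: " + checksum);
        fs.close();
      }
    }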

MapReduce provides a programming model that transforms complex computations into computations over a set of key-value pairs. It coordinates the processing of tasks on a cluster of nodes by scheduling jobs, monitoring activity, and re-executing failed tasks.
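
The canonical word-count example shows what this key-value model looks like in code. The following sketch uses the org.apache.hadoop.mapreduce API: the mapper emits (word, 1) pairs for every word in its input split, and the reducer sums the counts for each word.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      // Map phase: split each input line into words and emit (word, 1).
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reduce phase: sum the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable value : values) {
            sum += value.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }
    }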

In a typical MapReduce job, multiple map tasks on slave nodes are executed in parallel, generating results buffered on local machines. Once some or all of the map tasks have finished, the shuffle process begins, which aggregates the map task outputs by sorting and combining key-value pairs based on keys. The shuffled data partitions are then copied to the reducer machine(s), most commonly over the network. Next, reduce tasks run on the shuffled data and generate final (or intermediate, if multiple consecutive MapReduce jobs are pipelined) results. When a job finishes, the final results reside in multiple files, depending on the number of reducers used in the job. The anatomy of the job flow can be described in the following chart:
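
A driver class ties the two phases together and controls the number of reduce tasks, and therefore the number of output files. The following sketch reuses the WordCount classes from the previous snippet; the input and output paths are placeholders, and newer Hadoop releases prefer Job.getInstance(conf, name) over the Job constructor used here.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Two reducers produce two output files: part-r-00000 and part-r-00001.
        job.setNumReduceTasks(2);

        // Hypothetical input and output locations on HDFS.
        FileInputFormat.addInputPath(job, new Path("/data/raw/weblogs"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out/wordcount"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }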

There's more...

HDFS has two types of nodes, NameNode and DataNode. A NameNode keeps track of the filesystem metadata such as the locations of data blocks. For efficiency reasons, the metadata is kept in the main memory of a master machine. A DataNode holds physical data blocks and communicates with clients for data reading and writing. In addition, it periodically reports a list of its hosting blocks to the NameNode in the cluster for verification and validation purposes.
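
Because the NameNode holds this metadata, a client can ask which DataNodes host the blocks of a given file. The sketch below prints the hosts for each block of a hypothetical file; the path is an assumption for illustration.

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status =
            fs.getFileStatus(new Path("/data/raw/weblogs/weblog-2013-01-01.log"));

        // Ask the NameNode for the block metadata of the whole file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.println("offset=" + block.getOffset()
              + " length=" + block.getLength()
              + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
      }
    }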

The MapReduce framework has two types of nodes, master node and slave node. JobTracker is the daemon on a master node, and TaskTracker is the daemon on a slave node. The master node is the manager node of MapReduce jobs. It splits a job into smaller tasks, which will be assigned by the JobTracker to TaskTrackers on slave nodes to run. When a slave node receives a task, its TaskTracker will fork a Java process to run the task. Meanwhile, the TaskTracker is also responsible for tracking and reporting the progress of individual tasks.

Hadoop common

Hadoop common is a collection of components and interfaces for the foundation of Hadoop-based Big Data platforms. It provides the following components:

  • Distributed filesystem and I/O operation interfaces

  • General parallel computation interfaces

  • Logging

  • Security management
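
Most of these interfaces are reached through Hadoop common's Configuration class, which loads the cluster's XML configuration files and exposes typed getters and setters. A minimal sketch, assuming the default configuration files are on the classpath:

    import org.apache.hadoop.conf.Configuration;

    public class ConfigExample {
      public static void main(String[] args) {
        // Loads core-default.xml and core-site.xml from the classpath.
        Configuration conf = new Configuration();

        // Read the configured default filesystem (property name varies by release:
        // fs.default.name in Hadoop 1.x, fs.defaultFS in later versions).
        System.out.println("Default FS: " + conf.get("fs.default.name"));

        // Override a property programmatically for this client only.
        conf.setInt("io.file.buffer.size", 131072);
        System.out.println("Buffer size: " + conf.getInt("io.file.buffer.size", 4096));
      }
    }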

Apache HBase

Apache HBase is an open source, distributed, versioned, and column-oriented data store. It was built on top of Hadoop and HDFS. HBase supports random, real-time access to Big Data. It can scale to host very large tables, containing billions of rows and millions of columns. More documentation about HBase can be obtained from http://hbase.apache.org.
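
The following sketch shows the flavor of HBase's Java client by writing and reading a single cell. It assumes a table named weblogs with a column family stats already exists, and it uses the HTable-style API current at the time of writing; newer HBase releases use a Connection/Table API and Put.addColumn instead.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "weblogs"); // table is assumed to exist

        // Write one cell: row "2013-01-01", column stats:pageviews.
        Put put = new Put(Bytes.toBytes("2013-01-01"));
        put.add(Bytes.toBytes("stats"), Bytes.toBytes("pageviews"), Bytes.toBytes("12345"));
        table.put(put);

        // Random, real-time read of the same cell.
        Result result = table.get(new Get(Bytes.toBytes("2013-01-01")));
        byte[] value = result.getValue(Bytes.toBytes("stats"), Bytes.toBytes("pageviews"));
        System.out.println("pageviews = " + Bytes.toString(value));

        table.close();
      }
    }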

Apache Mahout

Apache Mahout is an open source scalable machine learning library based on Hadoop. It has a very active community and is still under development. Currently, the library supports four use cases: recommendation mining, clustering, classification, and frequent item set mining. More documentation of Mahout can be obtained from http://mahout.apache.org.
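
As an example of the recommendation-mining use case, the sketch below builds a simple user-based recommender with Mahout's Taste API. The input file and its contents are assumptions: a CSV of userID,itemID,preference triples.

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderExample {
      public static void main(String[] args) throws Exception {
        // ratings.csv (hypothetical): lines of userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top-3 recommendations for user 42.
        List<RecommendedItem> items = recommender.recommend(42, 3);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " scored " + item.getValue());
        }
      }
    }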

Apache Pig

Apache Pig is a high-level system for expressing Big Data analysis programs. It supports Big Data by compiling the Pig statements into a sequence of MapReduce jobs. Pig uses Pig Latin as the programming language, which is extensible and easy to use. More documentation about Pig can be found at http://pig.apache.org.
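
Pig Latin scripts are usually run from the Grunt shell or as script files, but Pig can also be embedded in Java through its PigServer class. The sketch below is illustrative only; the input path and field layout are assumptions.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigExample {
      public static void main(String[] args) throws Exception {
        // Run against the cluster; ExecType.LOCAL would run Pig locally for testing.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Pig Latin statements are compiled into MapReduce jobs when a STORE runs.
        pig.registerQuery("logs = LOAD '/data/raw/weblogs' USING PigStorage('\\t') "
            + "AS (ip:chararray, url:chararray, bytes:long);");
        pig.registerQuery("by_url = GROUP logs BY url;");
        pig.registerQuery("hits = FOREACH by_url GENERATE group, COUNT(logs);");
        pig.store("hits", "/data/out/hits_per_url");
      }
    }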

Apache Hive

Apache Hive is a high-level system for the management and analysis of Big Data stored in Hadoop-based systems. It uses a SQL-like language called HiveQL. Similar to Apache Pig, the Hive runtime engine translates HiveQL statements into a sequence of MapReduce jobs for execution. More information about Hive can be obtained from http://hive.apache.org.
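
Hive can be queried from Java over JDBC. The sketch below assumes a HiveServer2 instance listening on the default port 10000 and a table named weblogs; the driver class and connection URL differ for the older Hive server (org.apache.hadoop.hive.jdbc.HiveDriver and jdbc:hive://).

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
            "jdbc:hive2://master:10000/default", "hive", "");
        Statement stmt = con.createStatement();

        // HiveQL is translated into MapReduce jobs behind the scenes.
        ResultSet rs = stmt.executeQuery(
            "SELECT url, COUNT(*) AS hits FROM weblogs GROUP BY url");
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
      }
    }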

Apache ZooKeeper

Apache ZooKeeper is a centralized coordination service for large scale distributed systems. It maintains the configuration and naming information and provides distributed synchronization and group services for applications in distributed systems. More documentation about ZooKeeper can be obtained from http://zookeeper.apache.org.
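
The ZooKeeper Java client exposes this coordination model as a small tree of znodes. The sketch below is a minimal example, assuming a ZooKeeper ensemble reachable at master:2181; it publishes a piece of shared configuration as a znode and reads it back.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperExample {
      public static void main(String[] args) throws Exception {
        // Connect with a 3-second session timeout; the watcher ignores events here.
        ZooKeeper zk = new ZooKeeper("master:2181", 3000, new Watcher() {
          public void process(WatchedEvent event) { }
        });

        // Publish shared configuration as a persistent znode (name is hypothetical).
        if (zk.exists("/demo-config", false) == null) {
          zk.create("/demo-config", "batch.size=64".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any process in the cluster can now read the same value.
        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));
        zk.close();
      }
    }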

Apache Oozie

Apache Oozie is a scalable workflow management and coordination service for Hadoop jobs. It is data aware and coordinates jobs based on their dependencies. In addition, Oozie has been integrated with Hadoop and can support all types of Hadoop jobs. More information about Oozie can be obtained from http://oozie.apache.org.
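
Besides its command-line client, Oozie exposes a Java client API. The sketch below is an assumed example: it submits a workflow whose application path, JobTracker, and NameNode addresses are placeholders for values from your own cluster.

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class OozieExample {
      public static void main(String[] args) throws Exception {
        OozieClient client = new OozieClient("http://master:11000/oozie");

        // Workflow properties; paths and addresses are placeholders.
        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://master:54310/user/hadoop/workflows/etl");
        conf.setProperty("jobTracker", "master:54311");
        conf.setProperty("nameNode", "hdfs://master:54310");

        // Submit and start the workflow, then poll its status.
        String jobId = client.run(conf);
        WorkflowJob job = client.getJobInfo(jobId);
        System.out.println("Workflow " + jobId + " is " + job.getStatus());
      }
    }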

Apache Sqoop

Apache Sqoop is a tool for moving data between Apache Hadoop and structured data stores such as relational databases. It provides command-line tools to transfer data from relational databases to HDFS and vice versa. More information about Apache Sqoop can be found at http://sqoop.apache.org.

Apache Flume

Apache Flume is a tool for collecting log data in distributed systems. It has a flexible yet robust and fault-tolerant architecture that streams data from log servers to Hadoop. More information can be obtained from http://flume.apache.org.

Apache Avro

Apache Avro is a fast, feature-rich data serialization system for Hadoop. The serialized data is coupled with the data schema, which facilitates its processing with different programming languages. More information about Apache Avro can be found at http://avro.apache.org.
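
The sketch below shows the Avro Java API for the generic case: a schema is defined as JSON, a record is serialized into a container file together with that schema, and then read back without any generated classes. The schema and file name are made up for illustration.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroExample {
      private static final String SCHEMA_JSON =
          "{\"type\":\"record\",\"name\":\"WebLog\",\"fields\":["
          + "{\"name\":\"ip\",\"type\":\"string\"},"
          + "{\"name\":\"bytes\",\"type\":\"long\"}]}";

      public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        // Write one record; the schema travels inside the container file.
        GenericRecord record = new GenericData.Record(schema);
        record.put("ip", "10.0.0.1");
        record.put("bytes", 2048L);
        File file = new File("weblog.avro");
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, file);
        writer.append(record);
        writer.close();

        // Read it back; the reader picks the schema up from the file itself.
        DataFileReader<GenericRecord> reader =
            new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
        while (reader.hasNext()) {
          System.out.println(reader.next());
        }
        reader.close();
      }
    }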