
Building a Hadoop-based Big Data platform


Hadoop was first developed as a Big Data processing system in 2006 at Yahoo! The idea originated from Google's MapReduce, which Google first described in a published paper about its proprietary MapReduce implementation. In the past few years, Hadoop has become a widely used platform and runtime environment for the deployment of Big Data applications. In this recipe, we will outline steps to build a Hadoop-based Big Data platform.

Getting ready

Hadoop was designed to be parallel and resilient. It redefines the way that data is managed and processed by leveraging the power of computing resources composed of commodity hardware, and it can automatically recover from failures.

How to do it…

Use the following steps to build a Hadoop-based Big Data platform:

  1. Design, implement, and deploy data collection or aggregation subsystems. The subsystems should transfer data from different data sources to Hadoop-compatible data storage systems such as HDFS and HBase, as sketched in the example following these steps.

    The subsystems need to be designed based on the input properties of a Big Data problem, including volume, velocity, and variety.

  2. Design, implement, and deploy a Hadoop-based Big Data processing platform. The platform should consume the Big Data located on HDFS or HBase and produce the expected and valuable output.

  3. Design, implement, and deploy result delivery subsystems. The delivery subsystems should transform the analytical results from a Hadoop-compatible format to a proper format for end users. For example, we can design web applications to visualize the analytical results using charts, graphs, or other types of dynamic web applications.
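
As a minimal illustration of step 1, the following Java sketch copies a local log file into HDFS using the org.apache.hadoop.fs.FileSystem API. The file paths and the NameNode address are placeholders assumed for illustration, not values prescribed by this recipe; newer Hadoop releases use the fs.defaultFS property instead of fs.default.name.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsIngest {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; adjust to your cluster.
        conf.set("fs.default.name", "hdfs://master:54310");
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into an HDFS directory (both paths are placeholders).
        fs.copyFromLocalFile(new Path("/tmp/weblog-2013-01-01.log"),
                             new Path("/data/raw/weblogs/"));
        fs.close();
      }
    }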

How it works…

The architecture of a Hadoop-based Big Data system can be described with the following chart:

Although Hadoop borrows its idea from Google's MapReduce, it is more than MapReduce. A typical Hadoop-based Big Data platform includes the Hadoop Distributed File System (HDFS), the parallel computing framework (MapReduce), common utilities, a column-oriented data storage table (HBase), high-level data management systems (Pig and Hive), a Big Data analytics library (Mahout), a distributed coordination system (ZooKeeper), a workflow management module (Oozie), data transfer modules such as Sqoop, data aggregation modules such as Flume, and data serialization modules such as Avro.

HDFS is the default filesystem of Hadoop. It was designed as a distributed filesystem that provides high-throughput access to application data. Data on HDFS is stored as data blocks. The data blocks are replicated on several computing nodes and their checksums are computed. In case of a checksum error or system failure, erroneous or lost data blocks can be recovered from replicas located on other nodes.
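
A client can inspect or adjust these properties through the same FileSystem API. The snippet below is a sketch, assuming a file that already exists on HDFS (the path is hypothetical): setReplication changes the replication factor of a single file, and getFileChecksum retrieves the checksum HDFS maintains for it.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationAndChecksum {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical HDFS path used for illustration.
        Path file = new Path("/data/raw/weblogs/weblog-2013-01-01.log");

        // Raise the replication factor of this file to 3 replicas.
        fs.setReplication(file, (short) 3);

        // Retrieve the checksum HDFS maintains for the file's blocks.
        FileChecksum checksum = fs.getFileChecksum(file);
        System.out.println("Checksum: " + checksum);
        fs.close();
      }
    }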

MapReduce provides a programming model that transforms complex computations into computations over a set of key-value pairs. It coordinates the processing of tasks on a cluster of nodes by scheduling jobs, monitoring activity, and re-executing failed tasks.
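
The canonical word-count example shows what this key-value model looks like in code. The following sketch uses the org.apache.hadoop.mapreduce API: the mapper emits (word, 1) pairs for every word in its input split, and the reducer sums the counts for each word.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      // Map phase: split each input line into words and emit (word, 1).
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reduce phase: sum the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable value : values) {
            sum += value.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }
    }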

In a typical MapReduce job, multiple map tasks on slave nodes are executed in parallel, generating results buffered on local machines. Once some or all of the map tasks have finished, the shuffle process begins, which aggregates the map task outputs by sorting and combining key-value pairs based on keys. The shuffled data partitions are then copied to the reducer machine(s), most commonly over the network. Next, reduce tasks run on the shuffled data and generate final (or intermediate, if multiple consecutive MapReduce jobs are pipelined) results. When a job finishes, the final results reside in multiple files, depending on the number of reducers used in the job. The anatomy of the job flow can be described in the following chart:
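
A driver class ties the two phases together and controls the number of reduce tasks, and therefore the number of output files. The following sketch reuses the WordCount classes from the previous snippet; the input and output paths are placeholders, and newer Hadoop releases prefer Job.getInstance(conf, name) over the Job constructor used here.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Two reducers produce two output files: part-r-00000 and part-r-00001.
        job.setNumReduceTasks(2);

        // Hypothetical input and output locations on HDFS.
        FileInputFormat.addInputPath(job, new Path("/data/raw/weblogs"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out/wordcount"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }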

There's more...

HDFS has two types of nodes, NameNode and DataNode. A NameNode keeps track of the filesystem metadata such as the locations of data blocks. For efficiency reasons, the metadata is kept in the main memory of a master machine. A DataNode holds physical data blocks and communicates with clients for data reading and writing. In addition, it periodically reports a list of its hosting blocks to the NameNode in the cluster for verification and validation purposes.
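
Because the NameNode holds this metadata, a client can ask which DataNodes host the blocks of a given file. The sketch below prints the hosts for each block of a hypothetical file; the path is an assumption for illustration.

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status =
            fs.getFileStatus(new Path("/data/raw/weblogs/weblog-2013-01-01.log"));

        // Ask the NameNode for the block metadata of the whole file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.println("offset=" + block.getOffset()
              + " length=" + block.getLength()
              + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
      }
    }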

The MapReduce framework has two types of nodes, master node and slave node. JobTracker is the daemon on a master node, and TaskTracker is the daemon on a slave node. The master node is the manager node of MapReduce jobs. It splits a job into smaller tasks, which will be assigned by the JobTracker to TaskTrackers on slave nodes to run. When a slave node receives a task, its TaskTracker will fork a Java process to run the task. Meanwhile, the TaskTracker is also responsible for tracking and reporting the progress of individual tasks.

Hadoop common

Hadoop common is a collection of components and interfaces for the foundation of Hadoop-based Big Data platforms. It provides the following components:

  • Distributed filesystem and I/O operation interfaces

  • General parallel computation interfaces

  • Logging

  • Security management
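
Most of these interfaces are reached through Hadoop common's Configuration class, which loads the cluster's XML configuration files and exposes typed getters and setters. A minimal sketch, assuming the default configuration files are on the classpath:

    import org.apache.hadoop.conf.Configuration;

    public class ConfigExample {
      public static void main(String[] args) {
        // Loads core-default.xml and core-site.xml from the classpath.
        Configuration conf = new Configuration();

        // Read the configured default filesystem (property name varies by release:
        // fs.default.name in Hadoop 1.x, fs.defaultFS in later versions).
        System.out.println("Default FS: " + conf.get("fs.default.name"));

        // Override a property programmatically for this client only.
        conf.setInt("io.file.buffer.size", 131072);
        System.out.println("Buffer size: " + conf.getInt("io.file.buffer.size", 4096));
      }
    }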

Apache HBase

Apache HBase is an open source, distributed, versioned, and column-oriented data store. It was built on top of Hadoop and HDFS. HBase supports random, real-time access to Big Data. It can scale to host very large tables, containing billions of rows and millions of columns. More documentation about HBase can be obtained from http://hbase.apache.org.
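
The following sketch shows the flavor of HBase's Java client by writing and reading a single cell. It assumes a table named weblogs with a column family stats already exists, and it uses the HTable-style API current at the time of writing; newer HBase releases use a Connection/Table API and Put.addColumn instead.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "weblogs"); // table is assumed to exist

        // Write one cell: row "2013-01-01", column stats:pageviews.
        Put put = new Put(Bytes.toBytes("2013-01-01"));
        put.add(Bytes.toBytes("stats"), Bytes.toBytes("pageviews"), Bytes.toBytes("12345"));
        table.put(put);

        // Random, real-time read of the same cell.
        Result result = table.get(new Get(Bytes.toBytes("2013-01-01")));
        byte[] value = result.getValue(Bytes.toBytes("stats"), Bytes.toBytes("pageviews"));
        System.out.println("pageviews = " + Bytes.toString(value));

        table.close();
      }
    }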

Apache Mahout

Apache Mahout is an open source scalable machine learning library based on Hadoop. It has a very active community and is still under development. Currently, the library supports four use cases: recommendation mining, clustering, classification, and frequent item set mining. More documentation of Mahout can be obtained from http://mahout.apache.org.
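
As an example of the recommendation-mining use case, the sketch below builds a simple user-based recommender with Mahout's Taste API. The input file and its contents are assumptions: a CSV of userID,itemID,preference triples.

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderExample {
      public static void main(String[] args) throws Exception {
        // ratings.csv (hypothetical): lines of userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top-3 recommendations for user 42.
        List<RecommendedItem> items = recommender.recommend(42, 3);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " scored " + item.getValue());
        }
      }
    }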

Apache Pig

Apache Pig is a high-level system for expressing Big Data analysis programs. It supports Big Data by compiling the Pig statements into a sequence of MapReduce jobs. Pig uses Pig Latin as the programming language, which is extensible and easy to use. More documentation about Pig can be found at http://pig.apache.org.
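
Pig Latin scripts are usually run from the Grunt shell or as script files, but Pig can also be embedded in Java through its PigServer class. The sketch below is illustrative only; the input path and field layout are assumptions.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigExample {
      public static void main(String[] args) throws Exception {
        // Run against the cluster; ExecType.LOCAL would run Pig locally for testing.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Pig Latin statements are compiled into MapReduce jobs when a STORE runs.
        pig.registerQuery("logs = LOAD '/data/raw/weblogs' USING PigStorage('\\t') "
            + "AS (ip:chararray, url:chararray, bytes:long);");
        pig.registerQuery("by_url = GROUP logs BY url;");
        pig.registerQuery("hits = FOREACH by_url GENERATE group, COUNT(logs);");
        pig.store("hits", "/data/out/hits_per_url");
      }
    }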

Apache Hive

Apache Hive is a high-level system for the management and analysis of Big Data stored in Hadoop-based systems. It uses a SQL-like language called HiveQL. Similar to Apache Pig, the Hive runtime engine translates HiveQL statements into a sequence of MapReduce jobs for execution. More information about Hive can be obtained from http://hive.apache.org.
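
Hive can be queried from Java over JDBC. The sketch below assumes a HiveServer2 instance listening on the default port 10000 and a table named weblogs; the driver class and connection URL differ for the older Hive server (org.apache.hadoop.hive.jdbc.HiveDriver and jdbc:hive://).

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
            "jdbc:hive2://master:10000/default", "hive", "");
        Statement stmt = con.createStatement();

        // HiveQL is translated into MapReduce jobs behind the scenes.
        ResultSet rs = stmt.executeQuery(
            "SELECT url, COUNT(*) AS hits FROM weblogs GROUP BY url");
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
      }
    }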

Apache ZooKeeper

Apache ZooKeeper is a centralized coordination service for large scale distributed systems. It maintains the configuration and naming information and provides distributed synchronization and group services for applications in distributed systems. More documentation about ZooKeeper can be obtained from http://zookeeper.apache.org.
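
The ZooKeeper Java client exposes this coordination model as a small tree of znodes. The sketch below is a minimal example, assuming a ZooKeeper ensemble reachable at master:2181; it publishes a piece of shared configuration as a znode and reads it back.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperExample {
      public static void main(String[] args) throws Exception {
        // Connect with a 3-second session timeout; the watcher ignores events here.
        ZooKeeper zk = new ZooKeeper("master:2181", 3000, new Watcher() {
          public void process(WatchedEvent event) { }
        });

        // Publish shared configuration as a persistent znode (name is hypothetical).
        if (zk.exists("/demo-config", false) == null) {
          zk.create("/demo-config", "batch.size=64".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any process in the cluster can now read the same value.
        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));
        zk.close();
      }
    }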

Apache Oozie

Apache Oozie is a scalable workflow management and coordination service for Hadoop jobs. It is data aware and coordinates jobs based on their dependencies. In addition, Oozie has been integrated with Hadoop and can support all types of Hadoop jobs. More information about Oozie can be obtained from http://oozie.apache.org.
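
Besides its command-line client, Oozie exposes a Java client API. The sketch below is an assumed example: it submits a workflow whose application path, JobTracker, and NameNode addresses are placeholders for values from your own cluster.

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class OozieExample {
      public static void main(String[] args) throws Exception {
        OozieClient client = new OozieClient("http://master:11000/oozie");

        // Workflow properties; paths and addresses are placeholders.
        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://master:54310/user/hadoop/workflows/etl");
        conf.setProperty("jobTracker", "master:54311");
        conf.setProperty("nameNode", "hdfs://master:54310");

        // Submit and start the workflow, then poll its status.
        String jobId = client.run(conf);
        WorkflowJob job = client.getJobInfo(jobId);
        System.out.println("Workflow " + jobId + " is " + job.getStatus());
      }
    }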

Apache Sqoop

Apache Sqoop is a tool for moving data between Apache Hadoop and structured data stores such as relational databases. It provides command-line tools to transfer data from relational databases to HDFS and vice versa. More information about Apache Sqoop can be found at http://sqoop.apache.org.

Apache Flume

Apache Flume is a tool for collecting log data in distributed systems. It has a flexible yet robust and fault-tolerant architecture that streams data from log servers to Hadoop. More information can be obtained from http://flume.apache.org.

Apache Avro

Apache Avro is a fast, feature-rich data serialization system for Hadoop. The serialized data is coupled with the data schema, which facilitates its processing with different programming languages. More information about Apache Avro can be found at http://avro.apache.org.
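
The sketch below shows the Avro Java API for the generic case: a schema is defined as JSON, a record is serialized into a container file together with that schema, and then read back without any generated classes. The schema and file name are made up for illustration.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroExample {
      private static final String SCHEMA_JSON =
          "{\"type\":\"record\",\"name\":\"WebLog\",\"fields\":["
          + "{\"name\":\"ip\",\"type\":\"string\"},"
          + "{\"name\":\"bytes\",\"type\":\"long\"}]}";

      public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        // Write one record; the schema travels inside the container file.
        GenericRecord record = new GenericData.Record(schema);
        record.put("ip", "10.0.0.1");
        record.put("bytes", 2048L);
        File file = new File("weblog.avro");
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, file);
        writer.append(record);
        writer.close();

        // Read it back; the reader picks the schema up from the file itself.
        DataFileReader<GenericRecord> reader =
            new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
        while (reader.hasNext()) {
          System.out.println(reader.next());
        }
        reader.close();
      }
    }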