Book Image

Apache Hadoop 3 Quick Start Guide

By : Hrishikesh Vijay Karambelkar
Book Image

Apache Hadoop 3 Quick Start Guide

By: Hrishikesh Vijay Karambelkar

Overview of this book

Apache Hadoop is a widely used distributed data platform. It enables large datasets to be efficiently processed instead of using one large computer to store and process the data. This book will get you started with the Hadoop ecosystem, and introduce you to the main technical topics, including MapReduce, YARN, and HDFS. The book begins with an overview of big data and Apache Hadoop. Then, you will set up a pseudo Hadoop development environment and a multi-node enterprise Hadoop cluster. You will see how the parallel programming paradigm, such as MapReduce, can solve many complex data processing problems. The book also covers the important aspects of the big data software development lifecycle, including quality assurance and control, performance, administration, and monitoring. You will then learn about the Hadoop ecosystem, and tools such as Kafka, Sqoop, Flume, Pig, Hive, and HBase. Finally, you will look at advanced topics, including real time streaming using Apache Storm, and data analytics using Apache Spark. By the end of the book, you will be well versed with different configurations of the Hadoop 3 cluster.
Table of Contents (10 chapters)

How Apache Hadoop works

The Apache Hadoop framework works on a cluster of nodes. These nodes can be either virtual machines or physical servers. The Hadoop framework is designed to work seamlessly on all types of these systems. The core of Apache Hadoop is based on Java. Each of the components in the Apache Hadoop framework performs different operations. Apache Hadoop is comprised of the following key modules, which work across HDFS, MapReduce, and YARN to provide a truly distributed experience to the applications. The following diagram shows the overall big picture of the Apache Hadoop cluster with key components:

Let's go over the following key components and understand what role they play in the overall architecture:

  • Resource Manager
  • Node Manager
  • YARN Timeline Service
  • NameNode
  • DataNode

Resource Manager

Resource Manager is a key component in the YARN ecosystem. It was introduced in Hadoop 2.X, replacing JobTracker (MapReduce version 1.X). There is one Resource Manager per cluster. Resource Manager knows the location of all slaves in the cluster and their resources, which includes information such as GPUs (Hadoop 3.X), CPU, and memory that is needed for execution of an application. Resource Manager acts as a proxy between the client and all other Hadoop nodes. The following diagram depicts the overall capabilities of Resource Manager:

YARN resource manager handles all RPC such as services that allow clients to submit their jobs for execution and obtain information about clusters and queues and termination of jobs. In addition to regular client requests, it provides separate administration services, which get priorities over normal services. Similarly, it also keeps track of available resources and heartbeats from Hadoop nodes. Resource Manager communicates with Application Masters to manage registration/termination of an Application Master, as well as checking health. Resource Manager can be communicated through the following mechanisms:

  • RESTful APIs
  • User interface (New Web UI)
  • Command-line interface (CLI)

These APIs provide information such as cluster health, performance index on a cluster, and application-specific information. Application Manager is the primary interacting point for managing all submitted applications. YARN Schedule is primarily used to schedule jobs with different strategies. It supports strategies such as capacity scheduling and fair scheduling for running applications. Another new feature of resource manager is to provide a fail-over with near zero downtime for all users. We will be looking at more details on resource manager in Chapter 5, Building Rich YARN Applications on YARN.

Node Manager

As the name suggests, Node Manager runs on each of the Hadoop slave nodes participating in the cluster. This means that there could many Node Managers present in a cluster when that cluster is running with several nodes. The following diagram depicts key functions performed by Node Manager:

Node Manager runs different services to determine and share the health of the node. If any services fail to run on a node, Node Manager marks it as unhealthy and reports it back to resource manager. In addition to managing the life cycles of nodes, it also looks at available resources, which include memory and CPU. On startup, Node Manager registers itself to resource manager and sends information about resource availability. One of the key responsibilities of Node Manager is to manage containers running on a node through its Container Manager. These activities involve starting a new container when a request is received from Application Master and logging the operations performed on container. It also keeps tabs on the health of the node.

Application Master is responsible for running one single application. It is initiated based on the new application submitted to a Hadoop cluster. When a request to execute an application is received, it demands container availability from resource manager to execute a specific program. Application Master is aware of execution logic and it is usually specific for frameworks. For example, Apache Hadoop MapReduce has its own implementation of Application Master.

YARN Timeline Service version 2

This service is responsible for collecting different metric data through its timeline collectors, which run in a distributed manner across Hadoop cluster. This collected information is then written back to storage. These collectors exist along with Application Masters—one per application. Similar to Application Manager, resource managers also utilize these timeline collectors to log metric information in the system. YARN Timeline Server version 2.X provides a RESTful API service to allow users to run queries for getting insights on this data. It supports aggregation of information. Timeline Server V2 utilizes Apache HBase as storage for these metrics by default, however, users can choose to change it.

NameNode

NameNode is the gatekeeper for all HDFS-related queries. It serves as a single point for all types of coordination on HDFS data, which is distributed across multiple nodes. NameNode works as a registry to maintain data blocks that are spread across Data Nodes in the cluster. Similarly, the secondary NameNodes keep a backup of active Name Node data periodically (typically every four hours). In addition to maintaining the data blocks, NameNode also maintains the health of each DataNode through the heartbeat mechanism. In any given Hadoop cluster, there can only be one active name node at a time. When an active NameNode goes down, the secondary NameNode takes up responsibility. A filesystem in HDFS is inspired from Unix-like filesystem data structures. Any request to create, edit, or delete HDFS files first gets recorded in journal nodes; journal nodes are responsible for coordinating with data nodes for propagating changes. Once the writing is complete, changes are flushed and a response is sent back to calling APIs. In case the flushing of changes in the journal files fails, the NameNode moves on to another node to record changes.

NameNode used to be single point of failure in Hadoop 1.X; however, in Hadoop 2.X, the secondary name node was introduced to handle the failure condition. In Hadoop 3.X, more than one secondary name node is supported. The same has been depicted in the overall architecture diagram.

DataNode

DataNode in the Hadoop ecosystem is primarily responsible for storing application data in distributed and replicated form. It acts as a slave in the system and is controlled by NameNode. Each disk in the Hadoop system is divided into multiple blocks, just like a traditional computer storage device. A block is a minimal unit in which the data can be read or written by the Hadoop filesystem. This ecosystem gives a natural advantage in slicing large files into these blocks and storing them across multiple nodes. The default block size of data node varies from 64 MB to 128 MB, depending upon Hadoop implementation. This can be changed through the configuration of data node. HDFS is designed to support very large file sizes and for write-once-read-many-based semantics.

Data nodes are primarily responsible for storing and retrieving these blocks when they are requested by consumers through Name Node. In Hadoop version 3.X, DataNode not only stores the data in blocks, but also the checksum or parity of the original blocks in a distributed manner. DataNodes follow the replication pipeline mechanism to store data in chunks propagating portions to other data nodes.

When a cluster starts, NameNode starts in a safe mode, until the data nodes register the data block information with NameNode. Once this is validated, it starts engaging with clients for serving the requests. When a data node starts, it first connects with Name Node, reporting all of the information about its data blocks' availability. This information is registered in NameNode, and when a client requests information about a certain block, NameNode points to the respective data not from its registry. The client then interacts with DataNode directly to read/write the data block. During the cluster processing, data node communicates with name node periodically, sending a heartbeat signal. The frequency of the heartbeat can be configured through configuration files.

We have gone through different key architecture components of the Apache Hadoop framework; we will be getting a deeper understanding in each of these areas in the next chapters.