Scaling Big Data with Hadoop and Solr, Second Edition

By: Hrishikesh Vijay Karambelkar

Apache Hadoop's ecosystem


Apache Hadoop enables the distributed processing of large datasets across clusters of commodity servers. It is designed to scale from a single server to thousands of commodity machines, each offering local computation and data storage.

The Apache Hadoop system comes with the following primary components:

  • Hadoop Distributed File System (HDFS)

  • MapReduce framework

The Hadoop Distributed File System (HDFS) stores data in a replicated and distributed manner across the various nodes that are part of the Hadoop cluster. Apache Hadoop also provides a distributed data-processing framework for large datasets that uses a simple programming model called MapReduce.

Note

A programming task that takes a set of data (key-value pairs) and converts it into another set of data is called a map task. The outputs of the map tasks are then combined by one or more reduce tasks. Overall, this approach to computing is called the MapReduce approach.

The MapReduce programming paradigm forms the heart of the Apache Hadoop framework, and any application that is deployed on this framework must comply with MapReduce programming. The following figure demonstrates how the MapReduce approach can be used to sort input documents:

MapReduce can also be used to transform data from one domain into a corresponding range. We will look at these concepts in more detail in the following chapters.
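To make the note above concrete, here is a minimal word-count sketch (not taken from this book) written against the org.apache.hadoop.mapreduce API; the class and field names are ours. The map task emits a (word, 1) pair per token, and the reduce task sums the counts for each word:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map task: converts each input record (key = byte offset, value = line of text)
        // into intermediate (word, 1) key-value pairs.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce task: combines all intermediate values for one word into a single count.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable value : values) {
                    sum += value.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }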

Hadoop has been used in environments where data from various sources needs to be processed using large server farms. Hadoop is capable of running its cluster of nodes on commodity hardware, and does not demand any high-end server configuration. Hadoop also brings scalability, enabling administrators to add and remove nodes dynamically. Some of the most notable users of Hadoop are companies such as Yahoo! and Facebook, which process petabytes of data every day and deliver rich analytics to consumers in the shortest possible time (Google's MapReduce and GFS papers inspired Hadoop's design in the first place). All this is supported by a large community of users who develop and enhance Hadoop every day. Apache Hadoop 2.0 onwards uses YARN (which stands for Yet Another Resource Negotiator).

Note

The Apache Hadoop 1.x MapReduce framework used the concepts of JobTracker and TaskTracker. If you are using an older Hadoop version, it is recommended that you move to Hadoop 2.x, which uses the newer MapReduce framework (also called MapReduce 2.0), released in 2013.

Core components

The following diagram demonstrates how the core components of Apache Hadoop work together to ensure the distributed execution of user jobs:

The ResourceManager (RM) in a Hadoop system is responsible for globally managing the resources of a cluster and coordinating their allocation. The RM consists of two components: the Scheduler and the ApplicationsManager. As the names suggest, the Scheduler allocates resources, whereas the ApplicationsManager is responsible for client interactions (accepting jobs, and identifying and assigning them to ApplicationMasters).

The ApplicationMaster (AM) is responsible for the complete lifecycle of a single application, that is, the life of each MapReduce job. It interacts with the RM to negotiate for resources.

The NodeManager (NM) is responsible for managing all the containers that run on a given node. It keeps a watch on resource usage (CPU, memory, and so on) and regularly reports resource health to the ResourceManager.
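As a small illustration of these roles (again, not from the book), the following sketch uses the YarnClient API to ask the ResourceManager for the node reports that it builds from the NodeManagers' heartbeats. It assumes a reachable RM configured through yarn-site.xml, and the getMemory() accessor is the Hadoop 2.x form:

    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ClusterNodeReport {
        public static void main(String[] args) throws Exception {
            // Connects to the ResourceManager address configured in yarn-site.xml
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // The RM aggregates the reports that each NodeManager sends via its heartbeat
            for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
                System.out.println(node.getNodeId()
                        + " memory=" + node.getCapability().getMemory() + " MB"
                        + " vcores=" + node.getCapability().getVirtualCores());
            }
            yarnClient.stop();
        }
    }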

All the metadata related to HDFS is stored on the NameNode. The NameNode is the master node that coordinates activities among the DataNodes, such as data replication, and it maintains the naming system, that is, filenames and the locations of their blocks on disk. The NameNode stores the mapping of blocks to DataNodes. In a Hadoop cluster, there can be only one active NameNode. The NameNode regulates access to its file system through HDFS-based APIs to create, open, edit, and delete HDFS files.
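The HDFS APIs mentioned above are exposed to Java clients through the org.apache.hadoop.fs.FileSystem class. The following is a minimal sketch, assuming the client's configuration points at the cluster's NameNode; the file path is purely illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsFileExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // client handle backed by the NameNode

            Path file = new Path("/tmp/hello.txt");     // hypothetical path

            // Create: the NameNode records the file and allocates blocks on DataNodes
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("Hello HDFS");
            }

            // Open: block locations are resolved through the NameNode,
            // and the data itself streams from the DataNodes
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }

            // Delete the file (non-recursive)
            fs.delete(file, false);
            fs.close();
        }
    }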

Because of this central role, the NameNode was identified as the single point of failure in a Hadoop system. To compensate for this, the Hadoop framework introduced the SecondaryNameNode, which periodically checkpoints the NameNode's metadata; Hadoop 2.x additionally supports a standby NameNode that stays in sync with the active NameNode and can take over whenever it becomes unavailable.

DataNodes are the slaves deployed on all the worker nodes in a Hadoop cluster. A DataNode is responsible for storing the application's data. Each file uploaded to HDFS is split into multiple blocks, and these data blocks are stored on different DataNodes. The default HDFS block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x. Each HDFS block is mapped to two files on the DataNode: one file holds the block data, while the other holds its checksum.

When Hadoop starts, each DataNode connects to the NameNode and announces its availability to serve requests. During this startup handshake, the namespace ID and software versions are verified by the NameNode, and the DataNode sends a block report describing all the data blocks it holds. At runtime, each DataNode periodically sends a heartbeat signal to the NameNode to confirm its availability; the default interval between two heartbeats is 3 seconds. If the NameNode does not receive a heartbeat for 10 minutes (by default), it assumes the DataNode is unavailable and replicates that DataNode's blocks to other DataNodes.
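Both intervals are driven by standard HDFS configuration keys. The sketch below simply reads them and shows how the dead-node timeout is commonly derived; the default values and the formula reflect a stock Hadoop 2.x configuration and should be treated as assumptions:

    import org.apache.hadoop.conf.Configuration;

    public class HeartbeatSettings {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // DataNode -> NameNode heartbeat interval, in seconds (default 3)
            long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3);
            // NameNode re-check interval for stale DataNodes, in milliseconds (default 5 minutes)
            long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300000);
            // The NameNode treats a DataNode as dead after roughly
            // 2 * recheck-interval + 10 * heartbeat-interval (about 10.5 minutes by default)
            long deadAfterSec = 2 * recheckMs / 1000 + 10 * heartbeatSec;
            System.out.println("DataNode considered dead after ~" + deadAfterSec + " seconds");
        }
    }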

When a client submits a job to Hadoop, the following activities take place (a minimal job submission sketch follows the list):

  1. The ApplicationsManager launches an AM for the submitted client job/application after negotiating a container on a specific node.

  2. Once started, the AM registers itself with the RM. All client communication with the AM happens through the RM.

  3. The AM launches containers with the help of the NodeManagers.

  4. Each container that executes a MapReduce task reports its progress to the AM through an application-specific protocol.

  5. For any HDFS data access request, the NameNode returns the nearest DataNode locations for the requested blocks from its metadata repository.
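As referenced above, here is a minimal driver sketch that reuses the hypothetical WordCount classes from the earlier example; calling waitForCompletion() submits the job and triggers the sequence of steps just described:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");            // job definition sent to the RM
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);      // from the earlier sketch
            job.setReducerClass(WordCount.SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));     // existing HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory must not exist yet
            // The RM allocates a container for the AM, the AM registers with the RM,
            // and NodeManagers launch the map and reduce containers.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }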

Understanding Hadoop's ecosystem

Although Hadoop provides excellent storage capabilities along with the MapReduce programming framework, translating conventional programs into MapReduce remains a challenging task, because MapReduce is a completely different programming paradigm. The Hadoop ecosystem is designed to provide a rich set of applications and development frameworks on top of it. The following block diagram shows Apache Hadoop's ecosystem:

We have already seen MapReduce, HDFS, and YARN; let us now look at the remaining blocks.

HDFS is an append-only file system; it does not allow data to be modified in place. Apache HBase is a distributed, random-access, column-oriented database. HBase runs directly on top of HDFS and allows application developers to read and write HDFS data directly. HBase does not support SQL; hence, it is also called a NoSQL database. However, it provides a command line-based interface, as well as a rich set of APIs to update the data. The data in HBase is stored as key-value pairs in HDFS.
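A minimal sketch of the HBase client API showing a random write and read by row key; the table name, column family, and values are assumptions, and the table is expected to exist already:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseKeyValueExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {  // hypothetical table with family 'info'
                // Write: row key -> column family:qualifier -> value
                Put put = new Put(Bytes.toBytes("row-1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
                table.put(put);

                // Random read by row key
                Result result = table.get(new Get(Bytes.toBytes("row-1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            }
        }
    }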

Apache Pig provides another abstraction layer on top of MapReduce. It's a platform for the analysis of very large datasets that runs on HDFS. It also provides an infrastructure layer, consisting of a compiler that produces sequences of MapReduce programs, along with a language layer consisting of the query language Pig Latin. Pig was initially developed at Yahoo! Research to enable developers to create ad-hoc MapReduce jobs for Hadoop. Since then, many big organizations such as eBay, LinkedIn, and Twitter have started using Apache Pig.
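Pig Latin scripts are usually run from the Grunt shell or as script files; the following sketch instead embeds a few statements through the PigServer Java API (the paths and field names are hypothetical). Each store() call is compiled into a sequence of MapReduce jobs:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigJobExample {
        public static void main(String[] args) throws Exception {
            // Run Pig Latin statements on the cluster using MapReduce execution
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            pig.registerQuery("logs = LOAD '/data/access_logs' USING PigStorage('\\t') AS (url:chararray, hits:int);");
            pig.registerQuery("grouped = GROUP logs BY url;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, SUM(logs.hits);");
            // store() triggers compilation of the plan into MapReduce jobs and writes the result to HDFS
            pig.store("counts", "/data/url_hit_counts");
            pig.shutdown();
        }
    }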

Apache Hive provides data warehouse capabilities on top of big data. Hive runs on top of Apache Hadoop and uses HDFS for storing its data. The Apache Hadoop framework is difficult to understand, and writing MapReduce-based programs requires a different approach from traditional programming. With Hive, developers do not write MapReduce at all. Hive provides an SQL-like query language called HiveQL to application developers, enabling them to quickly write ad-hoc queries similar to RDBMS SQL queries.
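A minimal sketch of running a HiveQL query from Java over the HiveServer2 JDBC driver; the host, port, credentials, and table are assumptions, and the hive-jdbc jar must be on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // Explicitly register the HiveServer2 JDBC driver
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://localhost:10000/default";   // hypothetical HiveServer2 endpoint
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {
                // HiveQL looks like SQL; under the hood it is compiled into MapReduce jobs
                ResultSet rs = stmt.executeQuery(
                        "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page");
                while (rs.next()) {
                    System.out.println(rs.getString("page") + " -> " + rs.getLong("hits"));
                }
            }
        }
    }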

Many components of the Apache Hadoop ecosystem coordinate with each other through Apache ZooKeeper, which forms a mandatory part of the ecosystem. Apache ZooKeeper is responsible for maintaining coordination among the various nodes. Besides coordinating among nodes, it also maintains configuration information and provides group services to the distributed system. Unlike other components of the ecosystem, Apache ZooKeeper can be used independently of Hadoop. Because it manages its information in memory, it offers distributed coordination at high speed.
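A minimal sketch of the ZooKeeper Java client storing and reading a small piece of shared configuration as a znode; the ensemble address and the znode path are assumptions:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperConfigExample {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // Connect to a ZooKeeper ensemble (hypothetical address)
            ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();

            // Store a small piece of shared configuration as a znode
            String path = "/hadoop-demo-config";
            if (zk.exists(path, false) == null) {
                zk.create(path, "batch-size=100".getBytes(),
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }
            // Any node in the cluster can read the same value
            byte[] data = zk.getData(path, false, null);
            System.out.println("shared config: " + new String(data));
            zk.close();
        }
    }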

Apache Mahout is an open source machine learning library that empowers Hadoop users with analytical capabilities, such as clustering and data mining, over a distributed Hadoop cluster. Mahout is highly effective over large datasets; its algorithms are optimized to run on the MapReduce framework over HDFS.

Apache HCatalog provides metadata management services on top of Apache Hadoop. This means that all software running on Hadoop can use HCatalog to store its schemas in HDFS. HCatalog helps any third-party software to create, edit, and expose (using REST APIs) the generated metadata or table definitions. As a result, users and scripts can run on Hadoop without actually knowing where the data is physically stored in HDFS. HCatalog also provides DDL (which stands for Data Definition Language) commands with which the requested MapReduce, Pig, and Hive jobs can be queued for execution and later monitored for progress as and when required.

Apache Ambari provides a set of tools to monitor the Apache Hadoop cluster, hiding the complexities of the Hadoop framework. It offers features such as an installation wizard, system alerts and metrics, provisioning and management of the Hadoop cluster, and job performance analysis. Ambari exposes RESTful APIs to administrators to allow integration with other software. Apache Oozie is a workflow scheduler for Hadoop jobs. It can be used with MapReduce as well as Pig scripts to run the jobs. Apache Chukwa is another monitoring application for large distributed systems; it runs on top of HDFS and MapReduce.

Apache Sqoop is a tool designed to transfer large datasets into and out of Hadoop efficiently. Apache Sqoop allows application developers to easily import data from, and export data to, specific data sources such as relational databases, enterprise data warehouses, and custom applications. Internally, Apache Sqoop uses map tasks to perform the import/export efficiently on a Hadoop cluster; each mapper transfers a slice of data between HDFS and the data source. In this way, Apache Sqoop establishes connectivity between non-Hadoop data sources and HDFS.

Apache Flume provides a framework to populate Hadoop with data from non-conventional data sources. A typical use of Apache Flume is log aggregation. Apache Flume is a distributed data collection service that extracts data from heterogeneous sources, aggregates it, and stores it in HDFS. Most of the time, Apache Flume is used as an ETL (which stands for Extract-Transform-Load) utility in various Hadoop cluster implementations.