Mastering Hadoop 3

By: Chanchal Singh, Manish Kumar

Hadoop origins and timelines


Hadoop is changing the way people think about data, so we need to know what led to the origin of this magical innovation. Who developed Hadoop, and why? What problems existed before Hadoop, and how did it solve them? What challenges were encountered during its development? How did Hadoop evolve from version 1 to version 3? Let's walk through the origins of Hadoop and its journey to version 3.

Origins

In 1997, Doug Cutting, a co-founder of Hadoop, started working on Lucene, a full-text search library written entirely in Java. Lucene analyzes text and builds an index on it; an index is simply a mapping from terms to their locations, so it can quickly return all locations that match a particular search pattern. A few years later, Doug made the Lucene project open source; it got a tremendous response from the community and later became an Apache Software Foundation project.

Once Doug realized that enough people were looking after Lucene, he started focusing on indexing web pages. Mike Cafarella joined him to develop a product that could index web pages, and they named this project Apache Nutch. Nutch is also known as a subproject of Apache Lucene, since it uses the Lucene library to index the content of web pages. With hard work, they made good progress and deployed Nutch on a single machine that was able to index around 100 pages per second.

Scalability is something that people often don't consider while developing the initial versions of an application. This was also true of Doug and Mike: the number of web pages that could be indexed was limited to 100 million. To index more pages, they increased the number of machines. However, adding nodes created operational problems, because they did not have an underlying cluster manager to perform operational tasks. They wanted to focus on optimizing and developing a robust Nutch application without worrying about scalability issues.

Doug and Mike wanted a system with the following features:

  • Fault tolerance: The system should handle the failure of any machine automatically and in an isolated manner, so that the failure of one machine does not affect the entire application.
  • Load balancing: If one machine fails, its work should be redistributed automatically and fairly among the working machines.
  • Durability: Once data is written to disk, it should never be lost, even if one or two machines fail.

They spent a few months working on a system that could fulfill these requirements. Around the same time, however, Google published its Google File System paper. When they read it, they found that it described solutions to problems very similar to the ones they were trying to solve. They decided to build an implementation based on this research paper and started developing the Nutch Distributed File System (NDFS), which they completed in 2004.

With the help of the Google File System design, they solved the scalability and fault tolerance problems discussed previously, using the concepts of blocks and replication. Each file is split into 64 MB blocks (the size is configurable), and each block is replicated three times by default, so that if a machine holding one copy fails, the data can be served from another machine. This implementation solved the operational problems they had faced with Apache Nutch. A small illustration of these settings follows, and the next section explains the origin of MapReduce.
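
To make the block and replication idea concrete, here is a minimal sketch using the present-day HDFS Java client API (not the original NDFS code). The dfs.blocksize and dfs.replication properties and the FileSystem.create overload shown are standard Hadoop APIs, but the NameNode URI (hdfs://namenode:8020) and the file path are placeholders; also note that modern Hadoop defaults to 128 MB blocks rather than the 64 MB mentioned above.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettingsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side defaults: 64 MB blocks (as in early HDFS) and 3 replicas.
        conf.set("dfs.blocksize", String.valueOf(64L * 1024 * 1024));
        conf.set("dfs.replication", "3");

        // hdfs://namenode:8020 is a placeholder for your NameNode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Per-file override: create(path, overwrite, bufferSize, replication, blockSize).
        short replication = 3;
        long blockSize = 64L * 1024 * 1024;
        try (FSDataOutputStream out =
                 fs.create(new Path("/tmp/example.txt"), true, 4096, replication, blockSize)) {
            out.writeUTF("Each file is split into blocks and each block is replicated.");
        }
    }
}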

MapReduce origin

Doug and Mike then started working on an algorithm that could process the data stored on NDFS. They wanted a system whose performance could be doubled simply by doubling the number of machines running the program. At the same time, Google published MapReduce: Simplified Data Processing on Large Clusters (https://research.google.com/archive/mapreduce.html).

The core ideas behind the MapReduce model are parallelism, fault tolerance, and data locality. Data locality means that the program is executed where the data is stored, instead of bringing the data to the program. MapReduce was integrated into Nutch in 2005. In 2006, Doug created a new incubating project consisting of HDFS (Hadoop Distributed File System, which grew out of NDFS), MapReduce, and Hadoop Common.
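
As an illustration of this model, the classic word count job counts how often each word appears in the input: the map tasks run in parallel next to the data blocks, and the reduce tasks aggregate their output. This is a minimal sketch against the standard org.apache.hadoop.mapreduce API rather than anything from the original Nutch code; the input and output paths are assumed to be passed as command-line arguments.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The map task runs on the node holding the input block (data locality)
    // and emits (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // The reduce task receives all the counts for a given word and sums them.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}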

At that time, Yahoo! was struggling with the performance of its backend search. Engineers at Yahoo! already knew the benefits of the Google File System and MapReduce implemented at Google. Yahoo! decided to adopt Hadoop and hired Doug to help its engineering team do so. In 2007, a few more companies started contributing to Hadoop, and Yahoo! reported that it was running 1,000-node Hadoop clusters.

NameNodes and DataNodes have specific roles in managing the overall cluster: the NameNode is responsible for maintaining metadata, while the DataNodes store the actual blocks. The MapReduce engine in Hadoop 1 consists of a JobTracker and TaskTrackers, and its scalability is limited to roughly 4,000 nodes and 40,000 concurrent tasks, because the overall work of scheduling and tracking is handled by the JobTracker alone. YARN was introduced in Hadoop version 2 to overcome these scalability issues and to separate resource management from job scheduling. It gave Hadoop a new lease of life, and Hadoop became a more robust, faster, and more scalable system.
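
As a small illustration of this separation, the sketch below uses the standard YarnClient API to ask the ResourceManager which NodeManagers it is tracking and what resources each one offers. It is only a hedged example, not taken from any Hadoop release: it assumes a reachable cluster whose configuration (yarn-site.xml) is available on the client's classpath.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ClusterNodesExample {
    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml from the classpath to locate the ResourceManager.
        Configuration conf = new Configuration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // The ResourceManager tracks every NodeManager and its resources;
        // scheduling is no longer handled by a single JobTracker.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " -> "
                    + node.getCapability().getMemorySize() + " MB, "
                    + node.getCapability().getVirtualCores() + " vcores");
        }
        yarnClient.stop();
    }
}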

Timelines

We will talk about MapReduce and HDFS in detail later. For now, let's go through the evolution of Hadoop, year by year:

2003

  • Research paper for the Google File System released

2004

  • Research paper for MapReduce released

2006

  • JIRA, mailing lists, and other project documentation created for Hadoop
  • Hadoop created by moving NDFS and MapReduce out of Nutch
  • Doug Cutting named the project Hadoop after his son's yellow toy elephant
  • Hadoop 0.1.0 released
  • 1.8 TB of data sorted on 188 nodes in 47.9 hours
  • 300 machines deployed at Yahoo! for the Hadoop cluster
  • Cluster size at Yahoo! increased to 600 machines

2007

  • Two clusters of 1,000 machines run by Yahoo!
  • Hadoop released with HBase
  • Apache Pig created at Yahoo!

2008

  • JIRA for YARN opened
  • Twenty companies listed on the Powered by Hadoop page
  • Web index at Yahoo! moved to Hadoop
  • 10,000-core Hadoop cluster used to generate Yahoo!'s production search index
  • First Hadoop Summit held
  • World record set for the fastest terabyte sort (1 TB in 209 seconds) on a 910-node Hadoop cluster, winning the terabyte sort benchmark
  • Yahoo! cluster loading 10 TB into Hadoop every day
  • Cloudera founded as a Hadoop distribution company
  • Google claimed to sort 1 TB in 68 seconds with its MapReduce implementation

2009

  • 24,000 machines across seven clusters run at Yahoo!
  • Hadoop set a record by sorting a petabyte of data
  • Yahoo! claimed to sort 1 TB in 62 seconds
  • Second Hadoop Summit held
  • Hadoop Core renamed Hadoop Common
  • MapR founded as a Hadoop distribution company
  • HDFS became a separate subproject
  • MapReduce became a separate subproject

2010

  • Kerberos authentication added to Hadoop
  • Stable version of Apache HBase released
  • Yahoo! ran 4,000 nodes to process 70 PB
  • Facebook ran a 2,300-node cluster to process 40 PB
  • Apache Hive released
  • Apache Pig released

2011

  • Apache ZooKeeper released
  • 200,000 lines of code contributed by Facebook, LinkedIn, eBay, and IBM
  • Hadoop won the top prize at the Media Guardian Innovation Awards
  • Hortonworks founded by Rob Bearden and Eric Baldeschwieler, core members of the Hadoop team at Yahoo!
  • 42,000 Hadoop nodes at Yahoo!, processing petabytes of data

2012

  • Hadoop community moved to integrate YARN
  • Hadoop Summit held in San Jose
  • Hadoop version 1.0 released

2013

  • Yahoo! deployed YARN in production
  • Hadoop version 2.2 released

2014

  • Apache Spark became an Apache top-level project
  • Hadoop version 2.6 released

2015

  • Hadoop version 2.7 released

2017

  • Hadoop version 2.8 released

Hadoop 3 introduces a few more important changes, which we will discuss in the upcoming sections of this chapter.