Mastering Hadoop 3

By: Chanchal Singh, Manish Kumar

Overview of this book

Apache Hadoop is one of the most popular big data solutions for distributed storage and for processing large chunks of data. With Hadoop 3, Apache promises to provide a high-performance, more fault-tolerant, and highly efficient big data processing platform, with a focus on improved scalability and increased efficiency. This guide walks you through HDFS, YARN, MapReduce, and Hadoop 3 concepts: you'll learn how Hadoop works internally, study advanced concepts of the different ecosystem tools, discover solutions to real-world use cases, and understand how to secure your cluster. You'll be able to address common challenges such as using Kafka efficiently, designing low-latency, reliable message delivery systems with Kafka, and handling high data volumes. As you advance, you'll discover how to address major challenges when building an enterprise-grade messaging system, and how to use different stream processing systems along with Kafka to fulfil your enterprise goals. By the end of this book, you'll have a complete understanding of how components in the Hadoop ecosystem are effectively integrated to implement a fast and reliable data pipeline, and you'll be equipped to tackle a range of real-world problems in data pipelines.

Overview of Hadoop 3 and its features


The first alpha release of Hadoop 3.0.0, version 3.0.0-alpha1, shipped on 30 August 2016. It was the first in a series of planned alphas and betas that ultimately led to the 3.0.0 GA release. The intention behind this alpha release was to quickly gather and act on feedback from downstream users.

With any such release, there are key drivers that lead to its creation. These drivers create benefits that ultimately help Hadoop-based enterprise applications function better. Before we discuss the features of Hadoop 3, you should understand these driving factors. Some of the driving factors behind the release of Hadoop 3 are as follows:

  • A lot of bug fixes and performance improvements: Hadoop has a growing open source community of developers who regularly add major and minor changes or improvements to the Hadoop trunk repository. These changes accumulated day by day and could not all be shipped in minor 2.x releases; they had to wait for a major version release. Hence, it was decided to release the majority of these trunk changes with Hadoop 3.
  • Overhead due to the data replication factor: As you may be aware, HDFS has a default replication factor of 3. This improves fault tolerance, data locality, and load balancing of jobs among DataNodes. However, it comes at a storage overhead of around 200%. For infrequently accessed datasets with low I/O activity, the replica blocks are rarely touched during normal operations, yet they consume the same amount of storage as the primary copies. To mitigate this overhead for infrequently accessed data, Hadoop 3 introduced a major feature called erasure coding, which stores data durably while using significantly less space (see the example after this list).
  • Improving the existing YARN Timeline service: YARN Timeline service version 1 has limitations that impact reliability, performance, and scalability. For example, it uses local-disk-based LevelDB storage, which cannot scale to a high number of requests. Moreover, the Timeline server is a single point of failure. To mitigate these drawbacks, the YARN Timeline service has been re-architected as version 2 in the Hadoop 3 release (see the configuration sketch after this list).
  • Optimizing the map output collector: It is well known that correctly written native code executes faster than its Java equivalent. Accordingly, Hadoop 3 adds a native implementation of the map output collector, invoked from the Java-based MapReduce framework through the Java Native Interface (JNI), which can speed up mapper tasks by approximately two to three times. This is particularly useful for shuffle-intensive operations (see the configuration sketch after this list).
  • The need for a higher availability factor for the NameNode: Hadoop is a fault-tolerant platform with support for handling multiple DataNode failures. The NameNode was more constrained: versions prior to Hadoop 3 support only two NameNodes, Active and Standby. While that is a highly available solution, on the failure of the active (or standby) NameNode the cluster falls back to a non-HA mode, which does not accommodate a high number of failures. In Hadoop 3, support for more than one standby NameNode has been introduced (see the configuration sketch after this list).
  • Dependency on the Linux ephemeral port range: Linux ephemeral ports are short-lived ports that the operating system (OS) assigns from a predefined range when a process requests any available port; the port is released once the related connection terminates. In version 2 and earlier, the default ports of many Hadoop services fell within the Linux ephemeral port range, so starting these services could fail to bind to a port that another process had already claimed. In Hadoop 3, these default ports have been moved out of the ephemeral port range (see the example after this list).
  • Disk-level data skew: A DataNode manages multiple disks (or drives). Adding or replacing disks can lead to significant data skew within a DataNode. To rebalance data among the disks within a DataNode, Hadoop 3 introduced a CLI utility called hdfs diskbalancer (see the example after this list).
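
To make the erasure coding savings concrete: with the built-in RS-6-3-1024k Reed-Solomon policy, every six data blocks are stored alongside three parity blocks, giving a storage overhead of 3/6 = 50% (versus 200% for 3x replication) while still tolerating the loss of any three blocks. The following is a minimal sketch of applying a policy with the hdfs ec command; the /data/cold path is hypothetical, and the exact set of available policies depends on your distribution:

    # List the erasure coding policies known to the cluster
    hdfs ec -listPolicies

    # Enable a built-in Reed-Solomon policy and apply it to a directory
    hdfs ec -enablePolicy -policy RS-6-3-1024k
    hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k

    # Verify which policy is now in effect
    hdfs ec -getPolicy -path /data/cold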
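
For the re-architected Timeline service, version 2 is selected through configuration rather than being the default in early 3.x releases. The following yarn-site.xml fragment is a minimal sketch, assuming the v2 backend (HBase by default) has already been set up for the cluster:

    <property>
      <name>yarn.timeline-service.version</name>
      <value>2.0f</value>
    </property>
    <property>
      <name>yarn.timeline-service.enabled</name>
      <value>true</value>
    </property>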
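
The native map output collector is opted into per job rather than enabled globally. The following property is a sketch based on the Hadoop native-task module; it only takes effect where the native library has been built for your platform, and jobs with custom key types may not be supported:

    <property>
      <name>mapreduce.job.map.output.collector.class</name>
      <value>org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator</value>
    </property>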
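
Running more than one standby NameNode mostly amounts to listing additional NameNode IDs for the nameservice. The following hdfs-site.xml fragment is a minimal sketch, assuming a hypothetical nameservice called mycluster with three NameNodes (nn1, nn2, nn3); each NameNode also needs its own RPC and HTTP addresses, of which only one is shown:

    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2,nn3</value>
    </property>
    <!-- One RPC address per NameNode; nn2 and nn3 follow the same pattern -->
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn1</name>
      <value>master1.example.com:8020</value>
    </property>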
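
You can see why the old defaults were fragile by checking your system's ephemeral range; on most Linux systems it starts at 32768, which covered old defaults such as 50070 (the NameNode web UI port, now 9870 in Hadoop 3):

    # Show the ephemeral port range the kernel assigns to outgoing connections
    cat /proc/sys/net/ipv4/ip_local_port_range
    # Typical output: 32768   60999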
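
The disk balancer follows a plan-then-execute workflow. The following is a minimal sketch, assuming dfs.disk.balancer.enabled is set to true in hdfs-site.xml and a hypothetical DataNode host dn1.example.com:

    # Generate a rebalancing plan for a specific DataNode
    hdfs diskbalancer -plan dn1.example.com

    # Execute the plan; the plan file path is printed by the previous step
    hdfs diskbalancer -execute <plan-file>.plan.json

    # Check the progress of the data moves
    hdfs diskbalancer -query dn1.example.com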

By now, you should have a clear understanding of why certain features were introduced in Hadoop 3 and the benefits they bring. We will look at these features in detail throughout this book; the intent of this section was to give you a high-level overview of the major features introduced in Hadoop 3 and why they were introduced. In the next section, we will look into the Hadoop logical view.