Mastering Hadoop 3

By: Timothy Wong, Chanchal Singh, Manish Kumar

Overview of this book

Apache Hadoop is one of the most popular big data solutions for distributed storage and for processing large chunks of data. With Hadoop 3, Apache promises a high-performance, more fault-tolerant, and highly efficient big data processing platform, with a focus on improved scalability. With this guide, you’ll learn how Hadoop works internally, study advanced concepts of the different ecosystem tools, discover solutions to real-world use cases, and understand how to secure your cluster. The book walks you through HDFS, YARN, MapReduce, and Hadoop 3 concepts. You’ll be able to address common challenges such as using Kafka efficiently, designing low-latency Kafka systems with reliable message delivery, and handling high data volumes. As you advance, you’ll discover how to address major challenges when building an enterprise-grade messaging system, and how to use different stream processing systems along with Kafka to fulfil your enterprise goals. By the end of this book, you’ll have a complete understanding of how components in the Hadoop ecosystem are integrated to implement a fast and reliable data pipeline, and you’ll be equipped to tackle a range of real-world problems in data pipelines.

Preface

Who this book is for

What this book covers

To get the most out of this book

Get in touch

Section 1: Introduction to Hadoop 3

Journey to Hadoop 3

Hadoop origins and Timelines

Overview of Hadoop 3 and its features

Hadoop logical view

Hadoop distributions

Points to remember

Summary

Deep Dive into the Hadoop Distributed File System

Technical requirements

Defining HDFS

Deep dive into the HDFS architecture

NameNode internals

DataNode internals

Quorum Journal Manager (QJM)

HDFS high availability in Hadoop 3.x

Data management

HDFS reads and writes

Managing disk-skewed data in Hadoop 3.x

Lazy persist writes in HDFS

Erasure encoding in Hadoop 3.x

HDFS common interfaces

HDFS command reference

Points to remember

Summary

YARN Resource Management in Hadoop

Architecture

Introduction to YARN job scheduling

FIFO scheduler

Capacity scheduler

Fair scheduler

Resource Manager high availability

Node labels

YARN Timeline server in Hadoop 3.x

Opportunistic containers in Hadoop 3.x

Docker containers in YARN

YARN REST APIs

YARN command reference

Summary

Internals of MapReduce

Technical requirements

Deep dive into the Hadoop MapReduce framework

YARN and MapReduce

MapReduce workflow in the Hadoop framework

Common MapReduce patterns

MapReduce use case

Optimizing MapReduce

Summary

Section 2: Hadoop Ecosystem

SQL on Hadoop

Technical requirements

Presto – introduction

Hive

Impala

Summary

Real-Time Processing Engines

Technical requirements

Spark

Apache Flink

Storm/Heron

Summary

Widely Used Hadoop Ecosystem Components

Technical requirements

Pig

HBase

Kafka

Flume

Summary

Section 3: Hadoop in the Real World

Designing Applications in Hadoop

Technical requirements

File formats

Data compression

Serialization

Data ingestion

Data processing

Common batch processing pattern

Airflow for orchestration

Data governance

Summary

Real-Time Stream Processing in Hadoop

Technical requirements

What are streaming datasets?

Stream data ingestion

Common stream data processing patterns

Streaming design considerations

Micro-batch processing case study

Real-time processing case study

Summary

Machine Learning in Hadoop

Technical requirements

Machine learning steps

Common machine learning challenges

Spark machine learning

Hadoop and R

Mahout

Machine learning case study in Spark

Summary

Hadoop in the Cloud

Technical requirements

Logical view of Hadoop in the cloud

Network

Managing resources

Data pipelines

High availability (HA)

Summary

Hadoop Cluster Profiling

Introduction to benchmarking and profiling

HDFS

NameNode

YARN

Hive

Mix-workloads

Summary

Section 4: Securing Hadoop

Who Can Do What in Hadoop

Hadoop security pillars

System security

Kerberos authentication

User authorization

List of security features that have been worked upon in Hadoop 3.0

Summary

Network and Data Security

Securing Hadoop networks

Encryption

Masking

Filtering

Summary

Monitoring Hadoop

General monitoring

Security monitoring

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Journey to Hadoop 3

Hadoop has come a long way since its inception. Powered by a community of open source enthusiasts, it has seen three major version releases. Version 1 arrived six years after the first release of Hadoop. With this release, the Hadoop platform gained the full capability to run MapReduce distributed computing on Hadoop Distributed File System (HDFS) distributed storage. It brought some of the most significant performance improvements made to the platform, along with full support for security. This release also included many improvements for HBase.

Version 2 was a significant leap over version 1 of Hadoop. It introduced YARN, a sophisticated general-purpose resource manager and job scheduling component. HDFS high availability, HDFS federation, and HDFS snapshots were some of the other prominent features introduced in version 2...
