Mastering Hadoop

By: Sandeep Karanth

Overview of this book

Credits

About the Author

Acknowledgments

About the Reviewers

www.PacktPub.com

Preface

Hadoop 2.X

The inception of Hadoop

The evolution of Hadoop

Hadoop distributions

Advanced MapReduce

MapReduce input

The RecordReader class

Hadoop's "small files" problem

Filtering inputs

The Reduce task

MapReduce output

MapReduce job counters

Handling data joins

Advanced Pig

Different modes of execution

Complex data types in Pig

Compiling Pig scripts

Development and debugging aids

The advanced Pig operators

User-defined functions

Pig performance optimizations

Advanced Hive

The Hive architecture

Hive query optimizers

UDF, UDAF, and UDTF

Serialization and Hadoop I/O

Data serialization in Hadoop

Avro serialization

YARN – Bringing Other Paradigms to Hadoop

The YARN architecture

Developing YARN applications

Monitoring YARN

Job scheduling in YARN

Storm on YARN – Low Latency Processing in Hadoop

Batch processing versus streaming

Hadoop on the Cloud

Cloud computing characteristics

Hadoop on the cloud

Amazon Elastic MapReduce (EMR)

HDFS Replacements

HDFS – advantages and drawbacks

Implementing a filesystem in Hadoop

Implementing an S3 native filesystem in Hadoop

HDFS Federation

Limitations of the older HDFS architecture

Architecture of HDFS Federation

HDFS high availability

HDFS block placement

Hadoop Security

The security pillars

Authentication in Hadoop

Authorization in Hadoop

Data confidentiality in Hadoop

Audit logging in Hadoop

Analytics Using Hadoop

Data analytics workflow

Machine learning

Document analysis using Hadoop and Mahout

Hadoop for Microsoft Windows

Deploying Hadoop on Microsoft Windows

Index

Chapter 1. Hadoop 2.X

	"There's nothing that cannot be found through some search engine or on the Internet somewhere."
	--Eric Schmidt, Executive Chairman, Google

Hadoop is the de facto open source framework used in the industry for large-scale, massively parallel, and distributed data processing. It provides a computation layer for parallel and distributed processing. Closely associated with the computation layer is a highly fault-tolerant data storage layer, the Hadoop Distributed File System (HDFS). Both the computation and data layers run on commodity hardware, which is inexpensive, easily available, and compatible with other similar hardware.

In this chapter, we will look at the journey of Hadoop, with a focus on the features that make it enterprise-ready. With six years of development and deployment under its belt, Hadoop has moved from a framework that supports the MapReduce paradigm exclusively to a more generic cluster-computing framework. This chapter covers the following topics:

An outline of Hadoop's code evolution, with major milestones highlighted
An introduction to the changes that Hadoop has undergone as it has moved from 1.X releases to 2.X releases, and how it is evolving into a generic cluster-computing framework
An introduction to the options available for enterprise-grade Hadoop, and the parameters for their evaluation
An overview of a few popular enterprise-ready Hadoop distributions