Mastering Hadoop

By: Sandeep Karanth

MapReduce input

The Map step of a MapReduce job hinges on the nature of the input provided to the job. Because the Map step offers the greatest parallelism gains, crafting it carefully is important for job speedup. The input data is split into chunks, each called an InputSplit, and one Map task operates on each InputSplit. Two other classes, InputFormat and RecordReader, are also significant in handling inputs to Hadoop jobs.

The InputFormat class

The input to a Hadoop MapReduce job is specified through the InputFormat hierarchy of classes. The InputFormat class family has the following main functions:

  • Validating the input data. For example, checking for the presence of the file in the given path.

  • Splitting the input data into logical chunks (InputSplit) and assigning each of the splits to a Map task.

  • Instantiating a RecordReader object that works on each InputSplit and produces records for the Map task...
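
The responsibilities listed above can be illustrated with a small, self-contained sketch. It deliberately does not use the Hadoop API: SimpleSplit, SimpleRecordReader, and SimpleInputFormat are hypothetical stand-ins, and the method names only loosely mirror Hadoop's getSplits and createRecordReader. It also assumes that a split is a range of lines held in memory, whereas Hadoop's file-based splits are byte ranges of files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an InputSplit: a logical range of lines.
final class SimpleSplit {
    final int start;   // index of the first line in this split
    final int length;  // number of lines in this split

    SimpleSplit(int start, int length) {
        this.start = start;
        this.length = length;
    }
}

// Hypothetical stand-in for a RecordReader: iterates over the records of one split.
final class SimpleRecordReader {
    private final List<String> lines;
    private final int end;
    private int pos;

    SimpleRecordReader(List<String> lines, SimpleSplit split) {
        this.lines = lines;
        this.pos = split.start;
        this.end = split.start + split.length;
    }

    boolean hasNext() {
        return pos < end;
    }

    String next() {
        return lines.get(pos++);
    }
}

// Hypothetical stand-in for an InputFormat, covering the three functions above.
final class SimpleInputFormat {
    // 1. Validate the input (here: it must exist and be non-empty).
    static void validate(List<String> input) {
        if (input == null || input.isEmpty()) {
            throw new IllegalArgumentException("input is missing or empty");
        }
    }

    // 2. Split the input into logical chunks, one per Map task.
    static List<SimpleSplit> getSplits(List<String> input, int linesPerSplit) {
        validate(input);
        if (linesPerSplit <= 0) {
            throw new IllegalArgumentException("linesPerSplit must be positive");
        }
        List<SimpleSplit> splits = new ArrayList<>();
        for (int start = 0; start < input.size(); start += linesPerSplit) {
            int length = Math.min(linesPerSplit, input.size() - start);
            splits.add(new SimpleSplit(start, length));
        }
        return splits;
    }

    // 3. Create a reader that turns one split into records for a Map task.
    static SimpleRecordReader createRecordReader(List<String> input, SimpleSplit split) {
        return new SimpleRecordReader(input, split);
    }
}

class InputFormatSketch {
    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "c", "d", "e");
        for (SimpleSplit split : SimpleInputFormat.getSplits(input, 2)) {
            SimpleRecordReader reader = SimpleInputFormat.createRecordReader(input, split);
            while (reader.hasNext()) {
                // In a real job, each split would be processed by its own Map task.
                System.out.println("split@" + split.start + " -> " + reader.next());
            }
        }
    }
}
```

With five input lines and two lines per split, the sketch yields three splits, the last holding a single line. Each split gets its own reader, which is what allows the Map tasks to run independently of one another.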