The Map step of a MapReduce job hinges on the nature of the input provided to the job. The Map step provides the maximum parallelism gains, so crafting this step smartly is important for job speedup. Data is split into chunks, and Map tasks operate on each of these chunks in parallel. Each chunk is called an InputSplit, and a Map task is asked to operate on a single InputSplit. There are two other classes, InputFormat and RecordReader, which are significant in handling inputs to Hadoop jobs.
The InputFormat class is responsible for:

- Splitting the input data into logical chunks (InputSplits) and assigning each of these splits to a Map task
- Providing a RecordReader object that can work on each InputSplit and produce records for the Map task
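The two responsibilities above can be illustrated with a small, self-contained sketch. This is not the Hadoop API (the real classes live in org.apache.hadoop.mapreduce and operate on HDFS blocks); all names here (`Split`, `getSplits`, `readRecords`) are hypothetical analogues, shown only to make the split-then-read flow concrete.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual analogue (NOT the Hadoop API): an InputFormat-like component
// first divides the input into logical chunks, then supplies a reader that
// turns each chunk into records for a Map task.
public class InputSplitSketch {

    // Analogue of an InputSplit: a byte range over the input.
    record Split(int start, int length) {}

    // Analogue of InputFormat.getSplits(): divide the input into
    // fixed-size logical chunks; one Map task would process each.
    static List<Split> getSplits(String data, int splitSize) {
        List<Split> splits = new ArrayList<>();
        for (int start = 0; start < data.length(); start += splitSize) {
            splits.add(new Split(start, Math.min(splitSize, data.length() - start)));
        }
        return splits;
    }

    // Analogue of a RecordReader: turn one split's bytes into records
    // (here, comma-separated tokens) for the Map task to consume. A real
    // RecordReader must also handle records that span split boundaries.
    static List<String> readRecords(String data, Split split) {
        String chunk = data.substring(split.start(), split.start() + split.length());
        List<String> records = new ArrayList<>();
        for (String token : chunk.split(",")) {
            if (!token.isEmpty()) {
                records.add(token);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "a,b,c,d,e,f";
        List<Split> splits = getSplits(data, 4);
        for (Split s : splits) {
            System.out.println(readRecords(data, s));
        }
    }
}
```

The key design point the sketch mirrors is the separation of concerns: splitting decides *how much* data each Map task sees, while record reading decides *how* those bytes become key-value records.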