
Optimizing Hadoop for MapReduce

By: Khaled Tannir



Enhancing map tasks


When executing a MapReduce job, the Hadoop framework runs the job through a well-defined sequence of processing phases. Except for the user-defined functions (map, reduce, and combiner), the execution behavior of the other MapReduce phases is generic across different MapReduce jobs. Their processing time depends mainly on the amount of data flowing through each phase and on the performance of the underlying Hadoop cluster.

In order to enhance MapReduce performance, you first need to benchmark these phases by running a set of jobs with different amounts of data (per map/reduce task). Running these jobs lets you collect measurements such as the duration of, and the amount of data processed by, each phase; you can then analyze these measurements (for each phase) to derive the platform's scaling functions.
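As a rough sketch of what "deriving a scaling function" can mean in practice, the snippet below fits a simple linear model (duration = a × data size + b) to per-phase timings collected from several benchmark runs. The phase name and the sample numbers are illustrative assumptions, not measurements from the book.

```python
# Hypothetical per-phase measurements collected from benchmark jobs:
# each sample is (data processed in MB, phase duration in seconds).
# These numbers are made up for illustration only.

def fit_scaling(samples):
    """Least-squares fit of duration = a * data + b for one phase."""
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Example: spill-phase timings from four hypothetical runs.
spill_samples = [(64, 1.9), (128, 3.6), (256, 7.1), (512, 14.2)]
a, b = fit_scaling(spill_samples)

# The fitted model can then predict phase cost at a new input size.
predicted = a * 1024 + b
print(f"estimated spill time for 1024 MB: {predicted:.1f} s")
```

The same fit would be repeated per phase; comparing the slopes across phases is one way to spot which phase's cost grows fastest with data volume.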

To identify map-side bottlenecks, you should examine the five phases of the map task's execution flow. The following figure represents the map tasks' execution sequence:

Let us see what each...