Optimizing Hadoop for MapReduce

By: Khaled Tannir

Enhancing reduce tasks


Reduce task processing consists of a sequence of three phases. Only the user-defined reduce function is custom code; how long the task takes depends on the amount of data flowing through each phase and on the performance of the underlying Hadoop cluster. Profiling each of these phases helps you identify potential bottlenecks and slow data processing. The following figure shows the three major phases of a reduce task:

Let's look at each phase in more detail:

  • In the Shuffle phase, each reduce task fetches the intermediate data generated by the map tasks, then merges and sorts it. Profiling this phase means measuring the time taken to transfer that data from the map tasks to the reduce task and to merge and sort it. The processing time of this phase depends significantly on the Hadoop configuration parameters and on the amount of intermediate data destined for the reduce task.

  • In the Reduce phase, each reduce task is assigned a partition of the map output...
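To make the idea of per-phase profiling concrete, here is a minimal sketch in plain Java. It assumes you have already obtained start and end timestamps (in milliseconds) for each phase of a reduce task, for example from task logs. The class name PhaseProfile, its methods, and the sample timestamps are all hypothetical and are not part of the Hadoop API; the sketch only shows how phase durations can be compared to find the slowest phase.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not a Hadoop class): accumulates time spent per named phase. */
final class PhaseProfile {
    // Insertion-ordered so phases are reported in the order they were recorded.
    private final Map<String, Long> durations = new LinkedHashMap<>();

    /** Adds the interval [startMs, endMs] to the given phase. */
    void record(String phase, long startMs, long endMs) {
        if (endMs < startMs) {
            throw new IllegalArgumentException("endMs must not precede startMs");
        }
        durations.merge(phase, endMs - startMs, Long::sum);
    }

    /** Time recorded for one phase, or 0 if the phase was never recorded. */
    long durationMs(String phase) {
        return durations.getOrDefault(phase, 0L);
    }

    /** Sum of all recorded phase durations. */
    long totalMs() {
        long total = 0L;
        for (long d : durations.values()) {
            total += d;
        }
        return total;
    }

    /** Fraction of the total time spent in the given phase (0.0 if nothing was recorded). */
    double share(String phase) {
        long total = totalMs();
        return total == 0L ? 0.0 : (double) durationMs(phase) / total;
    }

    /** Name of the phase with the largest duration, or null if nothing was recorded. */
    String slowest() {
        String worst = null;
        long worstMs = -1L;
        for (Map.Entry<String, Long> e : durations.entrySet()) {
            if (e.getValue() > worstMs) {
                worst = e.getKey();
                worstMs = e.getValue();
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        // Sample timestamps are made up for illustration only.
        PhaseProfile profile = new PhaseProfile();
        profile.record("shuffle", 1_000L, 4_000L);
        profile.record("reduce", 4_000L, 9_000L);
        System.out.println("slowest phase: " + profile.slowest());
    }
}
```

A phase that takes a disproportionate share of the total time is the first candidate for tuning, whether through configuration parameters or by reducing the amount of intermediate data.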