The Reduce task


The Reduce task is an aggregation step. If the number of Reduce tasks is not specified, it defaults to one. Running a single Reduce task risks overloading the node on which it executes, while running too many increases shuffle complexity and produces a proliferation of output files that puts pressure on the NameNode. Understanding the data distribution and the partitioning function is important when deciding the optimal number of Reduce tasks.
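To make the link between the partitioning function and the Reduce task count concrete, the following is a minimal sketch modeled on Hadoop's default hash-based partitioning. It shows how each intermediate key is routed to one of the configured Reduce tasks, which is why a skewed key distribution translates directly into skewed reducer load:

import org.apache.hadoop.mapreduce.Partitioner;

// Sketch modeled on the default HashPartitioner: each intermediate
// (key, value) pair is assigned to a reducer based on the key's hash code.
public class SketchHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the modulus result is always non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}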

Tip

Ideally, each Reduce task should process between 1 GB and 5 GB of data.
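For example, if the Map stage of a job emits roughly 100 GB of intermediate data, this guideline suggests configuring somewhere between 20 Reduce tasks (at 5 GB per reducer) and 100 Reduce tasks (at 1 GB per reducer).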

The number of Reduce tasks can be set using the mapreduce.job.reduces parameter. It can also be set programmatically by calling the setNumReduceTasks() method on the Job object. There is a cap on the number of Reduce tasks that can execute concurrently on a single node, given by the mapreduce.tasktracker.reduce.tasks.maximum property.
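As a minimal sketch of both approaches (assuming the Hadoop 2.x MapReduce API and a hypothetical driver class used only for illustration), the reducer count can be fixed either through the configuration or directly on the Job object:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Equivalent to passing -D mapreduce.job.reduces=8 on the command line.
        conf.setInt("mapreduce.job.reduces", 8);

        Job job = Job.getInstance(conf, "reduce-count-example");
        job.setJarByClass(ReducerCountExample.class);

        // Programmatic setting; this takes precedence for this job.
        job.setNumReduceTasks(8);
    }
}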

Note

The heuristic to determine the right number of reducers is as follows:

0.95 * (nodes * mapreduce.tasktracker.reduce.tasks.maximum)

Alternatively...
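As an illustration of the 0.95 heuristic, consider a hypothetical cluster of 10 worker nodes, each allowed a maximum of 4 concurrent Reduce tasks: 0.95 * (10 * 4) ≈ 38 Reduce tasks. This lets all reducers launch in a single wave while leaving a small amount of headroom for failed or speculatively executed tasks.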