Book Image

Hadoop MapReduce v2 Cookbook - Second Edition: RAW

Book Image

Hadoop MapReduce v2 Cookbook - Second Edition: RAW

Overview of this book

Table of Contents (19 chapters)
Hadoop MapReduce v2 Cookbook Second Edition
Credits
About the Author
Acknowledgments
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Hadoop intermediate data partitioning


Hadoop MapReduce partitions the intermediate data generated by the Map tasks across the Reduce tasks of the computations. A proper partitioning function ensuring balanced load for each Reduce task is crucial to the performance of MapReduce computations. Partitioning can also be used to group together related sets of records to specific reduce tasks, where you want certain outputs to be processed or grouped together. The figure in the Introduction section of this chapter depicts where the partitioning fits into the overall MapReduce computation flow.

Hadoop partitions the intermediate data based on the key space of the intermediate data and decides which Reduce task will receive which intermediate record. The sorted set of keys and their values of a partition would be the input for a Reduce task. In Hadoop, the total number of partitions should be equal to the number of Reduce tasks for the MapReduce computation. Hadoop partitioners should extend the org...