Apache Hadoop MapReduce is the most popular implementation of the MapReduce programming paradigm. Coupled with a distributed storage framework in the form of HDFS, it provides a robust system for processing large datasets across a cluster of hundreds or even thousands of nodes.
The Hadoop MapReduce project can be broken down into the following three major components:
The MapReduce API: This includes the set of libraries available to end users for creating their applications. You will use these to write the map and reduce functions to be executed by the framework. The APIs also have provisions to set various configurations for the cluster and its components.
The MapReduce framework: This is the runtime implementation of the various phases involved in executing a MapReduce job, which includes the map phase, the sort/shuffle/merge phase, and the reduce phase. The intricacies of the data flow through these stages form the major part of this component.
The...
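The division of labor between the map and reduce functions, and the shuffle/group step the framework performs between them, can be sketched in plain Java. This is a minimal single-process illustration using word count; the class and method names below are illustrative and are not part of the Hadoop API, whose real Mapper and Reducer classes live in the org.apache.hadoop.mapreduce package and run these phases across a cluster.

```java
import java.util.*;

// A minimal, single-process sketch of the MapReduce data flow:
// map -> shuffle/group -> reduce, using word count as the example.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in a line of input.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: combine all values emitted for one key (here, a sum).
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Drives the full flow: map each line, then shuffle (group values
    // by key, as the framework does between the phases), then reduce.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {brown=1, fox=2, quick=1, the=3}
        System.out.println(wordCount(List.of("the quick brown fox", "the fox")));
    }
}
```

In the real framework, the map and reduce calls run as separate tasks on different nodes, and the grouping step involves sorting and network transfer (shuffle) rather than an in-memory map.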