There are a number of approaches to performing data mining and general computation on big data. One of the most popular is the MapReduce model, which supports general computation on arbitrarily large datasets.
MapReduce originated at Google, where it was developed with distributed computing in mind; it also builds fault tolerance and scalability into the model. The original research on MapReduce was published in 2004, and since then thousands of projects, implementations, and applications have made use of it.
While the concept resembles earlier ideas, particularly from functional programming, MapReduce has become a staple of big data analytics.
There are two major stages in a MapReduce job.
- The first is Map, in which we take a function and a list of items and apply that function to each item. Put another way, we take each item as the input to the function and store the result of that function call (see the sketch after this list).
- The second step is Reduce, where we take the results from the map step and combine them, typically pairwise, using a function that folds them into a single aggregated result (also shown in the sketch below).
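To make the two stages concrete, here is a minimal sketch using Python's built-in `map` and `functools.reduce` on a single in-memory list. The sample data and the squaring and summing functions are illustrative assumptions, not part of any particular MapReduce framework; a real framework applies the same idea across a distributed cluster rather than one list.

```python
from functools import reduce

# A small stand-in for an arbitrarily large dataset.
numbers = [1, 2, 3, 4, 5]

# Map stage: apply a function to every item independently.
# Here the (assumed) function squares each number.
squared = map(lambda x: x * x, numbers)

# Reduce stage: combine the mapped results pairwise into one value.
# Here the (assumed) combining function is addition.
total = reduce(lambda a, b: a + b, squared)

print(total)  # 55, the sum of the squares 1 + 4 + 9 + 16 + 25
```

Because each call in the map stage is independent of the others, that work can be split across many machines; the reduce stage then folds the partial results together. This independence is what gives the model the scalability described above.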