Using MapReduce with Disco
Disco is a Python module, based on the MapReduce framework introduced by Google, that allows the management of large distributed datasets on computer clusters. Applications written with Disco can run on a cluster of inexpensive commodity machines, and the learning curve is very short. The technical difficulties of distributed processing, such as load balancing, job scheduling, and the communication protocol, are managed entirely by Disco and hidden from the developer.
Typical applications of this module include the following:
Web indexing
URL access counter
Distributed sort
The MapReduce algorithm implemented in Disco is as follows:
Map: The master node takes the input data, breaks it into smaller subtasks, and distributes the work to the slave nodes. Each map node produces the intermediate result of the map() function in the form of [key, value] pairs stored on a distributed file whose location is given to the master...
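The map step described above can be sketched in plain Python. This is only an illustration under simplifying assumptions: it runs in a single process, it does not use Disco's own API, and the names split_input, map_words, and run_map_phase are hypothetical helpers introduced here. The word-count mapper is likewise an assumed example workload.

```python
def split_input(lines, num_workers):
    """Master side (simulated): break the input into num_workers subtasks."""
    return [lines[i::num_workers] for i in range(num_workers)]


def map_words(line):
    """Worker side (simulated): emit one (key, value) pair per word."""
    for word in line.split():
        yield word.lower(), 1


def run_map_phase(lines, num_workers=2):
    """Apply map_words to every subtask and collect the intermediate pairs.

    A real cluster would run the subtasks on separate nodes and write the
    pairs to distributed storage; here they are gathered in a local list.
    """
    intermediate = []
    for chunk in split_input(lines, num_workers):
        for line in chunk:
            intermediate.extend(map_words(line))
    return intermediate


pairs = run_map_phase(["Disco runs map", "map emits pairs"])
for key, value in pairs:
    print(key, value)
```

Each word becomes a key with the value 1, so a later step could group the pairs by key and sum the values to obtain per-word counts.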