One of the main advantages of using Hadoop MapReduce is its framework-managed fault tolerance. In a large-scale distributed computation, parts of the computation can fail due to external causes such as network, disk, and node failures. When Hadoop detects an unresponsive or failed task, it re-executes that task on another node.
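The number of re-execution attempts before a job is declared failed is configurable. A minimal `mapred-site.xml` sketch, assuming a Hadoop 2.x (YARN) cluster; the values shown here are the framework defaults and are illustrative:

```xml
<configuration>
  <!-- Maximum attempts per Map task before the job is marked failed (default: 4) -->
  <property>
    <name>mapreduce.map.maxattempts</name>
    <value>4</value>
  </property>
  <!-- Maximum attempts per Reduce task before the job is marked failed (default: 4) -->
  <property>
    <name>mapreduce.reduce.maxattempts</name>
    <value>4</value>
  </property>
</configuration>
```

Raising these limits can help jobs survive transient infrastructure faults, at the cost of letting genuinely broken tasks retry longer before the job fails.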
A Hadoop cluster may consist of heterogeneous nodes, so it can contain very slow nodes as well as fast ones. A few slow nodes, and the tasks executing on them, can dominate the execution time of an entire computation. Hadoop introduces the speculative execution optimization to mitigate these slow-running tasks, which are called stragglers.
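Speculative execution can be toggled independently for Map and Reduce tasks. A minimal `mapred-site.xml` sketch, assuming Hadoop 2.x property names; both properties default to `true`:

```xml
<configuration>
  <!-- Enable speculative (duplicate) execution of straggling Map tasks -->
  <property>
    <name>mapreduce.map.speculative</name>
    <value>true</value>
  </property>
  <!-- Enable speculative (duplicate) execution of straggling Reduce tasks -->
  <property>
    <name>mapreduce.reduce.speculative</name>
    <value>true</value>
  </property>
</configuration>
```

Disabling speculation can be sensible when tasks have side effects (for example, writing to an external system), since duplicate attempts would otherwise run the same work twice.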
When most of the Map (or Reduce) tasks of a computation have completed, Hadoop's speculative execution feature schedules duplicate executions of the remaining slow tasks on available alternate nodes. The slowness of a task is...