A detailed treatment of code-level MapReduce optimization is beyond the scope of this book. In this section, we instead provide basic guidelines and a few rules that can help improve the performance of your MapReduce jobs.
One important characteristic of Hadoop is that all data is processed in units known as records. In theory, if records are all roughly the same size, each should take about the same time to process. In practice, however, the processing time of records within a task varies significantly, and slowness may appear when reading a record from memory, processing the record, or writing the record to memory. Two other factors may also affect mapper or reducer performance: I/O access time and spills, and the waiting overhead that results from heavy I/O requests.
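To see which of the three steps (read, process, or write) dominates, and how unevenly time is spread across records, it can help to accumulate per-phase timings. The following is a minimal sketch in plain Java, not part of Hadoop: the class `PhaseProfiler`, its `Phase` enum, and the `skew()` metric (slowest record time divided by mean record time) are hypothetical names introduced here for illustration. In a real job, you would wrap the corresponding parts of your `map()` or `reduce()` method with `System.nanoTime()` calls and could publish the totals through Hadoop counters.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper (not a Hadoop class): accumulates the time spent
// reading, processing, and writing records within one task.
public class PhaseProfiler {
    public enum Phase { READ, PROCESS, WRITE }

    private final Map<Phase, Long> totalNanos = new EnumMap<>(Phase.class);
    private long allNanos = 0;           // time summed over all phases and records
    private long records = 0;            // number of completed records
    private long currentRecordNanos = 0; // time spent on the record in progress
    private long slowestRecordNanos = 0; // largest per-record time seen so far

    // Record that the current record spent 'nanos' nanoseconds in 'phase'.
    public void add(Phase phase, long nanos) {
        totalNanos.merge(phase, nanos, Long::sum);
        currentRecordNanos += nanos;
        allNanos += nanos;
    }

    // Mark the end of the current record.
    public void endRecord() {
        records++;
        slowestRecordNanos = Math.max(slowestRecordNanos, currentRecordNanos);
        currentRecordNanos = 0;
    }

    public long total(Phase phase) {
        return totalNanos.getOrDefault(phase, 0L);
    }

    // Phase with the largest accumulated time (READ if nothing was recorded).
    public Phase dominantPhase() {
        Phase best = Phase.READ;
        for (Phase p : Phase.values()) {
            if (total(p) > total(best)) {
                best = p;
            }
        }
        return best;
    }

    // Slowest record time divided by mean record time; values well above 1
    // suggest that a few records are much slower than the rest.
    public double skew() {
        if (records == 0 || allNanos == 0) {
            return 0.0;
        }
        double mean = (double) allNanos / records;
        return slowestRecordNanos / mean;
    }

    public static void main(String[] args) {
        PhaseProfiler profiler = new PhaseProfiler();
        String[] input = {"a b c", "d e", "f g h i j"}; // stand-in for real records
        StringBuilder sink = new StringBuilder();        // stand-in for task output

        for (String record : input) {
            long t0 = System.nanoTime();
            String line = record;                        // "read" step
            long t1 = System.nanoTime();
            int tokens = line.split(" ").length;         // "process" step
            long t2 = System.nanoTime();
            sink.append(tokens).append('\n');            // "write" step
            long t3 = System.nanoTime();

            profiler.add(Phase.READ, t1 - t0);
            profiler.add(Phase.PROCESS, t2 - t1);
            profiler.add(Phase.WRITE, t3 - t2);
            profiler.endRecord();
        }
        System.out.println("Dominant phase: " + profiler.dominantPhase());
        System.out.println("Skew (slowest/mean): " + profiler.skew());
    }
}
```

The timings printed by `main` depend on the machine and will differ from run to run; what matters is the relative weight of each phase and whether the skew stays close to 1.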
MapReduce provides ease of use while a programmer defines his job with only...