Optimizing Hadoop for MapReduce

By: Khaled Tannir

The MapReduce model

MapReduce is a programming model for processing unstructured data and generating large datasets using large clusters of commodity hardware. It can process many terabytes of data on thousands of computing nodes in a cluster while handling failures, duplicating tasks where needed, and aggregating results.

The MapReduce model is simple to understand. It was designed in the early 2000s by engineers at Google Research. It consists of two functions, a map function and a reduce function, that can be executed in parallel on multiple machines.

To use MapReduce, the programmer writes a user-defined map function and a user-defined reduce function that express the desired computation. The map function reads a key/value pair, applies the user-specific code, and produces results called intermediate results. These intermediate results are then aggregated by the user-specific reduce code, which outputs the final results.

Input to a MapReduce application is organized into records according to the input specification; each record yields a <k1, v1> key/value pair.

Therefore, the MapReduce process consists of two main phases:

  • map(): The user-defined map function is applied to all input records one by one, and for each record it outputs a list of zero or more intermediate key/value pairs, that is, <k2, v2> records. Then all <k2, v2> records are collected and reorganized so that records with the same keys (k2) are put together into a <k2, list(v2)> record.

  • reduce(): The user-defined reduce function is called once for each distinct key in the map output, <k2, list(v2)> records, and for each record the reduce function outputs zero or more <k2, v3> pairs. All <k2, v3> pairs together coalesce into the final result.


    The signatures of the map and reduce functions are as follows:

    • map(<k1, v1>) → list(<k2, v2>)

    • reduce(<k2, list(v2)>) → <k2, v3>
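These two phases and their signatures can be sketched with a minimal in-memory model (an illustrative, framework-free sketch in Python; `run_mapreduce` is a hypothetical helper, not a Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    # map(): apply the user-defined map function to every <k1, v1> record,
    # collecting the <k2, v2> intermediate pairs it emits
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))
    # shuffle: reorganize the pairs so records with the same key k2
    # are put together into a <k2, list(v2)> record
    intermediate.sort(key=itemgetter(0))
    grouped = ((k2, [v2 for _, v2 in group])
               for k2, group in groupby(intermediate, key=itemgetter(0)))
    # reduce(): call the user-defined reduce function once per distinct
    # key, producing the final <k2, v3> results
    return [reduce_fn(k2, values) for k2, values in grouped]
```

For example, `run_mapreduce([(0, "a b a")], lambda k1, v1: [(w, 1) for w in v1.split()], lambda k2, v2s: (k2, sum(v2s)))` returns `[("a", 2), ("b", 1)]`.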

The MapReduce programming model is designed to be independent of storage systems. MapReduce reads key/value pairs from the underlying storage system through a reader. The reader retrieves each record from the storage system and wraps the record into a key/value pair for further processing. Users can add support for a new storage system by implementing a corresponding reader. This storage-independent design is considered to be beneficial for heterogeneous systems since it enables MapReduce to analyze data stored in different storage systems.
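A reader's role can be illustrated with a small sketch (a hypothetical line-oriented reader, not Hadoop's `RecordReader` interface), which wraps each line of a text source into a key/value pair keyed by the line's starting offset — the same convention Hadoop's text input format follows:

```python
import io

def line_record_reader(stream):
    # Retrieve each record (line) from the underlying source and wrap
    # it into a <k1, v1> pair: key = the line's starting offset in the
    # stream, value = the line's text without its trailing newline.
    offset = 0
    for line in stream:
        yield offset, line.rstrip("\n")
        offset += len(line)
```

For example, `list(line_record_reader(io.StringIO("foo\nbar\n")))` yields `[(0, "foo"), (4, "bar")]`. Supporting a different storage system would only require a different reader that produces the same kind of pairs.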

To understand the MapReduce programming model, let's assume you want to count the number of occurrences of each word in a given input file. Translated into a MapReduce job, the word-count job is defined by the following steps:

  1. The input data is split into records.

  2. Map functions process these records and produce key/value pairs for each word.

  3. All key/value pairs that are output by the map function are merged together, grouped by a key, and sorted.

  4. The intermediate results are transmitted to the reduce function, which will produce the final output.

The overall steps of this MapReduce application are represented in the following diagram:
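The four steps of the word-count job can also be sketched end to end in plain Python (a framework-free illustration of the model, not actual Hadoop code):

```python
from collections import defaultdict

def word_count(text):
    # Step 1: split the input data into records (here, one line each)
    records = text.splitlines()
    # Step 2: map -- emit a <word, 1> pair for every word in a record
    pairs = [(word, 1) for line in records for word in line.split()]
    # Step 3: merge the pairs output by the map step, grouped by key,
    # and sort the keys
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    # Step 4: reduce -- sum each group to produce the final output
    return {word: sum(ones) for word, ones in sorted(groups.items())}
```

For example, `word_count("to be or not to be")` produces `{"be": 2, "not": 1, "or": 1, "to": 2}`.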

While key/value pairs are being aggregated, a massive amount of disk I/O and network traffic can be observed. To reduce the amount of network traffic required between the map and reduce steps, the programmer can optionally perform a map-side pre-aggregation by supplying a Combiner function. Combiner functions are similar to the reduce function, except that they are not passed all the values for a given key; instead, a Combiner function emits an output value that summarizes the input values it was passed.
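The effect of a Combiner can be shown with a small sketch (a hypothetical `combine` helper, not Hadoop's `Combiner` interface), which collapses one mapper's local pairs before they cross the network:

```python
from collections import Counter

def combine(map_output):
    # Map-side pre-aggregation: collapse this mapper's local <k2, v2>
    # pairs into one <k2, partial_sum> pair per key, so far fewer
    # records have to travel over the network to the reducers.
    combined = Counter()
    for key, value in map_output:
        combined[key] += value
    return sorted(combined.items())
```

For example, `combine([("to", 1), ("be", 1), ("to", 1), ("be", 1)])` returns `[("be", 2), ("to", 2)]`, shrinking four shuffled pairs to two; the reduce function then sums these partial sums exactly as it would have summed the raw ones.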