Instant MapReduce Patterns - Hadoop Essentials How-to

By: Liyanapathirannahelage H Perera
Overview of this book

MapReduce is a technology that enables users to process large datasets, and Hadoop is an implementation of MapReduce. More and more data is becoming available, and it hides many insights that might hold the key to success or failure. MapReduce gives us the ability to write code to analyze and process this data.

Instant MapReduce Patterns – Hadoop Essentials How-to is a concise introduction to Hadoop and programming with MapReduce. It is aimed at getting you started and giving you an overall feel for programming with Hadoop, so that you will have a well-grounded foundation to understand and solve your MapReduce problems as needed.

Instant MapReduce Patterns – Hadoop Essentials How-to starts with the configuration of Hadoop before moving on to writing simple examples and discussing MapReduce programming patterns. We start simply, by installing Hadoop and writing a word count program. After that, we deal with seven styles of MapReduce programs: analytics, set operations, cross correlation, search, graphs, joins, and clustering. For each case, you will learn the pattern and create a representative example program. The book also provides additional pointers to further enhance your Hadoop skills.

Simple graph operations with MapReduce (Advanced)


Graphs are another type of data that we often encounter. One of the primary use cases for graphs is social networking; people want to search graphs for interesting patterns. This recipe explains how to perform a simple graph operation, graph traversal, using MapReduce.

This recipe uses the results from the Cross correlation with MapReduce (Intermediate) recipe. Each buyer is a node, and if two buyers have bought the same item, there is an edge between these nodes.

A sample input is shown as follows:

AR1T36GLLAFFX A26TSW6AI59ZCV,A39LRCAB9G8F21,ABT9YLRGT4ISP|Gray

Here the first token is the node ID, and the comma-separated values are the list of nodes to which the first node has an edge. The last value is the color of the node, which is a construct we use for the graph traversal algorithm.
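For illustration, the following sketch shows one way a line in this format could be split into its node ID, edge list, and color. The class name and the string-splitting approach here are ours for illustration only; the book's sample code parses these lines with a regular expression (the parsingPattern used in the listings later in this recipe), which may differ in detail.

public class GraphLineParseSketch {
    public static void main(String[] args) {
        // A sample line: node ID, a space, the comma-separated edges, and the color.
        String line = "AR1T36GLLAFFX A26TSW6AI59ZCV,A39LRCAB9G8F21,ABT9YLRGT4ISP|Gray";

        String[] idAndRest = line.split(" ", 2);        // node ID, then "edges|color"
        String nodeId = idAndRest[0];
        String[] edgesAndColor = idAndRest[1].split("\\|");
        String[] edges = edgesAndColor[0].split(",");   // IDs of adjacent nodes
        String color = edgesAndColor[1];                // White, Gray, or Black

        System.out.println(nodeId + " -> " + String.join(",", edges) + " [" + color + "]");
    }
}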

Given a buyer (a node), this recipe walks through the graph and calculates the distance from the given node to all other nodes.

This recipe and the next recipe belong to a class called iterative MapReduce where we cannot solve the problem by processing data once. Iterative MapReduce processes the data many times using a MapReduce job until we have calculated the distance from the given node to all other nodes.

Getting ready

  1. This recipe assumes that you have installed Hadoop and started it. Refer to the Writing a word count application using Java (Simple) and Installing Hadoop in a distributed setup and running a word count application (Simple) recipes for more information. We will use HADOOP_HOME to refer to the Hadoop installation directory.

  2. This recipe assumes you are aware of how Hadoop processing works. If you have not already done so, you should follow the Writing a word count application with MapReduce and running it (Simple) recipe.

  3. Download the sample code for the chapter and download the data files as described in the Writing a word count application with MapReduce and running it (Simple) recipe. Select a subset of the Amazon dataset if you are running this recipe with only a few computers. You can find the smaller dataset in the sample directory.

How to do it...

  1. Change directory to HADOOP_HOME and copy the hadoop-microbook.jar file from SAMPLE_DIR to HADOOP_HOME.

  2. Upload the Amazon dataset to the HDFS filesystem using the following commands from HADOOP_HOME, if you have not already done so:

    > bin/hadoop dfs -mkdir /data/
    > bin/hadoop dfs -mkdir /data/amazon-dataset
    > bin/hadoop dfs -put <DATA_DIR>/amazon-meta.txt /data/amazon-dataset/
    > bin/hadoop dfs -ls /data/amazon-dataset
    
  3. Run the following command to generate the graph:

    > bin/hadoop jar hadoop-microbook.jar  microbook.graph.GraphGenerator /data/amazon-dataset /data/graph-output1
    
  4. Run the following command to run the MapReduce job that calculates the graph distance:

    > bin/hadoop jar hadoop-microbook.jar microbook.graph.SimilarItemsFinder /data/graph-output1 /data/graph-output2
    
  5. You can find the results at /data/graph-output2. It will have results for all iterations, and you should look at the last iteration.

How it works...

You can find the mapper and reducer code at src/microbook/SimilarItemsFinder.java.

The preceding figure shows the execution of the MapReduce jobs and the driver code. The driver code repeats the MapReduce job until the graph traversal is complete.

The algorithm operates by coloring the graph nodes. Each node is colored white at the start, except for the node where we start the traversal, which is marked gray. When we generate the graph, the code will mark that node as gray. If you need to change the starting node, you can do so by editing the graph.

As shown in the figure, at each step, the MapReduce job processes the nodes marked gray, calculates the distance to the nodes connected to them via an edge, and updates those distances. It also marks an adjacent node gray if its current color is white. Finally, after visiting all of a node's children and marking them gray, it sets the node's color to black. At the next step, the job visits the nodes that are now marked gray, and it continues in this manner until all the nodes have been visited.
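The map and reduce functions that follow operate on a small helper class, GNode, which holds a node's ID, its edges, its color, and its current minimum distance from the starting node. The following is only a rough sketch of what such a class could look like, assuming a serialized form of edges|color|distance; the actual GNode class in the book's sample code (hadoop-microbook.jar) may differ in its fields and format.

// A rough sketch of a GNode-like helper; the real class may differ.
enum Color { White, Gray, Black, Red }

class GNode {
    String id;
    String[] edges;                        // IDs of adjacent nodes
    Color color = Color.White;
    int minDistance = Integer.MAX_VALUE;   // distance from the starting node

    // Build a node from its ID and an explicit edge list (possibly null).
    GNode(String id, String[] edges) {
        this.id = id;
        this.edges = edges;
    }

    // Build a node from its ID and its serialized value,
    // for example "A26TSW6AI59ZCV,A39LRCAB9G8F21|Gray|2" (assumed format).
    GNode(String id, String value) {
        this.id = id;
        String[] parts = value.split("\\|");
        this.edges = parts[0].isEmpty() ? new String[0] : parts[0].split(",");
        if (parts.length > 1) this.color = Color.valueOf(parts[1]);
        if (parts.length > 2) this.minDistance = Integer.parseInt(parts[2]);
    }

    // Serialize back to the same format, so the output of one iteration
    // can be read as the input of the next.
    public String toString() {
        String edgeList = (edges == null) ? "" : String.join(",", edges);
        return edgeList + "|" + color + "|" + minDistance;
    }
}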

The following code listing shows the map function of the MapReduce job:

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    Matcher matcher = parsingPattern.matcher(value.toString());
    if (matcher.find()) {
        String id = matcher.group(1);
        String val = matcher.group(2);

        GNode node = new GNode(id, val);
        if (node.color == Color.Gray) {
            // Visit the gray node: mark it black and emit it.
            node.color = Color.Black;
            context.write(new Text(id), new Text(node.toString()));
            // Emit each adjacent node with an updated distance, colored red,
            // so the reducer knows it was reached in this iteration.
            for (String e : node.edges) {
                GNode nNode = new GNode(e, (String[]) null);
                nNode.minDistance = node.minDistance + 1;
                nNode.color = Color.Red;
                context.write(new Text(e), new Text(nNode.toString()));
            }
        } else {
            // White and black nodes pass through unchanged.
            context.write(new Text(id), new Text(val));
        }
    } else {
        System.out.println("Unprocessed Line " + value);
    }
}

As shown by the figure, Hadoop reads the input file from the input folder and parses it into records using the custom formatter we introduced in the Write a formatter (Intermediate) recipe. It invokes the mapper once for each record, passing the record as input.

Each record includes a node. If the node is not colored gray, the mapper emits it unchanged, using the node ID as the key.

If the node is colored gray, the mapper marks it black and emits it, and then explores all the edges connected to the node, emitting each adjacent node with its distance set to the current node's distance + 1 and its color set to red. In each case, it emits the node ID as the key and the node information as the value. The reduce function of the job is shown in the following listing:

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    GNode originalNode = null;
    boolean hasRedNodes = false;
    int minDistance = Integer.MAX_VALUE;
    for (Text val : values) {
        GNode node = new GNode(key.toString(), val.toString());
        if (node.color == Color.Black || node.color == Color.White) {
            // This value carries the node itself, with its edge list.
            originalNode = node;
        } else if (node.color == Color.Red) {
            // This value is a distance update emitted for a gray node's neighbor.
            hasRedNodes = true;
        }
        if (minDistance > node.minDistance) {
            minDistance = node.minDistance;
        }
    }
    if (originalNode != null) {
        originalNode.minDistance = minDistance;
        // A white node that was reached in this iteration becomes gray,
        // so it will be expanded in the next iteration.
        if (originalNode.color == Color.White && hasRedNodes) {
            originalNode.color = Color.Gray;
        }
        context.write(key, new Text(originalNode.toString()));
    }
}

MapReduce sorts the key-value pairs by key and invokes the reducer once for each unique key, passing the list of values emitted against that key as input.

Each reducer receives the node information together with the distances calculated by the mappers when they encountered that node. The reducer updates the node's distance if any of the received distances is less than its current distance. Then, it emits the node ID as the key and the node information as the value.

The driver repeats the process until all the nodes are marked black and their distances have been updated. At the start, only one node is marked gray and all the others are white. At each execution, the MapReduce job marks the nodes connected to the gray nodes as gray, updates their distances, and marks the nodes it has just visited as black.

We continue this until all nodes are marked as black and have updated distances.
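The driver code itself is not shown above. One common way to implement such an iterative driver is to resubmit the job repeatedly, feeding each iteration's output directory in as the next iteration's input, and to stop when a job counter reports that no new gray nodes were produced. The following is only a sketch under those assumptions: the class and counter names (IterativeDriverSketch, GRAY_NODES) and the stub mapper and reducer classes are hypothetical, and the actual driver in microbook.graph.SimilarItemsFinder may be structured differently.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriverSketch {

    // Hypothetical stand-ins for the classes holding the map and reduce
    // functions shown earlier in this recipe.
    public static class TraversalMapper extends Mapper<Object, Text, Text, Text> { }
    public static class TraversalReducer extends Reducer<Text, Text, Text, Text> { }

    public static void main(String[] args) throws Exception {
        String input = args[0];        // for example, /data/graph-output1
        String outputBase = args[1];   // for example, /data/graph-output2
        int iteration = 0;
        long grayNodes;
        do {
            Job job = Job.getInstance(new Configuration(), "graph-traversal-" + iteration);
            job.setJarByClass(IterativeDriverSketch.class);
            job.setMapperClass(TraversalMapper.class);
            job.setReducerClass(TraversalReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(input));
            String output = outputBase + "/iteration-" + iteration;
            FileOutputFormat.setOutputPath(job, new Path(output));
            job.waitForCompletion(true);

            // Stop once the reducers stop producing new gray nodes; this assumes
            // the reducer increments a "GRAY_NODES" counter whenever it colors
            // a node gray (the listing above does not show such a counter).
            grayNodes = job.getCounters()
                    .findCounter("traversal", "GRAY_NODES").getValue();

            input = output;            // this iteration's output feeds the next one
            iteration++;
        } while (grayNodes > 0);
    }
}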

There's more...

You can apply the iterative MapReduce-based solution discussed in this recipe to many other graph algorithms, such as graph search.