Instant MapReduce Patterns - Hadoop Essentials How-to

By: Liyanapathirannahelage H Perera

Overview of this book

MapReduce is a technology that enables users to process large datasets, and Hadoop is an implementation of MapReduce. More and more data is becoming available, and hidden within it are insights that might hold the key to success or failure. MapReduce gives you the ability to write code that analyzes and processes this data.

Instant MapReduce Patterns – Hadoop Essentials How-to is a concise introduction to Hadoop and programming with MapReduce. It aims to get you started and give you an overall feel for programming with Hadoop, so that you will have a well-grounded foundation to understand and solve your MapReduce problems as needed.

The book starts with the configuration of Hadoop before moving on to writing simple examples and discussing MapReduce programming patterns. We start by installing Hadoop and writing a word count program. After that, we deal with seven styles of MapReduce programs: analytics, set operations, cross correlation, search, graph, joins, and clustering. For each case, you will learn the pattern and create a representative example program. The book also provides additional pointers to further enhance your Hadoop skills.

Simple search with MapReduce (Intermediate)


Text search is one of the first use cases for MapReduce; according to Google, they built MapReduce as the programming model for the text processing behind their search platform.

Search is generally implemented with an inverted index. An inverted index is a mapping from words to the data items that include those words. Given a search query, we find all documents that contain the words in the query. One of the complexities of web search is that there are too many results and we only need to show the most important ones. However, ranking the documents based on their importance is out of the scope of this discussion.
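To make the idea concrete, the following is a minimal, self-contained sketch of an inverted index in plain Java. The three toy documents are hypothetical and are not part of the Amazon dataset used later in this recipe.

import java.util.*;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        // Map each word to the list of documents whose titles contain it.
        Map<String, List<String>> index = new HashMap<String, List<String>>();
        String[][] docs = {
            {"doc1", "Cycling for Beginners"},
            {"doc2", "Advanced Cycling Maps"},
            {"doc3", "Cooking Basics"}
        };
        for (String[] doc : docs) {
            for (String word : doc[1].split("\\s+")) {
                List<String> postings = index.get(word);
                if (postings == null) {
                    postings = new ArrayList<String>();
                    index.put(word, postings);
                }
                postings.add(doc[0]);
            }
        }
        // A query for "Cycling" returns [doc1, doc2] without scanning any document.
        System.out.println(index.get("Cycling"));
    }
}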

This recipe explains how to build a simple inverted index based search using MapReduce.

Getting ready

  1. This recipe assumes that you have installed Hadoop and started it. Refer to the Writing a word count application using Java (Simple) and Installing Hadoop in a distributed setup and running a word count application (Simple) recipes for more information. We will use HADOOP_HOME to refer to the Hadoop installation directory.

  2. This recipe assumes you are aware of how Hadoop processing works. If you have not already done so, you should follow the Writing a word count application with MapReduce and running it (Simple) recipe.

  3. Download the sample code for the chapter and download the data files as described in the Writing a word count application with MapReduce and running it (Simple) recipe. Select a subset of the Amazon dataset if you are running this with only a few computers. You can find a smaller dataset in the sample directory.

How to do it...

  1. If you have not already done so, upload the Amazon dataset to the HDFS filesystem using the following commands from HADOOP_HOME:

    $ bin/hadoop dfs -mkdir /data/
    $ bin/hadoop dfs -mkdir /data/amazon-dataset
    $ bin/hadoop dfs -put <DATA_DIR>/amazon-meta.txt /data/amazon-dataset/
    
  2. Copy the hadoop-microbook.jar file from SAMPLE_DIR to HADOOP_HOME.

  3. Run the MapReduce job to build the inverted index. To do that, run the following command from HADOOP_HOME:

    $ bin/hadoop jar hadoop-microbook.jar microbook.search.TitleInvertedIndexGenerator /data/amazon-dataset /data/search-output
    
  4. You can find the results in the output directory, /data/search-output (see the sanity-check commands after these steps if you want to verify the upload and inspect the output).

  5. Run the following commands to download the results file from the server and to search for the word Cycling using the index built by the MapReduce job. It should print the items whose titles contain the word Cycling.

    $ bin/hadoop dfs -get /data/search-output/part-r-00000 invertedIndex.data
    
    $ java -cp hadoop-microbook.jar microbook.search.IndexBasedTitleSearch invertedIndex.data Cycling
    
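If you want to sanity-check the run, the following commands list the dataset uploaded in step 1 and page through the output produced in step 4; the exact listing will depend on your setup:

    $ bin/hadoop dfs -ls /data/amazon-dataset
    $ bin/hadoop dfs -cat /data/search-output/part-r-00000 | more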

How it works...

You can find the mapper and reducer code at src/microbook/search/TitleInvertedIndexGenerator.java.

The following code listing shows the map function and the reduce function of the index-generation job.

When the job runs, Hadoop reads the input file from the input folder and parses it into records using the custom formatter we introduced in the Write a formatter (Intermediate) recipe. It invokes the mapper once per record, passing the record as input.

The map function reads the title of each item in the record, tokenizes it, and emits each word in the title as the key, with the item's zero-padded sales rank and item ID as the value.

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    List<BuyerRecord> records =
            BuyerRecord.parseAItemLine(value.toString());
    for (BuyerRecord record : records) {
        for (ItemData item : record.itemsBrought) {
            // Tokenize the item title into words.
            StringTokenizer itr = new StringTokenizer(item.title);
            while (itr.hasMoreTokens()) {
                // Strip any non-alphanumeric characters from the token.
                String token = itr.nextToken().replaceAll("[^A-Za-z0-9]", "");
                if (token.length() > 0) {
                    // Key: the word; value: zero-padded sales rank and item ID.
                    context.write(new Text(token),
                            new Text(pad(String.valueOf(item.salesrank))
                                    + "#" + item.itemID));
                }
            }
        }
    }
}

MapReduce will sort the key-value pairs by the key and invoke the reducer once for each unique key, passing the list of values emitted against that key as the input.
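To see what this means, here is a minimal sketch in plain Java (not Hadoop code) of the sort-and-group step; the emitted pairs are hypothetical examples of the <paddedSalesRank>#<itemID> values produced by the mapper above:

import java.util.*;

public class ShuffleSketch {
    public static void main(String[] args) {
        // Pairs as they might be emitted by the mappers.
        String[][] emitted = {
            {"Cycling", "0000001234#0827229534"},
            {"Maps",    "0000000042#0771044445"},
            {"Cycling", "0000009876#0446611867"}
        };
        // A TreeMap keeps the keys sorted, mirroring MapReduce's sort phase.
        Map<String, List<String>> grouped = new TreeMap<String, List<String>>();
        for (String[] pair : emitted) {
            List<String> values = grouped.get(pair[0]);
            if (values == null) {
                values = new ArrayList<String>();
                grouped.put(pair[0], values);
            }
            values.add(pair[1]);
        }
        // Each entry corresponds to one reduce() invocation, for example:
        // {Cycling=[...#0827229534, ...#0446611867], Maps=[...#0771044445]}
        System.out.println(grouped);
    }
}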

Each reducer receives a word as the key and the list of entries emitted against that word as the values, and it writes them out as a single comma-separated list. The output is an inverted index.

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    // Collect the values into a sorted set, which also removes duplicates.
    TreeSet<String> set = new TreeSet<String>();
    for (Text valtemp : values) {
        set.add(valtemp.toString());
    }

    // Concatenate the entries into one comma-separated list.
    StringBuffer buf = new StringBuffer();
    for (String val : set) {
        buf.append(val).append(",");
    }
    context.write(key, new Text(buf.toString()));
}
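With the default TextOutputFormat, each line of the job's output file holds the key, a tab, and the value. A line of the inverted index might therefore look like the following, where the sales ranks and item IDs are hypothetical; note the trailing comma left by the reducer's append loop:

Cycling	0000001234#0827229534,0000009876#0446611867,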

The following listing gives the code for the search program. The search program loads the inverted index into memory, and when we search for a word, it finds the item IDs recorded against that word and lists them.
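The snippet refers to a reader (br), a parsing pattern (parsingPattern), and an in-memory map (invertedIndex) that the program sets up beforehand. A plausible setup, assuming the index was downloaded as invertedIndex.data and the default key-tab-value output format, is shown below; the exact pattern used in the book's source may differ:

// Requires java.io.*, java.util.*, and java.util.regex.*.
BufferedReader br = new BufferedReader(new FileReader("invertedIndex.data"));
// TextOutputFormat writes "key<TAB>value"; capture the word and the value list.
Pattern parsingPattern = Pattern.compile("^(\\S+)\\s+(\\S+)$");
Map<String, String[]> invertedIndex = new HashMap<String, String[]>();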

String line = br.readLine();
while (line != null) {
    Matcher matcher = parsingPattern.matcher(line);
    if (matcher.find()) {
        String key = matcher.group(1);   // the word
        String value = matcher.group(2); // comma-separated entries

        String[] tokens = value.split(",");
        invertedIndex.put(key, tokens);
    }
    // Read the next line outside the if block; otherwise a line that
    // does not match the pattern would make the loop spin forever.
    line = br.readLine();
}

String searchQuery = "Cycling";
String[] tokens = invertedIndex.get(searchQuery);
if (tokens != null) {
    for (String token : tokens) {
        // Each entry has the form <paddedSalesRank>#<itemID>.
        System.out.println(Arrays.toString(token.split("#")));
        // Print just the item ID.
        System.out.println(token.split("#")[1]);
    }
}

There's more...

Indexes let us find data in a large dataset quickly, without scanning the whole dataset for every query. The same pattern is useful whenever you need to build an index to support fast searches.