Book Image

Instant MapReduce Patterns - Hadoop Essentials How-to

By : Liyanapathirannahelage H Perera
Book Image

Instant MapReduce Patterns - Hadoop Essentials How-to

By: Liyanapathirannahelage H Perera

Overview of this book

MapReduce is a technology that enables users to process large datasets and Hadoop is an implementation of MapReduce. We are beginning to see more and more data becoming available, and this hides many insights that might hold key to success or failure. However, MapReduce has the ability to analyze this data and write code to process it.Instant MapReduce Patterns – Hadoop Essentials How-to is a concise introduction to Hadoop and programming with MapReduce. It is aimed to get you started and give you an overall feel for programming with Hadoop so that you will have a well-grounded foundation to understand and solve all of your MapReduce problems as needed.Instant MapReduce Patterns – Hadoop Essentials How-to will start with the configuration of Hadoop before moving on to writing simple examples and discussing MapReduce programming patterns.We will start simply by installing Hadoop and writing a word count program. After which, we will deal with the seven styles of MapReduce programs: analytics, set operations, cross correlation, search, graph, Joins, and clustering. For each case, you will learn the pattern and create a representative example program. The book also provides you with additional pointers to further enhance your Hadoop skills.
Table of Contents (7 chapters)

Analytics – drawing a frequency distribution with MapReduce (Intermediate)

Often, we use Hadoop to calculate analytics, which are basic statistics about data. In such cases, we walk through the data using Hadoop and calculate interesting statistics about the data. Some of the common analytics are show as follows:

  • Calculating statistical properties like minimum, maximum, mean, median, standard deviation, and so on of a dataset. For a dataset, generally there are multiple dimensions (for example, when processing HTTP access logs, names of the web page, the size of the web page, access time, and so on, are few of the dimensions). We can measure the previously mentioned properties by using one or more dimensions. For example, we can group the data into multiple groups and calculate the mean value in each case.

  • Frequency distributions histogram counts the number of occurrences of each item in the dataset, sorts these frequencies, and plots different items as X axis and frequency as Y axis.

  • Finding a correlation between two dimensions (for example, correlation between access count and the file size of web accesses).

  • Hypothesis testing: To verify or disprove a hypothesis using a given dataset.

However, Hadoop will only generate numbers. Although the numbers contain all the information, we humans are very bad at figuring out overall trends by just looking at numbers. On the other hand, the human eye is remarkably good at detecting patterns, and plotting the data often yields us a deeper understanding of the data. Therefore, we often plot the results of Hadoop jobs using some plotting program.

This recipe will explain how to use MapReduce to calculate frequency distribution of the number of items brought by each customer. Then we will use gnuplot, a free and powerful, plotting program to plot results from the Hadoop job.

Getting ready

  1. This recipe assumes that you have access to a computer that has Java installed and the JAVA_HOME variable configured.

  2. Download a Hadoop distribution 1.1.x from page.

  3. Unzip the distribution, we will call this directory HADOOP_HOME.

  4. Download the sample code for the chapter and copy the data files as described in the Writing a word count application using Java (Simple) recipe.

How to do it...

  1. If you have not already done so, let us upload the amazon dataset to the HDFS filesystem using the following commands:

    >bin/hadoopdfs -mkdir /data/
    >bin/hadoopdfs -mkdir /data/amazon-dataset
    >bin/hadoopdfs -put <SAMPLE_DIR>/amazon-meta.txt /data/amazon-dataset/
    >bin/hadoopdfs -ls /data/amazon-dataset
  2. Copy the hadoop-microbook.jar file from SAMPLE_DIR to HADOOP_HOME.

  3. Run the first MapReduce job to calculate the buying frequency. To do that run the following command from HADOOP_HOME:

    $ bin/hadoop jar hadoop-microbook.jar  microbook.frequency.BuyingFrequencyAnalyzer/data/amazon-dataset /data/frequency-output1
  4. Use the following command to run the second MapReduce job to sort the results of the first MapReduce job:

    $ bin/hadoop jar hadoop-microbook.jar  microbook.frequency.SimpleResultSorter /data/frequency-output1 frequency-output2
  5. You can find the results from the output directory. Copy results to HADOOP_HOME using the following command:

    $ bin/Hadoop dfs -get /data/frequency-output2/part-r-00000
  6. Copy all the *.plot files from SAMPLE_DIR to HADOOP_HOME.

  7. Generate the plot by running the following command from HADOOP_HOME.

    $gnuplot buyfreq.plot
  8. It will generate a file called buyfreq.png, which will look like the following:

As the figure depicts, few buyers have brought a very large number of items. The distribution is much steeper than normal distribution, and often follows what we call a Power Law distribution. This is an example that analytics and plotting results would give us insight into, underlying patterns in the dataset.

How it works...

You can find the mapper and reducer code at src/microbook/frequency/

This figure shows the execution of two MapReduce jobs. Also the following code listing shows the map function and the reduce function of the first job:

public void map(Object key, Text value, Context context
                ) throwsIOException, InterruptedException {
    List<BuyerRecord> records =   
    for(BuyerRecord record: records){
        context.write(new Text(record.customerID), 
           new IntWritable(record.itemsBrought.size()));

public void reduce(Text key, Iterable<IntWritable> values, 
                   Context context) {
    int sum = 0;
    for (IntWritableval : values) {
        sum += val.get();
    context.write(key, result);

As shown by the figure, Hadoop will read the input file from the input folder and read records using the custom formatter we introduced in the Writing a formatter (Intermediate) recipe. It invokes the mapper once per each record, passing the record as input.

The mapper extracts the customer ID and the number of items the customer has brought, and emits the customer ID as the key and number of items as the value.

Then, Hadoop sorts the key-value pairs by the key and invokes a reducer once for each key passing all values for that key as inputs to the reducer. Each reducer sums up all item counts for each customer ID and emits the customer ID as the key and the count as the value in the results.

Then the second job sorted the results. It reads output of the first job as the result and passes each line as argument to the map function. The map function extracts the customer ID and the number of items from the line and emits the number of items as the key and the customer ID as the value. Hadoop will sort the key-value pairs by the key, thus sorting them by the number of items, and invokes the reducer once per key in the same order. Therefore, the reducer prints them out in the same order essentially sorting the dataset.

Since we have generated the results, let us look at the plotting. You can find the source for the gnuplot file from buyfreq.plot. The source for the plot will look like the following:

set terminal png
set output "buyfreq.png"

set title "Frequency Distribution of Items brought by Buyer";
setylabel "Number of Items Brought";
setxlabel "Buyers Sorted by Items count";
set key left top
set log y
set log x

plot "" using 2 title "Frequency" with linespoints

Here the first two lines define the output format. This example uses png, but gnuplot supports many other terminals such as screen, pdf, and eps. The next four lines define the axis labels and the title, and the next two lines define the scale of each axis, and this plot uses log scale for both.

The last line defines the plot. Here, it is asking gnuplot to read the data from the file, and to use the data in the second column of the file via using 2, and to plot it using lines. Columns must be separated by whitespaces.

Here if you want to plot one column against another, for example data from column 1 against column 2, you should write using 1:2 instead of using 2.

There's more...

We can use a similar method to calculate the most types of analytics and plot the results. Refer to the freely available article of Hadoop MapReduce Cookbook, Srinath Perera and Thilina Gunarathne, Packt Publishing at for more information.