Learning Hadoop 2

Writing MapReduce programs

In this chapter, we will be focusing on batch workloads; given a set of historical data, we will look at properties of that dataset. In Chapter 4, Real-time Computation with Samza, and Chapter 5, Iterative Computation with Spark, we will show how a similar type of analysis can be performed over a stream of text collected in real time.

Getting started

In the following examples, we will assume a dataset of 1,000 tweets collected with the script shown in Chapter 1, Introduction:

$ python -t -n 1000 > tweets.txt

We can then copy the dataset into HDFS with:

$ hdfs dfs -put tweets.txt <destination>
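
If the upload needs to be repeated from a script rather than typed by hand, the same command can be wrapped in a few lines of Python. The following is only a minimal sketch: the helper names `build_put_command` and `put_to_hdfs` are hypothetical and not part of the book's code, and it assumes the `hdfs` client is on the `PATH` of the machine running it.

```python
import subprocess


def build_put_command(local_path, destination, overwrite=False):
    """Build the argument list for `hdfs dfs -put` (hypothetical helper)."""
    cmd = ["hdfs", "dfs", "-put"]
    if overwrite:
        # -f replaces the destination file if it already exists.
        cmd.append("-f")
    cmd.extend([local_path, destination])
    return cmd


def put_to_hdfs(local_path, destination, overwrite=False):
    """Copy a local file into HDFS.

    Assumes the `hdfs` client is installed and configured; raises
    subprocess.CalledProcessError if the command exits with an error.
    """
    subprocess.run(
        build_put_command(local_path, destination, overwrite), check=True
    )
```

Calling `put_to_hdfs("tweets.txt", destination)` with your own destination path runs the same command as above. Keeping the command construction separate from its execution makes it possible to inspect the argument list without a running cluster.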


Note that until now we have been working only with the text of tweets. In the remainder of this book, we'll extend the script to output additional tweet metadata in JSON format. Keep this in mind before dumping terabytes of messages with

Our first MapReduce program will be the canonical WordCount example. A variation of...