In the previous recipe, we implemented a simple MapReduce job using the Java API of Hadoop. The use case was the same as the one in the recipes of Chapter 3, Programming Language Drivers, where we saw MapReduce implemented using Mongo client APIs in Python and Java. In this recipe, we will use Hadoop streaming to implement MapReduce jobs.
Streaming works by having Hadoop communicate with external map and reduce programs over stdin and stdout. More information on what Hadoop streaming is and how it works is available at http://hadoop.apache.org/docs/r1.2.1/streaming.html.
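To make the stdin/stdout contract concrete, here is a minimal, generic word-count sketch of a streaming mapper and reducer (an illustration only, not the mongo-hadoop example built later in this recipe): the mapper emits tab-separated key/value pairs, and the reducer aggregates pairs that Hadoop's shuffle phase has already sorted by key.

```python
#!/usr/bin/env python
# Minimal Hadoop streaming sketch (generic word count, not the
# mongo-hadoop example from this recipe). The mapper turns raw
# input lines into tab-separated key/value pairs; the reducer
# receives those pairs sorted by key and sums the counts.
import sys
from itertools import groupby

def map_lines(lines):
    """Emit (word, 1) pairs for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_pairs(pairs):
    """Sum counts per word. Assumes pairs arrive sorted by key,
    which Hadoop's shuffle phase guarantees between map and reduce."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == '__main__':
    # In a real streaming job this script would be invoked twice,
    # once as the mapper and once as the reducer, each reading
    # stdin and writing stdout; here we demo with literal lines.
    demo = ["hello world", "hello streaming"]
    pairs = sorted(map_lines(demo))
    for word, total in reduce_pairs(pairs):
        sys.stdout.write('%s\t%d\n' % (word, total))
```

In a real job, the same contract holds regardless of language: anything that reads lines from stdin and writes `key\tvalue` lines to stdout can serve as a mapper or reducer.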
Refer to the Executing our first sample MapReduce job using the mongo-hadoop connector recipe to see how to set up Hadoop for development purposes and build the mongo-hadoop project using gradle. As far as Python libraries are concerned, we will install the required library from source; however, you can use pip to carry out the setup if you do not wish to build from source. We will also see how to...
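If you prefer pip over a source build, the setup might look like the following. Note that the streaming-library package name `pymongo_hadoop` is an assumption here; verify it against the mongo-hadoop project's documentation before installing.

```shell
# Hedged sketch: install the Python streaming support with pip
# instead of building the mongo-hadoop Python library from source.
pip install pymongo          # MongoDB driver the streaming library depends on
pip install pymongo_hadoop   # assumed package name; check the mongo-hadoop docs
```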