Hadoop streaming is a prebuilt utility that ships with the Hadoop distribution. It lets you create a MapReduce job using any executable program or script as the mapper and reducer.
As discussed in Chapter 5, Programming Hadoop on Amazon EMR, assume your local copy of the Hadoop distribution is in <hadoop-2.2.0-base-path>. You should be able to find the streaming utility JAR file at <hadoop-2.2.0-base-path>/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar. Say you have written your mapper and reducer in Python, as mapper.py and reducer.py respectively. You can then use the streaming utility locally by executing the following command:
<hadoop-2.2.0-base-path>/bin/hadoop jar <hadoop-2.2.0-base-path>/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -input <inputDirectoryOrFile> \
    -output <outputDirectory> \
    -mapper mapper.py \
    -reducer reducer.py
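To make the mapper and reducer roles concrete, here is a minimal word-count sketch of what mapper.py and reducer.py might contain. It is an illustration only, not the code used elsewhere in this book. It assumes the usual streaming convention that each script reads lines from standard input and writes tab-separated key/value lines to standard output, and that the reducer receives its input sorted by key. The function names map_lines, reduce_lines, and run_locally are hypothetical; run_locally only imitates the sort that Hadoop performs between the two phases so that the logic can be checked without a cluster.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper logic: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reduce_lines(lines):
    """Reducer logic: sum the counts for each word.

    Assumes the records arrive sorted by key, as they do in streaming.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


def run_locally(lines):
    """Hypothetical helper: imitate map, sort, and reduce in one process."""
    mapped = sorted(map_lines(lines))  # stands in for Hadoop's shuffle and sort
    return list(reduce_lines(mapped))
```

In a real job, mapper.py would loop over map_lines(sys.stdin) and print each record, and reducer.py would do the same with reduce_lines(sys.stdin). Each script would typically also need to be executable and start with an interpreter line such as #!/usr/bin/env python so that the streaming utility can launch it.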