Hadoop Streaming allows us to use any executable or script as the Mapper or the Reducer of a Hadoop MapReduce job. It enables rapid prototyping of MapReduce computations using Linux shell utilities or scripting languages, and it allows users with little or no Java knowledge to utilize Hadoop to process data stored in HDFS.
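As a quick illustration of this flexibility (not part of this recipe's steps), a streaming job can be submitted with ordinary shell utilities as the Mapper and Reducer. The streaming jar path and the HDFS input/output paths below are assumptions that vary by Hadoop version and installation:

```shell
# Illustrative sketch only: a streaming job using plain shell utilities.
# /bin/cat passes each input line through as the map output, and
# /usr/bin/wc counts the lines reaching the reducer.
# The jar location and the /data/* paths are placeholders for your setup.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /data/input \
    -output /data/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
```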
In this recipe, we implement a Mapper for our HTTP log processing application in Python and use a Reducer based on Hadoop's aggregate package.
The following are the steps to use a Python program as the Mapper to process the HTTP server log files:
Write the logProcessor.py Python script:

```python
#!/usr/bin/python
import sys
import re

def main(argv):
    # The log-parsing regular expression was elided in the original text.
    # The pattern below is an assumed Apache common-log-format regex, in
    # which group(1) is the client host and group(7) is the response size
    # in bytes.
    regex = re.compile(
        r'(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\d{3}) (\d+)')
    line = sys.stdin.readline()
    try:
        while line:
            fields = regex.match(line)
            if fields is not None:
                # Emit "LongValueSum:<host>\t<bytes>" so the aggregate
                # package's Reducer sums the byte counts for each host.
                print("LongValueSum:" + fields.group(1) + "\t" + fields.group(7))
            line = sys.stdin.readline()
    except EOFError:
        return None

if __name__ == "__main__":
    main(sys.argv)
```