Book Image

Hadoop MapReduce v2 Cookbook - Second Edition: RAW

Book Image

Hadoop MapReduce v2 Cookbook - Second Edition: RAW

Overview of this book

Table of Contents (19 chapters)
Hadoop MapReduce v2 Cookbook Second Edition
Credits
About the Author
Acknowledgments
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Using Hadoop with legacy applications – Hadoop streaming


Hadoop streaming allows us to use any executable or a script as the Mapper or the Reducer of a Hadoop MapReduce job. Hadoop streaming enables us to perform rapid prototyping of the MapReduce computations using Linux shell utility programs or using scripting languages. Hadoop streaming also allows the users with some or no Java knowledge to utilize Hadoop to process data stored in HDFS.

In this recipe, we implement a Mapper for our HTTP log processing application using Python and use a Hadoop aggregate-package-based Reducer.

How to do it...

The following are the steps to use a Python program as the Mapper to process the HTTP server log files:

  1. Write the logProcessor.py python script:

    #!/usr/bin/python
    import sys
    import re
    def main(argv):
      regex =re.compile('……')
      line = sys.stdin.readline()
      try:
        while line:
          fields = regex.match(line)
          if(fields!=None):
            print"LongValueSum:"+fields.group(1)+
              "\t"+fields...