De-duplicating data using Hadoop streaming


Datasets often contain duplicate items that need to be eliminated to ensure the accuracy of the results. In this recipe, we use Hadoop to remove the duplicate mail records from the 20news dataset. These duplicates exist because users cross-posted the same message to multiple newsgroups.
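
With Hadoop streaming, de-duplication follows a simple pattern: the mapper emits an identical key for every copy of a message, the framework sorts and groups the map output by that key, and the reducer writes out a single record per key. The following is a minimal sketch of such a reducer, assuming the mapper emits tab-separated <message-id> and record pairs; the script name and record format here are illustrative, not the book's actual code:

    #!/usr/bin/env python
    # dedup_reducer.py (illustrative): keep the first record seen for
    # each key. Hadoop streaming delivers the mapper output sorted by
    # key, so all duplicates of a message arrive on consecutive lines.
    import sys

    previous_key = None
    for line in sys.stdin:
        key, _, record = line.rstrip('\n').partition('\t')
        if key != previous_key:
            # First occurrence of this message ID: emit it once
            print(record)
            previous_key = key
        # Further lines with the same key are duplicates and are skipped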

Getting ready

  • Make sure Python is installed on your Hadoop compute nodes.
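
For example, you can quickly check that the interpreter is available on a node as follows (the exact version required depends on the scripts you use):

    $ python --version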

How to do it...

The following steps show how to remove the duplicate mails, which result from cross-posting across multiple newsgroups, from the 20news dataset:

  1. Download and extract the 20news dataset from http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz:

    $ wget http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz
    $ tar -xzf 20news-19997.tar.gz
    
  2. Upload the extracted data to HDFS. In order to save compute time and resources, you can use only a subset of the dataset, as shown in the optional commands below:

    $ hdfs dfs -mkdir 20news-all
    $ hdfs dfs -put <extracted_folder> 20news-all
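    # Optional: instead of the whole dataset, you can upload only a
    # subset, such as one or two of the twenty newsgroup directories:
    $ hdfs dfs -put <extracted_folder>/comp.graphics 20news-all
    $ hdfs dfs -put <extracted_folder>/sci.space 20news-all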
    
  3. We are going to use the MailPreProcessor.py Python...
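
A rough, illustrative sketch of what such a de-duplication mapper could look like is shown below. It keys each mail by its Message-ID header, since cross-posted copies of the same message carry the same Message-ID, and it assumes each streaming input record is one complete mail message (for example, delivered by an input format that passes a whole file as a single record); it is not the book's actual MailPreProcessor.py:

    #!/usr/bin/env python
    # dedup_mapper.py (illustrative): emit the message ID followed by a
    # tab and the message text, so that every copy of a cross-posted
    # mail shares the same key and is grouped for the reducer sketched
    # earlier.
    import re
    import sys

    for line in sys.stdin:
        message = line.rstrip('\n')
        # Cross-posted copies of a mail carry the same Message-ID header
        match = re.search(r'Message-ID:\s*(<[^>]+>)', message, re.IGNORECASE)
        key = match.group(1) if match else message  # no header: fall back
        print('%s\t%s' % (key, message))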