Datasets often contain duplicate items that need to be eliminated to ensure the accuracy of the results. In this recipe, we use Hadoop to remove duplicate mail records from the 20news dataset. These duplicates arise when users cross-post the same message to multiple newsgroups.
The following steps show how to remove from the 20news dataset the duplicate mails that result from cross-posting across newsgroups:
Download and extract the 20news dataset from http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz:
$ wget http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz
$ tar -xzf 20news-19997.tar.gz
Upload the extracted data to HDFS. To save compute time and resources, you can use only a subset of the dataset:
$ hdfs dfs -mkdir 20news-all
$ hdfs dfs -put <extracted_folder> 20news-all
We are going to use the MailPreProcessor.py Python...
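The MailPreProcessor.py script itself is not reproduced here, but the core deduplication idea can be sketched independently. Cross-posted copies of the same mail carry the same Message-ID header, so that header can serve as the key for eliminating duplicates. The following is a minimal, hedged sketch of that idea in plain Python (the header-based keying and the keep-first policy are assumptions for illustration; the actual recipe script may differ):

```python
def message_id(mail_text):
    """Extract the Message-ID header from a raw mail.

    Cross-posted copies of the same mail share a Message-ID even
    though they appear under different newsgroup folders.
    """
    for line in mail_text.splitlines():
        if line.lower().startswith("message-id:"):
            return line.split(":", 1)[1].strip()
    return None  # malformed mail without a Message-ID header


def deduplicate(mails):
    """Keep the first mail seen for each Message-ID.

    This mirrors what a reducer would do after grouping mails by
    Message-ID: emit one representative per key. (Assumption: the
    real MailPreProcessor.py may use different selection logic.)
    """
    seen = set()
    unique = []
    for mail in mails:
        mid = message_id(mail)
        if mid is None or mid not in seen:
            if mid is not None:
                seen.add(mid)
            unique.append(mail)
    return unique


if __name__ == "__main__":
    # Two cross-posted copies of the same mail plus one distinct mail.
    a = "Newsgroups: comp.graphics\nMessage-ID: <1@x>\n\nhello"
    b = "Newsgroups: sci.space\nMessage-ID: <1@x>\n\nhello"
    c = "Newsgroups: sci.space\nMessage-ID: <2@x>\n\nother"
    print(len(deduplicate([a, b, c])))  # the duplicate copy is dropped
```

In a Hadoop Streaming job, `message_id` would run in the mapper to emit `(Message-ID, mail)` pairs, and the keep-one-per-key step of `deduplicate` would run in the reducer, where Hadoop has already grouped the pairs by key.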