Practical Data Analysis - Second Edition

By: Hector Cuesta, Dr. Sampath Kumar

Overview of this book

Beyond buzzwords like Big Data or Data Science, there are great opportunities to innovate in many businesses by using data analysis to build data-driven products. Data analysis involves asking many questions about data in order to discover insights and generate value for a product or a service. This book explains the basic data algorithms without the theoretical jargon, and you’ll get hands-on experience turning data into insights using machine learning techniques. We will perform data-driven innovation processing for several types of data, such as text, images, social network graphs, documents, and time series, and show you how to implement large-scale data processing with MongoDB and Apache Spark.

The data


We can find the spam dataset at the following link:

http://spamassassin.apache.org/

The easy ham (not spam) folder contains 2551 files.

The spam text may include both HTML tags and plain text. In this case, we are only interested in the subject line, so we need to write code to obtain the subject from all the files.
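As a minimal sketch of what obtaining the subject means for a single message, the snippet below reads the Subject header with Python's standard email module. The extract_subject helper and the sample message are hypothetical and only serve as illustration; this is one possible approach, not necessarily the one used in the rest of this example.

```python
from email import message_from_string


def extract_subject(raw_message):
    """Return the Subject header of a raw e-mail, or "" if there is none."""
    msg = message_from_string(raw_message)
    subject = msg["Subject"]  # None when the header is missing
    return "" if subject is None else str(subject).strip()


# Hypothetical message: headers, a blank line, then an HTML body
sample = (
    "From: someone@example.com\n"
    "Subject: Cheap offer just for you\n"
    "Content-Type: text/html\n"
    "\n"
    "<html><body><b>Buy now</b></body></html>\n"
)

print(extract_subject(sample))
```

Because the parser separates headers from the body, HTML tags in the body do not interfere with the lookup.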

This example will show you how to preprocess the SpamAssassin data using Python in order to collect all the subject lines from the e-mails.

First, we import the os module so that we can use its listdir function to get the list of file names in the " \spam" and " \easy_ham" folders:

import os 
files = os.listdir(r" \spam") 
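The listing step can be sketched as a small reusable function. The list_message_files name is hypothetical, and the demo runs on a throwaway temporary directory that stands in for the real spam folder, whose full path is not given here. Note that os.listdir also returns subdirectory names and makes no promise about ordering, so the sketch filters for regular files and sorts the result.

```python
import os
import tempfile


def list_message_files(folder):
    """Return the sorted full paths of the regular files inside folder."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )


# Demo on a temporary directory standing in for the spam folder (assumption)
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("0002.msg", "0001.msg"):
        with open(os.path.join(demo_dir, name), "w") as fh:
            fh.write("Subject: test\n\nbody\n")
    os.mkdir(os.path.join(demo_dir, "subfolder"))  # skipped by the filter
    demo_files = list_message_files(demo_dir)
    demo_names = [os.path.basename(path) for path in demo_files]
    print(demo_names)
```

The same function can then be called once for the spam folder and once for the easy ham folder.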

Now we need a new file to store the subject lines and the category (spam or not spam); this time, we will use a comma as a separator:

with open("SubjectsSpam.out","a") as out: 
     category = "spam" 

Now we will parse each file and get the subject. Finally, we write...
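One way the parse-and-write step could look is sketched below, under several assumptions: the read_subject and collect_subjects helpers, the demo folder, and the SubjectsDemo.out file name are all hypothetical, and the subject lookup is a plain line scan over the headers. The sketch writes rows through the csv module because subject lines often contain commas themselves, and an unquoted comma would break the subject,category layout.

```python
import csv
import os
import tempfile


def read_subject(path):
    """Return the first Subject header found before the blank line that ends the headers."""
    # latin-1 maps every byte to a character, so mixed encodings never raise
    with open(path, encoding="latin-1") as fh:
        for line in fh:
            if line.strip() == "":
                break  # headers are over, the body starts here
            if line.lower().startswith("subject:"):
                return line[len("subject:"):].strip()
    return ""


def collect_subjects(folder, category, out_path):
    """Append one subject,category row per file in folder; return the row count."""
    count = 0
    with open(out_path, "a", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)  # quotes subjects that contain commas
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            if not os.path.isfile(path):
                continue
            writer.writerow([read_subject(path), category])
            count += 1
    return count


# Demo with two hypothetical messages in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "spam_demo")
    os.mkdir(folder)
    with open(os.path.join(folder, "0001.msg"), "w", encoding="latin-1") as fh:
        fh.write("From: a@example.com\nSubject: Win money, fast\n\nbody\n")
    with open(os.path.join(folder, "0002.msg"), "w", encoding="latin-1") as fh:
        fh.write("From: b@example.com\n\nSubject: not a header\n")
    out_path = os.path.join(tmp, "SubjectsDemo.out")
    written = collect_subjects(folder, "spam", out_path)
    with open(out_path, newline="", encoding="utf-8") as fh:
        rows = list(csv.reader(fh))
    print(written, rows)
```

Running the function a second time with the category "not spam" on the easy ham folder would append those rows to the same file. A subject that is folded across several header lines is cut to its first line by this simple scan.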