Practical Data Analysis - Second Edition
The spam dataset is available at the following link:
http://spamassassin.apache.org/
In the following screenshot, we can see the easy ham (not spam) folder with 2551 files:

The spam text, shown in the following screenshot, may include both HTML tags and plain text. In this case, we are only interested in the subject line, so we need to write code to obtain the subject from all the files.
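
As a rough illustration of pulling the subject out of a single message, the following sketch uses Python's standard email package rather than whatever approach the book goes on to show. The function name extract_subject and the sample message are hypothetical, introduced only for this example.

```python
from email import message_from_string


def extract_subject(raw_message):
    """Return the Subject header of a raw e-mail, or "" if it has none.

    Hypothetical helper for illustration; not taken from the book.
    """
    message = message_from_string(raw_message)
    subject = message.get("Subject", "")
    # A long header may be folded over several lines; collapse the
    # embedded line breaks and extra spaces into single spaces.
    return " ".join(str(subject).split())


# Made-up sample: a folded Subject header followed by an HTML body.
sample = (
    "From: someone@example.com\n"
    "Subject: Hello\n"
    " world\n"
    "\n"
    "<html>body</html>\n"
)
print(extract_subject(sample))
```

Because the parser separates headers from the body, a line that merely starts with "Subject:" inside the HTML or plain-text body is not mistaken for the real subject.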

This example will show you how to preprocess the SpamAssassin data using Python in order to collect all the subject lines from the e-mails.
First, we need to import the os module in order to get the list of file names using the listdir function from the " \spam" and " \easy_ham" folders:
import os
files = os.listdir(r" \spam")
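
Since listdir returns bare names and also includes any sub-directories, it can help to turn the names into full paths and keep only regular files. The sketch below is one possible way to do that; list_message_files is a hypothetical helper, not part of the book's code.

```python
import os


def list_message_files(folder):
    """Return the sorted full paths of the regular files directly in folder.

    Hypothetical helper for illustration; not taken from the book.
    """
    paths = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):  # skip sub-directories
            paths.append(path)
    return paths
```

Calling list_message_files once on the spam folder and once on the easy_ham folder would give the two lists of files to process, assuming both folders exist at the paths you pass in.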
Now we need a new file to store the subject lines and the category (spam or not spam); this time, we will use a comma as a separator:
with open("SubjectsSpam.out","a") as out:
    category = "spam"
Now we will parse each file and get the subject. Finally, we write...
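
One detail worth keeping in mind with a comma separator is that subject lines can themselves contain commas. A minimal sketch of the writing step, assuming the subjects have already been collected into a list, could rely on the standard csv module so that such fields are quoted. The function name append_subjects and its parameters are hypothetical.

```python
import csv


def append_subjects(out_path, subjects, category):
    """Append one "subject,category" row per subject to out_path.

    Hypothetical helper for illustration; not taken from the book.
    """
    # newline="" lets the csv module control line endings itself.
    with open(out_path, "a", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)  # quotes fields that contain commas
        for subject in subjects:
            writer.writerow([subject, category])
```

For example, append_subjects("SubjectsSpam.out", subjects, "spam") would add the spam rows, and a second call with the category "not spam" could add the subjects from the easy_ham folder to the same file.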