The final task of this chapter is to apply our newly gained skills to building a real spam filter! This task amounts to solving a binary (spam/ham) classification problem using the Naive Bayes algorithm.
Naive Bayes classifiers are a very popular choice for email filtering. Their naive assumption that features are independent of one another lends itself nicely to the analysis of text data, where each feature is a word (or a bag of words), and it would not be feasible to model the dependence of every word on every other word.
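To make the independence assumption concrete, here is a minimal sketch of a word-count Naive Bayes classifier in plain Python. It is not the implementation used later in this chapter: the four training messages are made up for illustration, the helper names `fit` and `predict` are hypothetical, and add-one (Laplace) smoothing is an assumption chosen so that unseen words do not produce zero probabilities.

```python
import math
from collections import Counter

# Tiny made-up corpus (hypothetical, for illustration only).
train = [
    ("win cash prize now", "spam"),
    ("cheap cash offer now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]


def fit(samples):
    """Count how often each class and each (class, word) pair occurs."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in samples:
        words = text.split()
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict(text, class_counts, word_counts, vocab):
    """Return the class with the highest log-posterior score."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        # Start from the log prior of the class.
        score = math.log(n / total)
        # Denominator for add-one (Laplace) smoothing.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen during training
            # The "naive" step: each word contributes independently,
            # so the log-likelihoods are simply added up.
            score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


model = fit(train)
label_a = predict("cash prize", *model)
label_b = predict("monday meeting", *model)
print(label_a, label_b)
```

Because every word is scored on its own, the model only needs one count per word and class, rather than a count for every combination of words. That is what keeps the approach tractable on vocabularies with many thousands of words.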
There are a bunch of good email datasets out there, such as the following:
- The Hewlett-Packard spam database: https://archive.ics.uci.edu/ml/machine-learning-databases/spambase
- The Enron-Spam dataset: http://www.aueb.gr/users/ion/data/enron-spam
In this section, we will be using the Enron-Spam dataset, which can be downloaded for free from the given...