We will use the SMS Spam Collection dataset from the UCI Machine Learning Repository to build a spam classifier. The classifier predicts the class of each message, labeling it as either spam or ham (a legitimate message). Several different classifiers can be used for this task.
In this example, we opt for algorithms such as Naive Bayes, random forest, and support vector machines to train our models.
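To make the idea of a spam/ham classifier concrete, here is a minimal sketch of a multinomial Naive Bayes classifier written with only the Python standard library. It is not the implementation used in this example: the function names `train_naive_bayes` and `predict`, the add-one smoothing, and the four toy messages are all assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """Count documents per label and words per label.

    samples: list of (tokens, label) pairs, where tokens is a list of words.
    """
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log-posterior score."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        # Log prior: fraction of training messages with this label.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Add-one (Laplace) smoothing avoids log(0) for unseen words.
            numerator = word_counts[label][token] + 1
            score += math.log(numerator / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy messages (already tokenized), for illustration only.
toy_samples = [
    (["free", "prize", "claim"], "spam"),
    (["win", "free", "cash"], "spam"),
    (["see", "you", "at", "lunch"], "ham"),
    (["call", "me", "later"], "ham"),
]
model = train_naive_bayes(toy_samples)
print(predict(model, ["free", "cash"]))
print(predict(model, ["lunch", "later"]))
```

In practice, a library implementation would normally be preferred over hand-written counting. The sketch only shows the mechanics of scoring a message against each class.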
Before training, we clean and prepare the data. To preprocess it, we will perform the following sequence of steps:
- Convert all text to lowercase
- Remove punctuation
- Remove stop words
- Perform stemming
- Tokenize the data
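The steps above can be sketched in plain Python as follows. This is a rough illustration under stated assumptions, not the pipeline used in this example: `STOP_WORDS` is a tiny hypothetical list, `naive_stem` is a crude suffix stripper standing in for a real stemmer, and `preprocess` is a hypothetical helper name. The sketch splits the text into tokens right after punctuation removal, because stop-word removal and stemming are easiest to apply word by word.

```python
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "you", "your",
              "for", "of", "and", "in", "now"}


def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters of the word after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(message):
    """Lowercase, strip punctuation, tokenize, drop stop words, stem."""
    text = message.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    tokens = [token for token in tokens if token not in STOP_WORDS]
    return [naive_stem(token) for token in tokens]


print(preprocess("You are WINNING a free prize!!! Claim now."))
# prints ['winn', 'free', 'prize', 'claim']
```

Note that the crude stemmer turns "winning" into "winn" rather than "win". A proper stemming algorithm handles doubled consonants and many other cases that this sketch ignores.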
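We also process our data using term frequency-inverse document frequency (TF-IDF), which measures how important a word is to a message relative to the whole collection of messages. The term frequency (TF) component tells us how often a word appears in a message or a document. TF is calculated as: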
TF = No. of times a word appears in a...
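As a small worked illustration of term frequency, the sketch below assumes the commonly used definition, in which a word's count in a document is divided by the total number of words in that document. The function name `term_frequency` is hypothetical, and the input is assumed to be a list of already preprocessed tokens.

```python
from collections import Counter


def term_frequency(tokens):
    """Map each word to (its count in the document) / (total words).

    Assumes the common normalized definition of TF.
    """
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: count / total for word, count in counts.items()}


print(term_frequency(["free", "prize", "free", "call"]))
# prints {'free': 0.5, 'prize': 0.25, 'call': 0.25}
```

Here "free" appears twice among four tokens, so its term frequency is 0.5, while "prize" and "call" each score 0.25.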