Our first problem is a modern version of the canonical binary classification problem: spam classification. In our version, however, we will classify SMS messages rather than e-mail as either spam or ham (legitimate messages). We will extract TF-IDF features from the messages using the techniques you learned in Chapter 3, Feature Extraction and Preprocessing, and classify the messages using logistic regression.
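As a rough preview of this approach, the following is a minimal sketch, assuming a recent version of scikit-learn, that chains TF-IDF feature extraction with a logistic regression classifier. The four training messages and their labels are hypothetical examples invented for illustration; they are not taken from the data set, and the names `messages`, `labels`, and `classifier` are our own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages, invented for illustration only;
# they are not taken from the SMS Spam Collection.
messages = [
    "WINNER! Claim your free prize now",
    "Free entry in a weekly cash prize draw, text WIN now",
    "Are we still meeting for lunch today?",
    "Ok, see you at home later",
]
labels = ["spam", "spam", "ham", "ham"]

# Chain TF-IDF feature extraction with a logistic regression classifier.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(messages, labels)

# Predict a label and class probabilities for unseen messages.
new_messages = ["Claim your free cash prize", "See you at lunch"]
predictions = classifier.predict(new_messages)
probabilities = classifier.predict_proba(new_messages)
print(predictions)
print(probabilities)
```

Wrapping both steps in a pipeline means that the vocabulary and IDF weights learned from the training messages are reused when new messages are transformed. With only four training messages the predictions carry little meaning; the sketch shows only the shape of the workflow.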
We will use the SMS Spam Collection Data Set from the UCI Machine Learning Repository. The data set can be downloaded from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. First, let's explore the data set and calculate some basic summary statistics using pandas:
>>> import pandas as pd
>>> # The file is tab-separated and has no header row
>>> df = pd.read_csv('data/SMSSpamCollection', delimiter='\t', header=None)
>>> print(df.head())
      0                                                  1
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam...