Book Image

Mastering Machine Learning with scikit-learn

By : Gavin Hackeling
Book Image

Mastering Machine Learning with scikit-learn

By: Gavin Hackeling

Overview of this book

<p>This book examines machine learning models including logistic regression, decision trees, and support vector machines, and applies them to common problems such as categorizing documents and classifying images. It begins with the fundamentals of machine learning, introducing you to the supervised-unsupervised spectrum, the uses of training and test data, and evaluating models. You will learn how to use generalized linear models in regression problems, as well as solve problems with text and categorical features.</p> <p>You will be acquainted with the use of logistic regression, regularization, and the various loss functions that are used by generalized linear models. The book will also walk you through an example project that prompts you to label the most uncertain training examples. You will also use an unsupervised Hidden Markov Model to predict stock prices.</p> <p>By the end of the book, you will be an expert in scikit-learn and will be well versed in machine learning.</p>
Table of Contents (17 chapters)
Mastering Machine Learning with scikit-learn
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Spam filtering


Our first problem is a modern version of the canonical binary classification problem: spam classification. In our version, however, we will classify spam and ham SMS messages rather than e-mail. We will extract TF-IDF features from the messages using techniques you learned in Chapter 3, Feature Extraction and Preprocessing, and classify the messages using logistic regression.

We will use the SMS Spam Classification Data Set from the UCI Machine Learning Repository. The dataset can be downloaded from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. First, let's explore the data set and calculate some basic summary statistics using pandas:

>>> import pandas as pd
>>> df = pd.read_csv('data/SMSSpamCollection', delimiter='\t', header=None)
>>> print df.head()

      0                                                  1
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam...