Mastering Machine Learning with scikit-learn

Mastering Machine Learning with scikit-learn - Second Edition

By: Gavin Hackeling

Overview of this book

Machine learning brings computer science and statistics together to build smart and efficient models. Using the algorithms and techniques that machine learning offers, you can automate many analytical models. This book examines a variety of machine learning models, including popular algorithms such as k-nearest neighbors, logistic regression, naive Bayes, k-means, decision trees, and artificial neural networks. It discusses data preprocessing, hyperparameter optimization, and ensemble methods. You will build systems that classify documents, recognize images, detect ads, and more. You will learn to use scikit-learn's API to extract features from categorical variables, text, and images; evaluate model performance; and develop an intuition for how to improve your model's performance. By the end of this book, you will have mastered the scikit-learn concepts required to build efficient models and carry out advanced tasks with a practical approach.

Training data, testing data, and validation data

As mentioned previously, a training set is a collection of observations. These observations comprise the experience that the algorithm uses to learn. In supervised learning problems, each observation consists of an observed response variable and features of one or more observed explanatory variables. The test set is a similar collection of observations. It is used to evaluate the performance of the model using some performance metric. It is important that no observations from the training set are included in the test set. If the test set does contain examples from the training set, it will be difficult to assess whether the algorithm has learned to generalize from the training set or has simply memorized it. A program that generalizes well will be able to perform a task effectively with new data. In contrast, a program that memorizes the training data by learning an overly complex model could predict the values of the response variable for the training set accurately, but will fail to predict the value of the response variable for new examples. Memorizing the training set is called overfitting. A program that memorizes its observations may not perform its task well, as it could memorize relations and structure that are coincidental in the training data. Balancing generalization and memorization is a problem common to many machine learning algorithms. In later chapters, we will discuss regularization, which can be applied to many models to reduce overfitting.
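The separation of training and test observations can be sketched in a few lines. The example below is only an illustration: the choice of the bundled iris dataset, the 25 percent test fraction, and the unconstrained decision tree are assumptions made here, not choices from the text. Comparing the score on the training set with the score on the held-out test set is one simple way to notice a model that has memorized its training data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset (an assumption for this sketch).
X, y = load_iris(return_X_y=True)

# Hold out 25% of the observations. train_test_split assigns each
# observation to exactly one of the two sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A tree with no depth limit is free to fit the training set very closely.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Accuracy on data the model has seen versus data it has not seen.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"training accuracy: {train_accuracy:.3f}")
print(f"test accuracy:     {test_accuracy:.3f}")
```

A large gap between the two numbers would suggest overfitting; on a small, clean dataset such as this one the gap may be modest.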

In addition to the training and test data, a third set of observations, called a validation or hold-out set, is sometimes required. The validation set is used to tune variables called hyperparameters that control how the algorithm learns from the training data. The program is still evaluated on the test set to provide an estimate of its performance in the real world. The validation set should not be used to estimate real-world performance because the program has been tuned to learn from the training data in a way that optimizes its score on the validation data; the program will not have this advantage in the real world.

It is common to partition a single set of supervised observations into training, validation, and test sets. There are no requirements for the sizes of the partitions, and they may vary according to the amount of data available. It is common to allocate between fifty and seventy-five percent of the data to the training set, ten to twenty-five percent of the data to the test set, and the remainder to the validation set.
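As a rough sketch of the partitioning and tuning described above, the following splits one dataset into training, validation, and test sets, tunes a single hyperparameter on the validation set, and reports the final score on the test set. The 60/20/20 proportions, the iris dataset, the k-nearest neighbors classifier, and the candidate values of `k` are all illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Set aside 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
# Take 25% of the remaining 80% as the validation set (20% of the total),
# leaving 60% of the total for training.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

# Tune the hyperparameter k using only the training and validation sets.
best_k, best_val_score = None, -1.0
for k in (1, 3, 5, 7, 9):
    candidate = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_score = candidate.score(X_val, y_val)
    if val_score > best_val_score:
        best_k, best_val_score = k, val_score

# Estimate real-world performance on the test set, which played no part
# in choosing k.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print(f"chosen k: {best_k}, validation accuracy: {best_val_score:.3f}")
print(f"test accuracy: {test_score:.3f}")
```

The validation accuracy is typically an optimistic estimate because `k` was selected to maximize it, which is why the test score is reported separately.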

Some training sets may contain only a few hundred observations; others may include millions. Inexpensive storage, increased network connectivity, and the ubiquity of sensor-packed smartphones have contributed to the contemporary state of big data, or training sets with millions or billions of examples. While this book will not work with datasets that require parallel processing on tens or hundreds of computers, the predictive power of many machine learning algorithms improves as the amount of training data increases. However, machine learning algorithms also follow the maxim "garbage in, garbage out". A student who studies for a test by reading a large, confusing textbook that contains many errors will likely not score better than a student who reads a short but well-written textbook. Similarly, an algorithm trained on a large collection of noisy, irrelevant, or incorrectly labeled data will not perform better than an algorithm trained on a smaller set of data that is more representative of the problem in the real world.

Many supervised training sets are prepared manually or by semi-automated processes. Creating a large collection of supervised data can be costly in some domains. Fortunately, several datasets are bundled with scikit-learn, allowing developers to focus on experimenting with models instead. During development, and particularly when training data is scarce, a practice called cross-validation can be used to train and validate a model on the same data. In cross-validation, the training data is partitioned. The model is trained using all but one of the partitions and tested on the remaining partition. The partitions are then rotated several times so that the model is trained and evaluated on all of the data. The mean of the model's scores on each of the partitions is a better estimate of performance in the real world than an evaluation using a single training/testing split. The following diagram depicts cross-validation with five partitions, or folds.

The original dataset is partitioned into five subsets of equal size, labeled A through E. Initially, the model is trained on partitions B through E and tested on partition A. In the next iteration, the model is trained on partitions A, C, D, and E, and tested on partition B. The partitions are rotated until models have been trained and tested on all of the partitions. Cross-validation provides a more accurate estimate of the model's performance than testing on a single partition of the data.
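The rotation of folds described above might be sketched with scikit-learn's `KFold` and `cross_val_score`. The dataset, the classifier, and the shuffling seed are illustrative assumptions. The first loop only checks that each observation is held out exactly once; `cross_val_score` then trains and scores one model per fold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Five folds. Shuffling matters for this dataset because its rows are
# ordered by class.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Count how many times each observation appears in a held-out fold.
times_tested = np.zeros(len(X), dtype=int)
for fold_number, (train_idx, test_idx) in enumerate(folds.split(X), start=1):
    times_tested[test_idx] += 1
    print(f"fold {fold_number}: train on {len(train_idx)}, test on {len(test_idx)}")

# Train on four folds and score on the fifth, once per fold.
model = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(model, X, y, cv=folds)
print("per-fold accuracy:", np.round(scores, 3))
print(f"mean accuracy: {scores.mean():.3f}")
```

The mean of the per-fold scores is the cross-validated estimate; the spread of the individual scores gives a rough sense of how much a single split could mislead.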
