Mastering Machine Learning with scikit-learn

Mastering Machine Learning with scikit-learn - Second Edition

By: Gavin Hackeling

Overview of this book

Machine learning brings computer science and statistics together to build smart and efficient models. Using the powerful algorithms and techniques offered by machine learning, you can automate any analytical model. This book examines a variety of machine learning models, including popular algorithms such as k-nearest neighbors, logistic regression, naive Bayes, k-means, decision trees, and artificial neural networks. It discusses data preprocessing, hyperparameter optimization, and ensemble methods. You will build systems that classify documents, recognize images, detect ads, and more. You will learn to use scikit-learn's API to extract features from categorical variables, text, and images; to evaluate model performance; and to develop an intuition for how to improve your model's performance. By the end of this book, you will have mastered the scikit-learn concepts required to build efficient models and carry out advanced tasks in practice.

Title Page

Credits

About the Author

About the Reviewer

www.PacktPub.com

Customer Feedback

Preface

The Fundamentals of Machine Learning

Defining machine learning

Learning from experience

Machine learning tasks

Training data, testing data, and validation data

Bias and variance

An introduction to scikit-learn

Installing scikit-learn

Installing pandas, Pillow, NLTK, and matplotlib

Summary

Simple Linear Regression

Simple linear regression

Evaluating the model

Summary

Classification and Regression with k-Nearest Neighbors

K-Nearest Neighbors

Lazy learning and non-parametric models

Classification with KNN

Regression with KNN

Summary

Feature Extraction

Extracting features from categorical variables

Standardizing features

Extracting features from text

Extracting features from images

Summary

From Simple Linear Regression to Multiple Linear Regression

Multiple linear regression

Polynomial regression

Regularization

Applying linear regression

Gradient descent

Summary

From Linear Regression to Logistic Regression

Binary classification with logistic regression

Spam filtering

Tuning models with grid search

Multi-class classification

Multi-label classification and problem transformation

Summary

Naive Bayes

Bayes' theorem

Generative and discriminative models

Naive Bayes

Naive Bayes with scikit-learn

Summary

Nonlinear Classification and Regression with Decision Trees

Decision trees

Training decision trees

Decision trees with scikit-learn

Summary

From Decision Trees to Random Forests and Other Ensemble Methods

Bagging

Boosting

Stacking

Summary

The Perceptron

The perceptron

Limitations of the perceptron

Summary

From the Perceptron to Support Vector Machines

Kernels and the kernel trick

Maximum margin classification and support vectors

Classifying characters in scikit-learn

Summary

From the Perceptron to Artificial Neural Networks

Nonlinear decision boundaries

Feed-forward and feedback ANNs

Multi-layer perceptrons

Training multi-layer perceptrons

Summary

K-means

Clustering

K-means

Evaluating clusters

Image quantization

Clustering to learn features

Summary

Dimensionality Reduction with Principal Component Analysis

Principal component analysis

Visualizing high-dimensional data with PCA

Face recognition with PCA

Summary

Index

Bias and variance

Many metrics can be used to measure whether or not a program is learning to perform its task more effectively. For supervised learning problems, many performance metrics measure the amount of prediction error. There are two fundamental causes of prediction error: a model's bias and its variance.

Assume that you have many training sets that are all unique, but equally representative of the population. A model with high bias will produce similar errors for an input regardless of the training set it used to learn; the model favors its own assumptions about the real relationship over the relationship demonstrated in the training data. A model with high variance, conversely, will produce different errors for an input depending on the training set that it used to learn. A model with high bias is inflexible, but a model with high variance may be so flexible that it models the noise in the training set. That is, a model with high variance over-fits the training data, while a model with high bias under-fits the training data.

It can be helpful to visualize bias and variance as darts thrown at a dartboard. Each dart is analogous to a prediction, and is thrown by a model trained on a different dataset every time. A model with high bias but low variance will throw darts that are tightly clustered, but that could be far from the bulls-eye. A model with high bias and high variance will throw darts all over the board; the darts are far from the bulls-eye and from each other. A model with low bias and high variance will throw darts that could be poorly clustered but close to the bulls-eye. Finally, a model with low bias and low variance will throw darts that are tightly clustered around the bulls-eye.

Ideally, a model will have both low bias and low variance, but efforts to decrease one will frequently increase the other. This is known as the bias-variance trade-off. We will discuss the biases and variances of many of the models introduced in this book.
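The thought experiment with many training sets can be simulated directly. The following is a minimal sketch that uses only the Python standard library, not scikit-learn. The true relationship, the noise level, the query point `x0`, and the two toy models are all assumptions chosen for illustration: one model always predicts the mean of the training responses (inflexible), and the other copies the response of the single nearest training instance (very flexible).

```python
import random
import statistics


def true_function(x):
    # Hypothetical ground-truth relationship, chosen only for illustration.
    return x ** 2


def sample_training_set(rng, n=30, noise=0.3):
    # Draw n noisy observations of the true function on [0, 1].
    xs = [rng.random() for _ in range(n)]
    ys = [true_function(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def predict_mean(xs, ys, x0):
    # Inflexible model: ignores x0 and always predicts the training mean.
    return statistics.mean(ys)


def predict_1nn(xs, ys, x0):
    # Flexible model: copies the response of the nearest training instance.
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]


def bias_and_variance(predict, x0, n_sets=500, seed=0):
    # Train on many independent training sets and collect predictions at x0.
    rng = random.Random(seed)
    preds = [predict(*sample_training_set(rng), x0) for _ in range(n_sets)]
    squared_bias = (statistics.mean(preds) - true_function(x0)) ** 2
    variance = statistics.pvariance(preds)
    return squared_bias, variance


x0 = 0.9
mean_bias2, mean_var = bias_and_variance(predict_mean, x0)
knn_bias2, knn_var = bias_and_variance(predict_1nn, x0)
print(f"mean model: squared bias={mean_bias2:.3f}, variance={mean_var:.3f}")
print(f"1-NN model: squared bias={knn_bias2:.3f}, variance={knn_var:.3f}")
```

Under these assumptions, the mean model's predictions cluster tightly but far from the true value (high bias, low variance), while the nearest-neighbor model's predictions center near the true value but scatter with the noise (low bias, high variance).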

Unsupervised learning problems do not have an error signal to measure; instead, performance metrics for unsupervised learning problems measure some attribute of the structure discovered in the data, such as the distances within and between clusters.
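As a rough illustration of measuring distances within and between clusters, the sketch below compares the average distance between points in the same cluster with the average distance between points in different clusters. The helper names and the toy points are hypothetical, and this is not the particular metric used later in the book; scikit-learn provides established measures such as `sklearn.metrics.silhouette_score`.

```python
from itertools import combinations
from math import dist
from statistics import mean


def mean_intra_cluster_distance(clusters):
    # Average distance between pairs of points that share a cluster.
    pairs = [dist(a, b) for pts in clusters for a, b in combinations(pts, 2)]
    return mean(pairs)


def mean_inter_cluster_distance(clusters):
    # Average distance between pairs of points from different clusters.
    pairs = [
        dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]
    return mean(pairs)


# Two hypothetical, well-separated clusters of 2-D points.
clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
]
intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(f"mean intra-cluster distance: {intra:.3f}")
print(f"mean inter-cluster distance: {inter:.3f}")
```

A clustering with small intra-cluster distances and large inter-cluster distances has found compact, well-separated groups.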

Most performance measures can only be calculated for a specific type of task, like classification or regression. Machine learning systems should be evaluated using performance measures that represent the costs associated with making errors in the real world. While this may seem obvious, the following example describes a performance measure that is appropriate for the task in general but not for its specific application.

Consider a classification task in which a machine learning system observes tumors and must predict whether they are malignant or benign. Accuracy, or the fraction of instances that were classified correctly, is an intuitive measure of the program's performance. While accuracy does measure the program's performance, it does not differentiate between malignant tumors that were classified as being benign, and benign tumors that were classified as being malignant. In some applications, the costs associated with all types of errors may be the same. In this problem, however, failing to identify malignant tumors is likely a more severe error than mistakenly classifying benign tumors as being malignant.

We can measure each of the possible prediction outcomes to create different views of the classifier's performance. When the system correctly classifies a tumor as being malignant, the prediction is called a true positive. When the system incorrectly classifies a benign tumor as being malignant, the prediction is a false positive. Similarly, a false negative is an incorrect prediction that the tumor is benign, and a true negative is a correct prediction that a tumor is benign. Note that positive and negative are used only as binary labels, and are not meant to judge the phenomena they signify. In this example, it does not matter whether malignant tumors are coded as positive or negative, so long as they are coded consistently. True and false positives and negatives can be used to calculate several common measures of classification performance, including accuracy, precision and recall.
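The four outcomes can be tallied with a few lines of code. The sketch below is a plain-Python illustration with made-up labels; the function name `confusion_counts` and the choice of "malignant" as the positive class are assumptions. scikit-learn offers `sklearn.metrics.confusion_matrix` for the same purpose.

```python
def confusion_counts(y_true, y_pred, positive="malignant"):
    # Tally true positives, false positives, false negatives, true negatives.
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # correctly predicted malignant
            else:
                fp += 1  # benign tumor predicted to be malignant
        else:
            if actual == positive:
                fn += 1  # malignant tumor predicted to be benign
            else:
                tn += 1  # correctly predicted benign
    return tp, fp, fn, tn


# Hypothetical labels for six tumors.
y_true = ["malignant", "benign", "benign", "malignant", "benign", "malignant"]
y_pred = ["malignant", "malignant", "benign", "benign", "benign", "malignant"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```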

Accuracy is calculated with the following formula, where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision is the fraction of the tumors that were predicted to be malignant that are actually malignant. Precision is calculated with the following formula:

$$P = \frac{TP}{TP + FP}$$

Recall is the fraction of malignant tumors that the system identified. Recall is calculated with the following formula:

$$R = \frac{TP}{TP + FN}$$

In this example, precision measures the fraction of tumors that were predicted to be malignant that are actually malignant. Recall measures the fraction of truly malignant tumors that were detected.
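The three measures can be computed directly from the four outcome counts. The following sketch uses hypothetical counts for 1,000 tumors. Returning 0.0 when a denominator is zero is a convention assumed here, not something the text specifies. scikit-learn provides `accuracy_score`, `precision_score`, and `recall_score` in `sklearn.metrics`.

```python
def accuracy(tp, tn, fp, fn):
    # Fraction of all instances that were classified correctly.
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    # Fraction of positive predictions that are truly positive.
    # Assumed convention: 0.0 when no positive predictions were made.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    # Fraction of truly positive instances that were detected.
    # Assumed convention: 0.0 when there are no positive instances.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: 80 malignant and 920 benign tumors.
tp, tn, fp, fn = 30, 900, 20, 50

print(f"accuracy:  {accuracy(tp, tn, fp, fn):.3f}")
print(f"precision: {precision(tp, fp):.3f}")
print(f"recall:    {recall(tp, fn):.3f}")
```

With these counts the classifier is 93% accurate, yet it detects only 30 of the 80 malignant tumors.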

The precision and recall measures could reveal that a classifier with impressive accuracy actually fails to detect most of the malignant tumors. If most tumors in the testing set are benign, even a classifier that never predicts malignancy could have high accuracy. A different classifier with lower accuracy and higher recall might be better suited to the task, since it will detect more of the malignant tumors.
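This effect is easy to reproduce with a small, made-up testing set. In the sketch below, the class proportions and both sets of predictions are assumptions chosen for illustration: classifier A never predicts malignancy, while classifier B flags every malignant tumor along with some benign ones.

```python
def accuracy_and_recall(y_true, y_pred, positive="malignant"):
    # Accuracy: fraction of predictions that match the true labels.
    correct = sum(a == p for a, p in zip(y_true, y_pred))
    # Recall: fraction of truly positive instances that were predicted positive.
    detected = sum(a == positive and p == positive
                   for a, p in zip(y_true, y_pred))
    return correct / len(y_true), detected / y_true.count(positive)


# Hypothetical testing set: 5 malignant tumors followed by 95 benign tumors.
y_true = ["malignant"] * 5 + ["benign"] * 95

# Classifier A never predicts malignancy.
pred_a = ["benign"] * 100

# Classifier B flags all 5 malignant tumors and 15 of the benign ones.
pred_b = ["malignant"] * 20 + ["benign"] * 80

acc_a, rec_a = accuracy_and_recall(y_true, pred_a)
acc_b, rec_b = accuracy_and_recall(y_true, pred_b)
print(f"A: accuracy={acc_a:.2f}, recall={rec_a:.2f}")
print(f"B: accuracy={acc_b:.2f}, recall={rec_b:.2f}")
```

Classifier A scores higher on accuracy but misses every malignant tumor; classifier B gives up some accuracy to detect all of them.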

Many other performance measures for classification can be used. We will discuss more metrics, including metrics for multi-label classification problems, in later chapters. In the next chapter we will discuss some common performance measures for regression tasks. Performance on unsupervised tasks can also be assessed; we will discuss some performance measures for cluster analysis later in the book.
