scikit-learn Cookbook

By: Trent Hauck

Overview of this book

Python is quickly becoming the go-to language for analysts and data scientists due to its simplicity and flexibility, and within the Python data space, scikit-learn is the unequivocal choice for machine learning. Its consistent API and plethora of features help solve any machine learning problem it comes across.

The book starts by walking through different methods to prepare your data, be it a dataset with missing values or text columns that require the categories to be turned into indicator variables. After the data is ready, you'll learn different techniques aligned with different objectives, be it a dataset with known outcomes such as sales by state, or more complicated problems such as clustering similar customers. Finally, you'll learn how to polish your algorithm to ensure that it's both accurate and resilient to new datasets.

Credits

About the Author

About the Reviewers

www.PacktPub.com

Preface

Premodel Workflow

Introduction

Getting sample data from external sources

Creating sample data for toy analysis

Scaling data to the standard normal

Creating binary features through thresholding

Working with categorical variables

Binarizing label features

Imputing missing values through various strategies

Using Pipelines for multiple preprocessing steps

Reducing dimensionality with PCA

Using factor analysis for decomposition

Kernel PCA for nonlinear dimensionality reduction

Using truncated SVD to reduce dimensionality

Decomposition to classify with DictionaryLearning

Putting it all together with Pipelines

Using Gaussian processes for regression

Defining the Gaussian process object directly

Using stochastic gradient descent for regression

Working with Linear Models

Introduction

Fitting a line through data

Evaluating the linear regression model

Using ridge regression to overcome linear regression's shortfalls

Optimizing the ridge regression parameter

Using sparsity to regularize models

Taking a more fundamental approach to regularization with LARS

Using linear methods for classification – logistic regression

Directly applying Bayesian ridge regression

Using boosting to learn from errors

Building Models with Distance Metrics

Introduction

Using KMeans to cluster data

Optimizing the number of centroids

Assessing cluster correctness

Using MiniBatch KMeans to handle more data

Quantizing an image with KMeans clustering

Finding the closest objects in the feature space

Probabilistic clustering with Gaussian Mixture Models

Using KMeans for outlier detection

Using k-NN for regression

Classifying Data with scikit-learn

Introduction

Doing basic classifications with Decision Trees

Tuning a Decision Tree model

Using many Decision Trees – random forests

Tuning a random forest model

Classifying data with support vector machines

Generalizing with multiclass classification

Using LDA for classification

Working with QDA – a nonlinear LDA

Using Stochastic Gradient Descent for classification

Classifying documents with Naïve Bayes

Label propagation with semi-supervised learning

Postmodel Workflow

Introduction

K-fold cross validation

Automatic cross validation

Cross validation with ShuffleSplit

Stratified k-fold

Poor man's grid search

Brute force grid search

Using dummy estimators to compare results

Regression model evaluation

Feature selection

Feature selection on L1 norms

Persisting models with joblib

Index

Label propagation with semi-supervised learning

Label propagation is a semi-supervised technique that uses both labeled and unlabeled data to infer labels for the unlabeled data. Quite often, data that would benefit from a classification algorithm is difficult to label. For example, labeling might be so expensive that only a subset can be labeled manually in a cost-effective way. That said, there does seem to be slow but growing support for companies hiring taxonomists.
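To make the idea concrete, here is a minimal sketch of label propagation using scikit-learn's `LabelPropagation` estimator. The choice of the iris dataset, the 50 percent masking rate, and the random seed are assumptions made for illustration and are not taken from the text. scikit-learn's semi-supervised estimators treat a label of `-1` as "unlabeled".

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

# Load a small, fully labeled dataset (iris is an illustrative assumption).
X, y = load_iris(return_X_y=True)

# Hide roughly half of the labels; -1 marks a sample as unlabeled.
rng = np.random.default_rng(0)
unlabeled_mask = rng.random(y.shape[0]) < 0.5
y_partial = y.copy()
y_partial[unlabeled_mask] = -1

# Fit on all samples: the labeled points seed the similarity graph,
# and the unlabeled points receive labels from their neighbors.
model = LabelPropagation()
model.fit(X, y_partial)

# transduction_ holds the label assigned to every training sample.
inferred = model.transduction_[unlabeled_mask]
accuracy = (inferred == y[unlabeled_mask]).mean()
print(f"Accuracy on the hidden labels: {accuracy:.2f}")
```

Because the true labels were only hidden, not missing, the inferred labels can be checked against them. With genuinely unlabeled data that check is not available.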

Getting ready

Another problem area is censored data. You can imagine a case where the passage of time limits your ability to gather labeled data. Say, for instance, you took measurements of patients and gave them an experimental drug. In some cases, you can measure the outcome of the drug if it happens fast enough, but you might also want to predict the outcome in cases where the drug has a slower reaction time. The drug might cause a fatal reaction in some patients, and life-saving measures might need to be taken.
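The censoring scenario above can be turned into a partially labeled target vector. The sketch below uses entirely synthetic, hypothetical patient data: the names `outcome`, `days_to_outcome`, and `cutoff_days`, and the 30-day cutoff, are assumptions chosen only to show the mechanics. Outcomes that have not yet been observed at the cutoff are marked `-1`, the value scikit-learn's semi-supervised estimators treat as unlabeled.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients = 200

# Hypothetical data: a binary outcome per patient and the number of days
# until that outcome becomes observable.
outcome = rng.integers(0, 2, size=n_patients)
days_to_outcome = rng.exponential(scale=30.0, size=n_patients)

# Outcomes slower than the study cutoff are censored and stay unlabeled.
cutoff_days = 30.0
observed = days_to_outcome <= cutoff_days

# Keep the known outcomes and mark the censored ones with -1.
y_partial = np.where(observed, outcome, -1)
print(f"{observed.sum()} labeled, {(~observed).sum()} unlabeled")
```

A vector like `y_partial` can then be passed as the target when fitting a semi-supervised estimator alongside the patient measurements.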
