scikit-learn Cookbook

By : Trent Hauck

Overview of this book

Python is quickly becoming the go-to language for analysts and data scientists due to its simplicity and flexibility, and within the Python data space, scikit-learn is the unequivocal choice for machine learning. Its consistent API and plethora of features help solve any machine learning problem it comes across.

The book starts by walking through different methods to prepare your data, be it a dataset with missing values or text columns that require the categories to be turned into indicator variables. After the data is ready, you'll learn different techniques aligned with different objectives, be it a dataset with known outcomes such as sales by state, or more complicated problems such as clustering similar customers. Finally, you'll learn how to polish your algorithm to ensure that it's both accurate and resilient to new datasets.

Using stochastic gradient descent for regression

In this recipe, we'll get our first taste of stochastic gradient descent. We'll use it for regression here, and in a later recipe, we'll use it for classification.

Getting ready

Stochastic Gradient Descent (SGD) is often an unsung hero in machine learning. Underneath many algorithms, there is SGD doing the work. It's popular due to its simplicity and speed—these are both very good things to have when dealing with a lot of data.

The other nice thing about SGD is that, while it sits at the computational core of many ML algorithms, it also describes the learning process simply. At the end of the day, we apply some transformation to the data, and then we fit the model to the data by minimizing some loss function.

How to do it…

If SGD is good on large datasets, we should probably test it on a fairly large dataset:

>>> from sklearn import datasets
>>> X, y = datasets.make_regression(int(1e6))
>>> # Just in case the 1e6 throws you off.
>>> print("{:,}".format(int(1e6)))
1,000,000

It's probably worth gaining some intuition about the composition and size of the object. Thankfully, we're dealing with NumPy arrays, so we can just access nbytes. The built-in Python way to get an object's size is not reliable for NumPy arrays. This output may be system dependent, so you may not get the same results:

>>> print("{:,}".format(X.nbytes))
800,000,000

To get some human perspective, we can convert nbytes to megabytes. There are roughly 1 million bytes in an MB:

>>> X.nbytes / 1e6
800.0

So, the number of bytes per data point is:

>>> X.nbytes // (X.shape[0] * X.shape[1])
8

Well, isn't that tidy. It's fairly tangential to what we're trying to accomplish; however, it's worth knowing how to get the size of the objects you're dealing with.
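The size arithmetic above can be checked on a much smaller array. This is a minimal sketch, assuming float64 data; the names X_small and view are hypothetical and chosen only for illustration:

```python
import sys
import numpy as np

# A small stand-in for the million-row array (hypothetical sizes).
X_small = np.zeros((1000, 100), dtype=np.float64)

# nbytes is (number of elements) * (bytes per element).
print(X_small.nbytes == X_small.size * X_small.itemsize)
print(X_small.itemsize)  # bytes per float64 value

# A slice is a view that shares its parent's buffer, so sys.getsizeof
# counts only the small array header, while nbytes still reports the
# size of the data the view spans.
view = X_small[:500]
print(view.nbytes)
print(sys.getsizeof(view) < view.nbytes)
```

For this reason, nbytes is usually the more dependable number when estimating how much data an array represents.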

So, now that we have the data, we can simply fit an SGDRegressor model:

>>> import numpy as np
>>> from sklearn import linear_model
>>> sgd = linear_model.SGDRegressor()
>>> # Boolean mask that keeps roughly 75% of the rows for training.
>>> train = np.random.choice([True, False], size=len(y), p=[.75, .25])
>>> sgd.fit(X[train], y[train])
SGDRegressor(alpha=0.0001, epsilon=0.1, eta0=0.01, 
             fit_intercept=True, l1_ratio=0.15, 
             learning_rate='invscaling', loss='squared_loss', 
             n_iter=5, penalty='l2', power_t=0.25, random_state=None, 
             shuffle=False, verbose=0, warm_start=False)

So, we have another "beefy" object. The main thing to know now is that our loss function is squared_loss, the same squared-error loss that ordinary linear regression minimizes (recent scikit-learn releases name it squared_error). Also worth noting is the shuffle parameter, which controls whether the training data is shuffled between passes. This is useful if you want to break a potentially spurious correlation with the order of the rows. With fit_intercept, scikit-learn fits an intercept term, which is equivalent to including a column of ones. If you'd like to see more output during fitting, set verbose to 1.

We can then predict, as we previously have, using scikit-learn's consistent API.

You can see we actually got a really good fit. There is barely any variation and the histogram has a nice normal look.
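A minimal sketch of the predict step and of a residual histogram follows. It assumes a smaller dataset (10,000 rows, 20 features), fixed random seeds so that it runs quickly, and a current scikit-learn release. The names preds, residuals, counts, and edges are illustrative and not taken from the recipe:

```python
import numpy as np
from sklearn import datasets, linear_model

rng = np.random.RandomState(0)

# Smaller than the recipe's 1,000,000 rows so the sketch runs quickly.
X, y = datasets.make_regression(n_samples=10000, n_features=20,
                                random_state=0)

# Boolean mask: roughly 75% of the rows for training, the rest held out.
train = rng.choice([True, False], size=len(y), p=[.75, .25])

sgd = linear_model.SGDRegressor(random_state=0)
sgd.fit(X[train], y[train])

# Predict on the held-out rows and compute the residuals.
preds = sgd.predict(X[~train])
residuals = preds - y[~train]

# Bin the residuals; plotting counts against edges gives a histogram.
counts, edges = np.histogram(residuals, bins=20)

# Spread of the residuals relative to the spread of the target.
print(residuals.std() / y[~train].std())
```

A ratio close to zero means the residuals are small compared with the variation in the target, which is one way to put a number on a "really good fit".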

How it works…

Clearly, the fake dataset we used wasn't too bad, but you can imagine datasets with larger magnitudes. For example, if you worked on Wall Street, on any given day there might be two billion transactions on any given exchange. Now, imagine that you have a week's or a year's worth of data. Running in-core algorithms does not work with such huge volumes of data.
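One way to avoid holding everything in memory is to feed the model one chunk at a time. The sketch below is an assumption about how this could be done, not the recipe's own method: it uses SGDRegressor's partial_fit method on slices of an in-memory array to stand in for chunks read from disk, and chunk_size is a hypothetical value:

```python
import numpy as np
from sklearn import datasets, linear_model

X, y = datasets.make_regression(n_samples=20000, n_features=10,
                                random_state=0)

sgd = linear_model.SGDRegressor(random_state=0)

# Pretend each slice arrives from disk, so that only one chunk needs
# to be in memory at a time (here the full array exists only for the demo).
chunk_size = 1000
for start in range(0, len(y), chunk_size):
    stop = start + chunk_size
    # partial_fit updates the existing coefficients with this chunk only.
    sgd.partial_fit(X[start:stop], y[start:stop])

# R^2 of the incrementally trained model on the full dataset.
print(sgd.score(X, y))
```

In a real out-of-core setting, the slices would be replaced by whatever reader yields successive batches of rows.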

The reason this is normally difficult is that standard gradient descent requires us to calculate the gradient over the whole dataset at every step. The gradient has the standard definition from any third calculus course.

The gist of the algorithm is that at each step we compute the gradient of the objective function at the current coefficients, and then move the coefficients a small step, scaled by the learning rate, in the opposite direction.

In pseudo code, this might look like the following:

>>> while not_converged:
...     w = w - learning_rate*gradient(cost(w))

The relevant variables are as follows:

w: This is the coefficient vector.
learning_rate: This shows how big a step to take at each iteration. This might be important to tune if you aren't getting good convergence.
gradient: This is the vector of first partial derivatives of the cost with respect to each coefficient.
cost: This is the squared error for regression. We'll see later that this cost function can be adapted to work with classification tasks. This flexibility is one thing that makes SGD so useful.
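The pseudocode and variables above can be turned into a small runnable sketch of batch gradient descent. It assumes a mean squared error cost on synthetic, noise-free data; true_w, the learning rate of 0.1, and the stopping threshold are hypothetical values picked for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic data generated from known coefficients (hypothetical values).
n_samples, n_features = 500, 3
X = rng.normal(size=(n_samples, n_features))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w


def cost(w):
    """Mean squared error of the linear model X @ w."""
    return np.mean((X @ w - y) ** 2)


def gradient(w):
    """First partial derivatives of cost with respect to each coefficient.

    Every row of X is used, which is what makes each step expensive.
    """
    return 2.0 * X.T @ (X @ w - y) / n_samples


w = np.zeros(n_features)
learning_rate = 0.1
for step in range(1000):
    grad = gradient(w)
    if np.linalg.norm(grad) < 1e-8:  # simple convergence check
        break
    w = w - learning_rate * grad

print(np.round(w, 3), cost(w))
```

Note that gradient touches all 500 rows on every iteration; with millions of rows, that full pass per update is the bottleneck.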

This would not be so bad, except for the fact that the gradient function is expensive. As the vector of coefficients gets larger, calculating the gradient becomes very expensive. For each update step, we need to evaluate the contribution of every point in the data to the gradient, and only then update the weights.

Stochastic gradient descent works slightly differently; instead of using the whole dataset for each update, as batch gradient descent does, we update the parameters with each new data point. This data point is picked at random, hence the name stochastic gradient descent.
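The per-point update can be sketched in a few lines of NumPy. This is an illustration under stated assumptions (squared error on one randomly chosen row, a fixed learning rate of 0.01, synthetic noise-free data with hypothetical coefficients true_w), not scikit-learn's actual implementation:

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic data generated from known coefficients (hypothetical values).
n_samples, n_features = 5000, 3
X = rng.normal(size=(n_samples, n_features))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

w = np.zeros(n_features)
learning_rate = 0.01
for step in range(20000):
    i = rng.randint(n_samples)   # pick one data point at random
    error = X[i] @ w - y[i]      # prediction error on that point only
    # Gradient of (X[i] @ w - y[i]) ** 2 with respect to w.
    w = w - learning_rate * 2.0 * error * X[i]

print(np.round(w, 3))
```

Each update costs one row's worth of arithmetic regardless of how many rows the dataset has, which is why the approach scales to data that is too large for batch gradient descent.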
