Large Scale Machine Learning with Python

Large Scale Machine Learning with Python

By : Luca Massaron, Bastiaan Sjardin, Alberto Boschetti

Buy this Book

Large Scale Machine Learning with Python

By: Luca Massaron, Bastiaan Sjardin, Alberto Boschetti

Buy this Book

Overview of this book

Large Python machine learning projects involve new problems associated with specialized machine learning architectures and designs that many data scientists have yet to tackle. But finding algorithms and designing and building platforms that deal with large sets of data is a growing need. Data scientists have to manage and maintain increasingly complex data projects, and with the rise of big data comes an increasing demand for computational and algorithmic efficiency. Large Scale Machine Learning with Python uncovers a new wave of machine learning algorithms that meet scalability demands together with a high predictive accuracy. Dive into scalable machine learning and the three forms of scalability. Speed up algorithms that can be used on a desktop computer with tips on parallelization and memory allocation. Get to grips with new algorithms that are specifically designed for large projects and can handle bigger files, and learn about machine learning in big data environments. We will also cover the most effective machine learning techniques on a map reduce framework in Hadoop and Spark in Python.

Large Scale Machine Learning with Python

Credits

About the Authors

About the Reviewer

www.PacktPub.com

Preface

Free Chapter

First Steps to Scalability

Explaining scalability in detail

Python for large scale machine learning

Python packages

Summary

Scalable Learning in Scikit-learn

Out-of-core learning

Streaming data from sources

Stochastic learning

Feature management with data streams

Summary

Fast SVM Implementations

Datasets to experiment with on your own

Support Vector Machines

Feature selection by regularization

Including non-linearity in SGD

Hyperparameter tuning

Summary

Neural Networks and Deep Learning

The neural network architecture

Neural networks and regularization

Neural networks and hyperparameter optimization

Neural networks and decision boundaries

Deep learning at scale with H2O

Deep learning and unsupervised pretraining

Deep learning with theanets

Autoencoders and unsupervised learning

Summary

Deep Learning with TensorFlow

TensorFlow installation

Machine learning on TensorFlow with SkFlow

Keras and TensorFlow installation

Convolutional Neural Networks in TensorFlow through Keras

CNN's with an incremental approach

GPU Computing

Summary

Classification and Regression Trees at Scale

Bootstrap aggregation

Random forest and extremely randomized forest

Fast parameter optimization with randomized search

CART and boosting

XGBoost

Out-of-core CART with H2O

Summary

Unsupervised Learning at Scale

Unsupervised methods

Feature decomposition – PCA

PCA with H2O

Clustering – K-means

K-means with H2O

LDA

Summary

Distributed Environments – Hadoop and Spark

From a standalone machine to a bunch of nodes

Setting up the VM

The Hadoop ecosystem

Spark

Summary

Practical Machine Learning with Spark

Setting up the VM for this chapter

Sharing variables across cluster nodes

Data preprocessing in Spark

Machine learning with Spark

Summary

Introduction to GPUs and Theano

GPU computing

Theano – parallel computing on the GPU

Installing Theano

Index

Customer Reviews

5 star

4 star

3 star

2 star

1 star

Installing Theano

First, make sure that you install the development version from the Theano page. Note that if you do "$ pip install theano", you might end up with problems. Installing the development version from GitHub directly is a safer bet:

$ git clone git://github.com/Theano/Theano.git
$  pip install Theano

If you want to upgrade Theano, you can use the following command:

$ sudo pip install --upgrade theano

If you have questions and want to connect with the Theano community, you can refer to https://groups.google.com/forum/#!forum/theano-users.

That's it, we are ready to go!

To make sure that we set the directory path toward the Theano folder, we need to do the following:

#!/usr/bin/python
import cPickle as pickle
from six.moves import cPickle as pickle
import os

#set your path to the theano folder here
path = '/Users/Quandbee1/Desktop/pthw/Theano/'

Let's install all the packages that we need:

from theano import tensor
import theano.tensor as T
import theano.tensor.nnet as nnet
import numpy as np
import numpy

In order for Theano to work on the GPU (if you have an NVIDIA card + CUDA installed), we need to configure the Theano framework first.

Normally, NumPy and Theano use the double-precision floating-point format (float64). However, if we want to utilize the GPU for Theano, a 32-bit floating point is used. This means that we have to change the settings between 32- and 64-bits floating points depending on our needs. If you want to see which configuration is used by your system by default, type the following:

print(theano.config.floatX)
output: float64

You can to change your configuration to 32 bits for GPU computing as follows:

theano.config.floatX = 'float32'

Sometimes it is more practical to change the settings via the terminal.

For a 32-bit floating point, type as follows:

$ export THEANO_FLAGS=floatX=float32

For a 64-bit floating point, type as follows:

$ export THEANO_FLAGS=floatX=float64

If you want a certain setting attached to a specific Python script, you can do this:

$ THEANO_FLAGS=floatX=float32 python you_name_here.py

If you want to see which computational method your Theano system is using, type the following:

print(theano.config.device)

If you want to change all the settings, both bits floating point and computational method (GPU or CPU) of a specific piece of script, type as follows:

$ THEANO_FLAGS=device=gpu,floatX=float32 python your_script.py

This can be very handy for the testing and coding. You might not want to use the GPU all the time; sometimes it is better to use the CPU for the prototyping and sketching and run it on the GPU once your script is ready.

First, let's test if GPU works for your setup. You can skip this if you don't have an NVIDIA GPU card on your computer:

from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

Now that we know how to configure Theano, let's run through some simple examples to see how it works. Basically, every piece of Theano code is composed of the same structure:

The initialization part where the variables are declared in the class.
The compiling where the functions are formed.
The execution where the functions are applied to data types.

Let's use these principles in some basic examples of vector computations and mathematical expressions:

#Initialize a simple scalar 
x = T.dscalar()


fx = T.exp(T.tan(x**2)) #initialize the function we want to use.

type(fx)            #just to show you that fx is a theano variable type


#Compile create a tanh function
f = theano.function(inputs=[x], outputs=[fx])

#Execute the function on a number in this case

f(10)

As we mentioned before, we can use Theano for mathematical expressions. Look at this example where we use a powerful Theano feature called autodifferentiation, a feature that becomes highly useful for backpropagation:

fp = T.grad(fx, wrt=x)
fs= theano.function([x], fp)


fs(3)

output:] 4.59

Now that we understand the way in which we can use variables and functions, let's perform a simple logistic function:

#now we can apply this function to  matrices as well  
x = T.dmatrix('x')
s = 1 / (1 + T.exp(-x))
logistic = theano.function([x], s)
logistic([[2, 3], [.7, -2],[1.5,2.3]])

output:
array([[ 0.88079708,  0.95257413],
       [ 0.66818777,  0.11920292],
       [ 0.81757448,  0.90887704]])

We can clearly see that Theano provides faster methods of applying functions to data objects than would be possible with NumPy.

Large Scale Machine Learning with Python

By : Luca Massaron, Bastiaan Sjardin, Alberto Boschetti

Large Scale Machine Learning with Python

By: Luca Massaron, Bastiaan Sjardin, Alberto Boschetti

Overview of this book

Related Content you might be interested in

Current Title:

Large Scale Machine Learning with Python

Installing Theano