Large Scale Machine Learning with Python

By: Bastiaan Sjardin, Alberto Boschetti

Overview of this book

Large Python machine learning projects involve new problems associated with specialized machine learning architectures and designs that many data scientists have yet to tackle. Finding algorithms, and designing and building platforms that deal with large sets of data, is a growing need. Data scientists have to manage and maintain increasingly complex data projects, and with the rise of big data comes an increasing demand for computational and algorithmic efficiency. Large Scale Machine Learning with Python uncovers a new wave of machine learning algorithms that meet scalability demands while retaining high predictive accuracy. Dive into scalable machine learning and the three forms of scalability. Speed up algorithms that can be used on a desktop computer with tips on parallelization and memory allocation. Get to grips with new algorithms that are specifically designed for large projects and can handle bigger files, and learn about machine learning in big data environments. We will also cover the most effective machine learning techniques on a MapReduce framework in Hadoop and Spark, using Python.

Installing Theano

First, make sure that you install the development version from the Theano page. Note that running "$ pip install theano" might cause problems; installing the development version directly from GitHub is a safer bet:

$ git clone git://
$ pip install Theano

If you want to upgrade Theano, you can use the following command:

$ sudo pip install --upgrade theano

If you have questions and want to connect with the Theano community, you can refer to the theano-users forum.

That's it, we are ready to go!

To point our code at the Theano folder, we set the directory path as follows:

from six.moves import cPickle as pickle  #resolves to cPickle on Python 2 and pickle on Python 3
import os

#set your path to the theano folder here
path = '/Users/Quandbee1/Desktop/pthw/Theano/'

Let's import all the packages that we need:

import theano  #needed for theano.function and theano.config below
from theano import tensor
import theano.tensor as T
import theano.tensor.nnet as nnet
import numpy as np
import numpy

In order for Theano to work on the GPU (if you have an NVIDIA card + CUDA installed), we need to configure the Theano framework first.

Normally, NumPy and Theano use the double-precision floating-point format (float64). However, if we want to use the GPU with Theano, the 32-bit floating-point format (float32) is used. This means that we have to switch between 32-bit and 64-bit floating point depending on our needs. To see which configuration your system uses by default, type the following:

print(theano.config.floatX)

output: float64

You can change your configuration to 32 bits for GPU computing as follows:

theano.config.floatX = 'float32'
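
The practical cost of switching to float32 is precision. As a minimal sketch that needs neither Theano nor a GPU, the snippet below rounds a Python float (which is 64-bit) to 32-bit precision using the standard struct module. The helper name to_float32 is hypothetical and exists only for this illustration.

```python
import struct

def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float (hypothetical helper)."""
    return struct.unpack("f", struct.pack("f", value))[0]

x64 = 0.1                # stored as a 64-bit float
x32 = to_float32(0.1)    # same value after a round trip through 32 bits

print(repr(x64))         # 64-bit representation
print(repr(x32))         # 32-bit value, widened back to 64 bits for printing
print(abs(x64 - x32))    # rounding error introduced by float32
```

float32 keeps roughly 7 significant decimal digits, compared with roughly 16 for float64, so the difference only shows up in the trailing digits.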

Sometimes it is more practical to change the settings via the terminal.

For 32-bit floating point, type the following:

$ export THEANO_FLAGS=floatX=float32

For 64-bit floating point, type the following:

$ export THEANO_FLAGS=floatX=float64

If you want a setting to apply only to a specific Python script, prefix the command that runs the script with it:

$ THEANO_FLAGS=floatX=float32 python

If you want to see which computational device (CPU or GPU) your Theano system is using, type the following:

print(theano.config.device)

If you want to change both settings at once, the floating-point precision and the computational device (GPU or CPU), for a specific script, type the following:

$ THEANO_FLAGS=device=gpu,floatX=float32 python

This can be very handy for testing and coding. You might not want to use the GPU all the time; sometimes it is better to prototype and sketch on the CPU and switch to the GPU once your script is ready.
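
The THEANO_FLAGS value is a comma-separated list of key=value pairs, and it can also be set from inside Python as long as this happens before Theano is imported for the first time. The following is a minimal sketch of that idea; the helper parse_flags is hypothetical and only illustrates the format of the string, it is not part of Theano.

```python
import os

def parse_flags(flags):
    """Split a THEANO_FLAGS-style string into a dict (hypothetical helper)."""
    result = {}
    for item in flags.split(","):
        if not item.strip():
            continue  # ignore empty entries such as a trailing comma
        key, _, value = item.partition("=")
        result[key.strip()] = value.strip()
    return result

# Set the flags from Python; do this before the first "import theano"
os.environ["THEANO_FLAGS"] = "device=cpu,floatX=float32"

flags = parse_flags(os.environ["THEANO_FLAGS"])
print(flags)
```

Setting the variable in the script keeps the configuration next to the code that depends on it, at the cost of having to edit the script to switch between CPU and GPU.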

First, let's test whether the GPU works with your setup. You can skip this if your computer doesn't have an NVIDIA GPU card:

from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
t0 = time.time()
for i in range(iters):  #range works on both Python 2 and 3
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

Now that we know how to configure Theano, let's run through some simple examples to see how it works. Basically, every piece of Theano code follows the same structure:

  1. Initialization, where the symbolic variables are declared.

  2. Compilation, where the functions are formed.

  3. Execution, where the compiled functions are applied to data.
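
To make the three phases concrete without requiring a Theano installation, here is a plain-Python analogy. It is only a sketch of the declare, compile, execute pattern; the names Expr, variable, apply, and compile_function are hypothetical and do not reflect how Theano is implemented internally.

```python
import math

class Expr:
    """A deferred computation: nothing is evaluated until the expression is run."""
    def __init__(self, evaluate):
        self.evaluate = evaluate  # callable taking a dict of input values

    def __pow__(self, power):
        return Expr(lambda env: self.evaluate(env) ** power)

def variable(name):
    """Declare a named placeholder that has no value yet."""
    return Expr(lambda env: env[name])

def apply(func, expr):
    """Wrap a plain function so that it is applied lazily to an expression."""
    return Expr(lambda env: func(expr.evaluate(env)))

def compile_function(input_names, output):
    """Turn an expression into an ordinary callable."""
    def run(*values):
        return output.evaluate(dict(zip(input_names, values)))
    return run

# 1. Initialization: declare the variable and build the expression
x = variable("x")
fx = apply(math.exp, apply(math.tan, x ** 2))

# 2. Compilation: bind the expression to its inputs
f = compile_function(["x"], fx)

# 3. Execution: apply the compiled function to data
result = f(3.0)
print(type(fx).__name__)  # fx is an expression object, not a number
print(result)
```

The point of the separation is that the expression exists as an object before any data is seen, which is what allows a library to optimize or differentiate it ahead of execution.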

Let's use these principles in some basic examples of vector computations and mathematical expressions:

#Initialize a simple scalar 
x = T.dscalar()

fx = T.exp(T.tan(x**2)) #initialize the function we want to use.

type(fx)            #just to show you that fx is a theano variable type

#Compile the expression exp(tan(x**2)) into a callable function
f = theano.function(inputs=[x], outputs=[fx])

#Execute the function on a number in this case


As we mentioned before, we can use Theano for mathematical expressions. The following example uses a powerful Theano feature called automatic differentiation, which becomes highly useful for backpropagation:

fp = T.grad(fx, wrt=x)
fs = theano.function([x], fp)

fs(3)

output: 4.59
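
A derivative value of about 4.59 is consistent with evaluating the gradient at x = 3; that evaluation point is an assumption here. As a hedged cross-check that does not need Theano, the sketch below compares the chain-rule derivative of exp(tan(x**2)) with a central finite difference. The function names are illustrative only.

```python
import math

def fx(x):
    """The expression from the text: exp(tan(x**2))."""
    return math.exp(math.tan(x ** 2))

def dfx(x):
    """Chain rule: d/dx exp(tan(x^2)) = exp(tan(x^2)) * 2x / cos(x^2)^2."""
    return fx(x) * 2.0 * x / math.cos(x ** 2) ** 2

def central_difference(func, x, h=1e-6):
    """Numerical derivative, used as an independent check."""
    return (func(x + h) - func(x - h)) / (2.0 * h)

x0 = 3.0  # assumed evaluation point
analytic = dfx(x0)
numeric = central_difference(fx, x0)
print(analytic, numeric)
```

Both values come out at about 4.598, which agrees with the output shown above.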

Now that we understand how to use variables and functions, let's implement a simple logistic function:

#now we can apply this function to matrices as well
x = T.dmatrix('x')
s = 1 / (1 + T.exp(-x))
logistic = theano.function([x], s)
logistic([[2, 3], [.7, -2],[1.5,2.3]])

array([[ 0.88079708,  0.95257413],
       [ 0.66818777,  0.11920292],
       [ 0.81757448,  0.90887704]])
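
As a sanity check on the values above, the same element-wise formula can be evaluated in plain Python. This is a minimal sketch with an illustrative function name; it reproduces the numbers but says nothing about relative speed.

```python
import math

def logistic(rows):
    """Apply 1 / (1 + exp(-x)) to every entry of a nested list."""
    return [[1.0 / (1.0 + math.exp(-value)) for value in row] for row in rows]

result = logistic([[2, 3], [.7, -2], [1.5, 2.3]])
for row in result:
    print(["%.8f" % value for value in row])
```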

Theano compiles the symbolic expression once and then applies it element-wise to the whole matrix. For large inputs, and especially on a GPU, this can be faster than the equivalent NumPy code.