Building Machine Learning Systems with Python

Overview of this book

Machine learning, the field of building systems that learn from data, is exploding on the Web and elsewhere. Python is a wonderful language in which to develop machine learning applications. As a dynamic language, it allows for fast exploration and experimentation, and an increasing number of machine learning libraries are developed for Python. Building Machine Learning Systems with Python shows you exactly how to find patterns in raw data. The book starts by brushing up on your Python machine learning knowledge and introducing libraries, and then moves on to more serious projects on datasets, modeling, recommendations, improved recommendations, and sound and image processing. Using open-source tools and libraries, you will learn how to apply methods to text, images, and sounds. You will also learn how to evaluate, compare, and choose machine learning techniques. Written for Python programmers, Building Machine Learning Systems with Python teaches you how to use open-source libraries to solve real problems with machine learning. The book is based on real-world examples that you can build on. You will learn how to write programs that classify the quality of StackOverflow answers or decide whether a music file is Jazz or Metal. You will learn regression, which is demonstrated by recommending movies to users. Advanced topics such as topic modeling (finding a text's most important topics), basket analysis, and cloud computing are covered, as well as many other interesting aspects. Building Machine Learning Systems with Python will give you the tools and understanding required to build your own systems, tailored to solve your problems.

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Preface

Getting Started with Python Machine Learning

Machine learning and Python – the dream team

What the book will teach you (and what it will not)

What to do when you are stuck

Getting started

Our first (tiny) machine learning application

Summary

Learning How to Classify with Real-world Examples

The Iris dataset

Building more complex classifiers

A more complex dataset and a more complex classifier

Binary and multiclass classification

Summary

Clustering – Finding Related Posts

Measuring the relatedness of posts

Preprocessing – similarity measured as similar number of common words

Clustering

Solving our initial challenge

Tweaking the parameters

Summary

Topic Modeling

Latent Dirichlet allocation (LDA)

Comparing similarity in topic space

Choosing the number of topics

Summary

Classification – Detecting Poor Answers

Sketching our roadmap

Learning to classify classy answers

Fetching the data

Creating our first classifier

Deciding how to improve

Using logistic regression

Looking behind accuracy – precision and recall

Slimming the classifier

Ship it!

Summary

Classification II – Sentiment Analysis

Sketching our roadmap

Fetching the Twitter data

Introducing the Naive Bayes classifier

Creating our first classifier and tuning it

Cleaning tweets

Taking the word types into account

Summary

Regression – Recommendations

Predicting house prices with regression

Penalized regression

P greater than N scenarios

Summary

Regression – Recommendations Improved

Improved recommendations

Basket analysis

Summary

Classification III – Music Genre Classification

Sketching our roadmap

Fetching the music data

Looking at music

Using FFT to build our first classifier

Improving classification performance with Mel Frequency Cepstral Coefficients

Summary

Computer Vision – Pattern Recognition

Introducing image processing

Loading and displaying images

Classifying a harder dataset

Local feature representations

Summary

Dimensionality Reduction

Sketching our roadmap

Selecting features

Other feature selection methods

Feature extraction

Multidimensional scaling (MDS)

Summary

Big(ger) Data

Learning about big data

Using jug to break up your pipeline into tasks

Using Amazon Web Services (AWS)

Summary

Where to Learn More about Machine Learning

Online courses

Books

What was left out

Summary

Index

Getting started

Assuming that you have already installed Python (any version at least as recent as 2.7 should be fine), we need to install NumPy and SciPy for numerical operations as well as Matplotlib for visualization.

Introduction to NumPy, SciPy, and Matplotlib

Before we can talk about concrete machine learning algorithms, we have to talk about how best to store the data we will chew through. This is important because even the most advanced learning algorithm will not be of any help to us if it never finishes. This may be simply because accessing the data is too slow. Or maybe its representation forces the operating system to swap all day. Add to this that Python is an interpreted language (a highly optimized one, though) that is slow for many numerically heavy algorithms compared to C or Fortran. So we might ask why on earth so many scientists and companies are betting their fortune on Python, even in highly computation-intensive areas.

The answer is that in Python, it is very easy to offload number-crunching tasks to the lower layer in the form of a C or Fortran extension. That is exactly what NumPy and SciPy do (http://scipy.org/install.html). In this tandem, NumPy provides the support of highly optimized multidimensional arrays, which are the basic data structure of most state-of-the-art algorithms. SciPy uses those arrays to provide a set of fast numerical recipes. Finally, Matplotlib (http://matplotlib.org/) is probably the most convenient and feature-rich library to plot high-quality graphs using Python.

Installing Python

Luckily, for all the major operating systems, namely Windows, Mac, and Linux, there are targeted installers for NumPy, SciPy, and Matplotlib. If you are unsure about the installation process, you might want to install Enthought Python Distribution (https://www.enthought.com/products/epd_free.php) or Python(x,y) (http://code.google.com/p/pythonxy/wiki/Downloads), which come with all the earlier mentioned packages included.

Chewing data efficiently with NumPy and intelligently with SciPy

Let us quickly walk through some basic NumPy examples and then take a look at what SciPy provides on top of it. On the way, we will get our feet wet with plotting using the marvelous Matplotlib package.

You will find more interesting examples of what NumPy can offer at http://www.scipy.org/Tentative_NumPy_Tutorial.

You will also find the book NumPy Beginner's Guide - Second Edition, Ivan Idris, Packt Publishing very valuable. Additional tutorial style guides are at http://scipy-lectures.github.com; you may also visit the official SciPy tutorial at http://docs.scipy.org/doc/scipy/reference/tutorial.

In this book, we will use NumPy Version 1.6.2 and SciPy Version 0.11.0.

Learning NumPy

So let us import NumPy and play a bit with it. For that, we need to start the Python interactive shell.

>>> import numpy
>>> numpy.version.full_version
'1.6.2'

As we do not want to pollute our namespace, we certainly should not do the following:

>>> from numpy import *

Doing so would, for instance, let numpy.array shadow the array package that is included in standard Python. Instead, we will use the following convenient shortcut:

>>> import numpy as np
>>> a = np.array([0,1,2,3,4,5])
>>> a
array([0, 1, 2, 3, 4, 5])
>>> a.ndim
1
>>> a.shape
(6,)

We just created an array in a similar way to how we would create a list in Python. However, NumPy arrays have additional information about the shape. In this case, it is a one-dimensional array of six elements. No surprises so far.

We can now transform this array into a 2D matrix.

>>> b = a.reshape((3,2))
>>> b
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> b.ndim
2
>>> b.shape
(3, 2)
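As a small sketch of how reshape treats the shape argument (the variable names here are illustrative and not taken from the book): the new shape must account for exactly the same number of elements, and one dimension may be given as -1 so that NumPy infers it.

```python
import numpy as np

a = np.arange(6)                 # the six values 0..5

# Passing -1 lets NumPy infer that dimension from the total size.
two_cols = a.reshape((-1, 2))
print(two_cols.shape)            # (3, 2)

# A shape whose product does not match the element count is rejected.
try:
    a.reshape((4, 2))
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```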

Things get interesting when we realize just how much the NumPy package is optimized. For example, it avoids copies wherever possible.

>>> b[1][0]=77
>>> b
array([[ 0,  1],
       [77,  3],
       [ 4,  5]])
>>> a
array([ 0,  1, 77,  3,  4,  5])

In this case, we have modified the value 2 to 77 in b, and we can immediately see the same change reflected in a as well. Keep that in mind whenever you need a true copy.

>>> c = a.reshape((3,2)).copy()
>>> c
array([[ 0,  1],
       [77,  3],
       [ 4,  5]])
>>> c[0][0] = -99
>>> a
array([ 0,  1, 77,  3,  4,  5])
>>> c
array([[-99,   1],
       [ 77,   3],
       [  4,   5]])

Here, c and a are totally independent copies.
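If you are unsure whether two arrays share their data, you can ask NumPy directly. The following is a minimal sketch; note that np.shares_memory is an assumption about your environment, since it is only available in NumPy releases newer than the 1.6.2 used in this book, while the base attribute also exists in older versions.

```python
import numpy as np

a = np.arange(6)
view = a.reshape((3, 2))           # a view: reuses a's data buffer
copy = a.reshape((3, 2)).copy()    # a true copy: owns its own data

print(np.shares_memory(a, view))   # True
print(np.shares_memory(a, copy))   # False

# A view remembers the array it was derived from; a copy has no base.
print(view.base is a)              # True
print(copy.base is None)           # True
```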

Another big advantage of NumPy arrays is that the operations are propagated to the individual elements.

>>> a*2
array([  0,   2, 154,   6,   8,  10])
>>> a**2
array([   0,    1, 5929,    9,   16,   25])

Contrast that to ordinary Python lists:

>>> [1,2,3,4,5]*2
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> [1,2,3,4,5]**2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

Of course, by using NumPy arrays we sacrifice the agility Python lists offer. Simple operations like adding or removing elements are a bit more involved for NumPy arrays. Luckily, we have both at our disposal, and we will use the right one for the task at hand.
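To illustrate the difference, here is a short sketch (with made-up values) of adding and removing elements. A list grows in place, whereas np.append and np.delete each build and return a new array and leave the original untouched.

```python
import numpy as np

values = [0, 1, 2]
values.append(3)                  # the list itself grows in place

arr = np.array([0, 1, 2])
longer = np.append(arr, 3)        # returns a new array; arr is unchanged
shorter = np.delete(longer, 0)    # drops index 0, again as a new array

print(values)    # [0, 1, 2, 3]
print(arr)       # [0 1 2]
print(longer)    # [0 1 2 3]
print(shorter)   # [1 2 3]
```

Because every such call allocates a new array, growing an array element by element inside a loop is usually better done with a list that is converted to an array once at the end.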

Indexing

Part of the power of NumPy comes from the versatile ways in which its arrays can be accessed.

In addition to normal list indexing, it allows us to use arrays themselves as indices.

>>> a[np.array([2,3,4])]
array([77,  3,  4])

Because conditions are also propagated to the individual elements, we gain a very convenient way to access our data.

>>> a>4
array([False, False,  True, False, False,  True], dtype=bool)
>>> a[a>4]
array([77,  5])

This can also be used to trim outliers.

>>> a[a>4] = 4
>>> a
array([0, 1, 4, 3, 4, 4])

As this is a frequent use case, there is a special clip function for it, clipping the values at both ends of an interval with one function call as follows:

>>> a.clip(0,4)
array([0, 1, 4, 3, 4, 4])
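Since the array above had already been trimmed, the call shows no visible effect. The following sketch uses hypothetical data with outliers at both ends and compares clip with the equivalent boolean-indexing assignments.

```python
import numpy as np

data = np.array([-3, 0, 1, 77, 3, 4, 5])

# clip returns a new array with values limited to the interval [0, 4].
clipped = data.clip(0, 4)
print(clipped)                    # [0 0 1 4 3 4 4]

# The same result using boolean indexing on a copy.
manual = data.copy()
manual[manual < 0] = 0
manual[manual > 4] = 4
print((clipped == manual).all())  # True
```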

Handling non-existing values

The power of NumPy's indexing capabilities comes in handy when preprocessing data that we have just read in from a text file. It will most likely contain invalid values, which we will mark as not being a real number using numpy.NAN as follows:

>>> c = np.array([1, 2, np.NAN, 3, 4]) # let's pretend we have read this from a text file
>>> c
array([  1.,   2.,  nan,   3.,   4.])
>>> np.isnan(c)
array([False, False,  True, False, False], dtype=bool)
>>> c[~np.isnan(c)]
array([ 1.,  2.,  3.,  4.])
>>> np.mean(c[~np.isnan(c)])
2.5
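The masking is needed because a single NaN otherwise propagates through the whole calculation. The sketch below shows this, along with np.nanmean as a shortcut. It assumes a NumPy release newer than the 1.6.2 used in this book, where np.nanmean is available, and it uses the lowercase spelling np.nan, which refers to the same value as np.NAN.

```python
import numpy as np

c = np.array([1, 2, np.nan, 3, 4])

# Without masking, the NaN propagates into the result.
plain_mean = np.mean(c)
print(plain_mean)                 # nan

# Masking out the invalid entries, as in the text.
valid = ~np.isnan(c)
masked_mean = c[valid].mean()
print(masked_mean)                # 2.5

# np.nanmean ignores NaN entries directly.
nan_mean = np.nanmean(c)
print(nan_mean)                   # 2.5
```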

Comparing runtime behaviors

Let us compare the runtime behavior of NumPy with normal Python lists. In the following code, we will calculate the sum of the squares of all numbers from 0 to 999 and see how much time the calculation takes. We do it 10000 times and report the total time so that our measurement is accurate enough.

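import timeit

# Pure Python: a generator expression over xrange
# (xrange is Python 2; on Python 3, use range instead).
normal_py_sec = timeit.timeit('sum(x*x for x in xrange(1000))',
                              number=10000)
# NumPy only as data storage: the built-in sum() still visits each element from Python.
naive_np_sec = timeit.timeit('sum(na*na)',
                             setup="import numpy as np; na=np.arange(1000)",
                             number=10000)
# The whole computation runs inside NumPy's optimized extension code.
good_np_sec = timeit.timeit('na.dot(na)',
                            setup="import numpy as np; na=np.arange(1000)",
                            number=10000)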

print("Normal Python: %f sec"%normal_py_sec)
print("Naive NumPy: %f sec"%naive_np_sec)
print("Good NumPy: %f sec"%good_np_sec)

Normal Python: 1.157467 sec
Naive NumPy: 4.061293 sec
Good NumPy: 0.033419 sec

We make two interesting observations. First, just using NumPy as data storage (Naive NumPy) takes 3.5 times longer, which is surprising since we might expect it to be much faster, given that it is written as a C extension. One reason for this is that accessing individual elements from Python itself is rather costly. Second, only when we are able to apply algorithms inside the optimized extension code do we get speed improvements, and tremendous ones at that: using the dot() function of NumPy, we are roughly 35 times faster than the plain Python version. In summary, in every algorithm we are about to implement, we should always look at how we can move loops over individual elements from Python to some of the highly optimized NumPy or SciPy extension functions.
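The measurement can be repeated on a current setup with the following sketch. It is an adaptation rather than the book's exact code: it assumes Python 3 (range instead of xrange, and the globals argument of timeit.timeit), uses fewer repetitions to stay quick, and first checks that all three variants compute the same value. The timings you get will differ from the numbers quoted above.

```python
import timeit
import numpy as np

na = np.arange(1000)

# Three ways to compute the sum of squares of 0..999.
py_result = sum(x * x for x in range(1000))   # pure Python loop
naive_result = sum(na * na)                   # built-in sum over NumPy scalars
good_result = na.dot(na)                      # stays inside NumPy

# Time each variant; the statements see `na` through the globals mapping.
statements = [
    ("Normal Python", "sum(x * x for x in range(1000))"),
    ("Naive NumPy", "sum(na * na)"),
    ("Good NumPy", "na.dot(na)"),
]
timings = {}
for label, stmt in statements:
    timings[label] = timeit.timeit(stmt, globals={"na": na}, number=1000)
    print("%s: %f sec" % (label, timings[label]))
```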

However, the speed comes at a price. Using NumPy arrays, we no longer have the incredible flexibility of Python lists, which can hold basically anything. NumPy arrays always have only one datatype.

>>> a = np.array([1,2,3])
>>> a.dtype
dtype('int64')

If we try to use elements of different types, NumPy will do its best to coerce them to the most reasonable common datatype:

>>> np.array([1, "stringy"])
array(['1', 'stringy'], dtype='|S8')
>>> np.array([1, "stringy", set([1,2,3])])
array([1, stringy, set([1, 2, 3])], dtype=object)
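A more common case of coercion is mixing integers and floats. The sketch below (with illustrative values) shows the automatic upcast and two ways of choosing the datatype explicitly, namely the dtype argument and the astype method.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])          # integers are upcast to float
print(mixed.dtype)                     # float64

# Request a datatype explicitly when creating the array...
explicit = np.array([1, 2, 3], dtype=np.float32)
print(explicit.dtype)                  # float32

# ...or convert an existing array; astype returns a new array.
as_float = ints.astype(np.float64)
print(as_float.dtype)                  # float64
```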

Learning SciPy

On top of the efficient data structures of NumPy, SciPy offers a multitude of algorithms working on those arrays. Whatever numerically heavy algorithm you take from current books on numerical recipes, you will most likely find support for it in SciPy in one way or another. Whether it is matrix manipulation, linear algebra, optimization, clustering, spatial operations, or even fast Fourier transforms, the toolbox is readily filled. Therefore, it is a good habit to always inspect the scipy module before you start implementing a numerical algorithm.

For convenience, in the SciPy version used in this book, the complete namespace of NumPy is also accessible via SciPy. So, from now on, we will use NumPy's machinery via the SciPy namespace. You can check this easily by comparing the function references of any base function; for example:

>>> import scipy, numpy
>>> scipy.version.full_version
'0.11.0'
>>> scipy.dot is numpy.dot
True

The diverse algorithms are grouped into the following toolboxes:

SciPy package	Functionality
`cluster`	Hierarchical clustering (`cluster.hierarchy`); vector quantization / K-Means (`cluster.vq`)
`constants`	Physical and mathematical constants; conversion methods
`fftpack`	Discrete Fourier transform algorithms
`integrate`	Integration routines
`interpolate`	Interpolation (linear, cubic, and so on)
`io`	Data input and output
`linalg`	Linear algebra routines using the optimized BLAS and LAPACK libraries
`maxentropy`	Functions for fitting maximum entropy models
`ndimage`	n-dimensional image package
`odr`	Orthogonal distance regression
`optimize`	Optimization (finding minima and roots)
`signal`	Signal processing
`sparse`	Sparse matrices
`spatial`	Spatial data structures and algorithms
`special`	Special mathematical functions such as Bessel or Jacobi functions
`stats`	Statistics toolkit

The toolboxes most interesting to our endeavor are scipy.stats, scipy.interpolate, scipy.cluster, and scipy.signal. For the sake of brevity, we will briefly explore some features of the stats package and leave the others to be explained when they show up in the chapters.
