Building Machine Learning Systems with Python - Second Edition

By: Luis Pedro Coelho, Willi Richert

Overview of this book

Using machine learning to gain deeper insights from data is a key skill for modern application developers and analysts alike. Python is a wonderful language for developing machine learning applications. As a dynamic language, it allows for fast exploration and experimentation, and its excellent collection of open source machine learning libraries lets you focus on the task at hand while quickly trying out many ideas.

This book shows you how to find patterns in your raw data. You will start by brushing up on your Python machine learning knowledge and getting to know the libraries. You’ll quickly get to grips with serious, real-world projects on datasets, building models and creating recommendation systems. Later on, the book covers advanced topics such as topic modeling, basket analysis, and cloud computing. These will extend your abilities and enable you to create large, complex systems.

With this book, you gain the tools and understanding required to build your own systems, tailored to solve your real-world data analysis problems.

Building Machine Learning Systems with Python Second Edition

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Preface

Getting Started with Python Machine Learning

Machine learning and Python – a dream team

What the book will teach you (and what it will not)

What to do when you are stuck

Getting started

Our first (tiny) application of machine learning

Classifying with Real-world Examples

The Iris dataset

Building more complex classifiers

A more complex dataset and a more complex classifier

Classifying with scikit-learn

Binary and multiclass classification

Clustering – Finding Related Posts

Measuring the relatedness of posts

Preprocessing – similarity measured as a similar number of common words

Solving our initial challenge

Tweaking the parameters

Topic Modeling

Latent Dirichlet allocation

Comparing documents by topics

Choosing the number of topics

Classification – Detecting Poor Answers

Sketching our roadmap

Learning to classify classy answers

Fetching the data

Creating our first classifier

Deciding how to improve

Using logistic regression

Looking behind accuracy – precision and recall

Slimming the classifier

Classification II – Sentiment Analysis

Sketching our roadmap

Fetching the Twitter data

Introducing the Naïve Bayes classifier

Creating our first classifier and tuning it

Cleaning tweets

Taking the word types into account

Regression

Predicting house prices with regression

Penalized or regularized regression

Recommendations

Rating predictions and recommendations

Basket analysis

Classification – Music Genre Classification

Sketching our roadmap

Fetching the music data

Looking at music

Using FFT to build our first classifier

Improving classification performance with Mel Frequency Cepstral Coefficients

Computer Vision

Introducing image processing

Local feature representations

Dimensionality Reduction

Sketching our roadmap

Selecting features

Feature extraction

Multidimensional scaling

Bigger Data

Learning about big data

Using Amazon Web Services

Where to Learn More Machine Learning

All that was left out

Index

Index

A

AcceptedAnswerId / Preselection and processing of attributes
access key
- about / Using Amazon Web Services
add-one smoothing / Accounting for unseen words and other oddities
additive smoothing / Accounting for unseen words and other oddities
Amazon
- URL / Using Amazon Web Services
Amazon Web Services (AWS)
- about / Using Amazon Web Services
- using / Using Amazon Web Services
- accessing / Using Amazon Web Services
- virtual machines, creating / Creating your first virtual machines
- cluster generation, automating with StarCluster / Automating the generation of clusters with StarCluster
Anaconda Python distribution
- reference link / Installing Python
area under curve (AUC) / Looking behind accuracy – precision and recall
argmax
- about / Using Naïve Bayes to classify
Associated Press (AP) / Building a topic model
association rules
- about / Association rule mining
Auditory Filterbank Temporal Envelope (AFTE) / Improving classification performance with Mel Frequency Cepstral Coefficients
Automatic Music Genre Classification (AMGC) / Improving classification performance with Mel Frequency Cepstral Coefficients
AvgSentLen
- about / Designing more features
AvgWordLen
- about / Designing more features

B

bag-of-words approach
- raw text, converting into bag of words / Converting raw text into a bag of words
- words, counting / Counting words
- word count vectors, normalizing / Normalizing word count vectors
- less important words, removing / Removing less important words
- stemming / Stemming
- stop words on steroids / Stop words on steroids
- drawbacks / Our achievements and goals
bag of words model / Local feature representations
BaseEstimator
- about / Our first estimator
basket analysis
- about / Basket analysis
- useful predictions, obtaining / Obtaining useful predictions
- supermarket shopping baskets, analyzing / Analyzing supermarket shopping baskets
- association rule mining / Association rule mining
- advanced baskets analysis / More advanced basket analysis
BernoulliNB
- about / Creating our first classifier and tuning it
big data
- about / Learning about big data
- pipeline, breaking into tasks with jug / Using jug to break up your pipeline into tasks
- tasks, introducing in jug / An introduction to tasks in jug
- functioning, of jug / Looking under the hood
- jug, using for data analysis / Using jug for data analysis
- partial results, reusing / Reusing partial results
binary classification
- about / Binary and multiclass classification
blogs, machine learning
- reference links / Blogs
Body attribute / Preselection and processing of attributes

C

classes / Learning to classify classy answers
classification model
- building / Building our first classification model
- data, holding / Evaluation – holding out data and cross-validation
- cross-validation / Evaluation – holding out data and cross-validation
- structure / Building more complex classifiers
- search procedure / Building more complex classifiers
- gain or loss function / Building more complex classifiers
classifier / Tuning the classifier
- roadmap, sketching / Sketching our roadmap
- classy answers, classifying / Learning to classify classy answers
- data instance, tuning / Tuning the instance
- tuning / Tuning the classifier
- data, fetching / Fetching the data
- creating / Creating our first classifier
- kNN, starting with / Starting with kNN
- features, engineering / Engineering the features
- training / Training the classifier
- performance, measuring / Measuring the classifier's performance
- features, designing / Designing more features
- logistic regression, using / Using logistic regression
- precision, measuring / Looking behind accuracy – precision and recall
- recall, measuring / Looking behind accuracy – precision and recall
- slimming / Slimming the classifier
- serializing / Ship it!
- building, with FFT / Using FFT to build our first classifier
- experimentation agility, increasing / Increasing experimentation agility
- logistic regression classifier, using / Training the classifier
- confusion matrix, using / Using a confusion matrix to measure accuracy in multiclass problems
- performance, measuring with Receiver-Operator Characteristic (ROC) / An alternative way to measure classifier performance using receiver-operator characteristics
- performance, improving with Mel Frequency Cepstrum (MFC) / Improving classification performance with Mel Frequency Cepstral Coefficients
clustering
- about / Clustering
- hierarchical clustering / Clustering
- k-means / K-means
- testing / Getting test data to evaluate our ideas on
- posts / Clustering posts
clustering approaches
- reference link / Clustering
coefficient of determination
- about / Predicting house prices with regression
CommentCount / Preselection and processing of attributes
compactness / Features and feature engineering
complex classifier
- nearest neighbor classifier / Nearest neighbor classification
complex classifiers
- building / Building more complex classifiers
complex dataset
- about / A more complex dataset and a more complex classifier
- Seeds dataset / Learning about the Seeds dataset
- feature engineering / Features and feature engineering
computer vision
- image processing / Introducing image processing
- local feature representations / Local feature representations
Coursera
- URL / Online courses
CreationDate / Preselection and processing of attributes
cross-validation / Evaluation – holding out data and cross-validation
cross-validation schedule / Evaluation – holding out data and cross-validation
Cross Validated
- URL / What to do when you are stuck, Question and answer sites
- about / Question and answer sites

D

data, classifier
- fetching / Fetching the data
- slimming, to chewable chunks / Slimming the data down to chewable chunks
- attributes, preselecting / Preselection and processing of attributes
- training data, creating / Defining what is a good answer
data sources, machine learning
- about / Data sources
dimensionality reduction / Comparing documents by topics
- roadmap, sketching / Sketching our roadmap
- features, selecting / Selecting features
- feature extraction / Feature extraction
- multidimensional scaling / Multidimensional scaling
documents
- comparing by topics / Comparing documents by topics

E

Elastic Compute Cloud (EC2) service
- about / Using Amazon Web Services
ElasticNet model / L1 and L2 penalties
English-language Wikipedia model
- building / Modeling the whole of Wikipedia
ensemble learning / Combining multiple methods
Enthought Canopy
- reference link / Installing Python

F

F-measure / Tuning the classifier's parameters
feature engineering
- about / What the book will teach you (and what it will not), Features and feature engineering
feature extraction
- about / Feature extraction
- principal component analysis (PCA) / About principal component analysis
- PCA, sketching / Sketching PCA
- PCA, applying / Applying PCA
- PCA, limitations / Limitations of PCA and how LDA can help
- linear discriminant analysis (LDA) / Limitations of PCA and how LDA can help
features
- about / The Iris dataset
feature selection / Features and feature engineering
features selection
- about / Selecting features
- redundant features, detecting with filters / Detecting redundant features using filters
- correlation / Correlation
- mutual information / Mutual information
- model, features asking for / Asking the model about the features using wrappers
- methods / Other feature selection methods
FFT
- used, for building classifier / Using FFT to build our first classifier
first tiny application, machine learning
- about / Our first (tiny) application of machine learning
- data, reading in / Reading in the data
- data, preprocessing / Preprocessing and cleaning the data
- data, cleaning / Preprocessing and cleaning the data
- model, selecting / Choosing the right model and learning algorithm, Before building our first model…
- learning algorithm, selecting / Choosing the right model and learning algorithm
fit(document, y=None) method
- about / Our first estimator
free tier
- about / Using Amazon Web Services

G

GaussianNB
- about / Creating our first classifier and tuning it
get_feature_names() method
- about / Our first estimator
Grid Engine / Using jug to break up your pipeline into tasks
GridSearchCV
- about / Tuning the classifier's parameters

H

hierarchical clustering
- about / Clustering
hierarchical Dirichlet process (HDP) / Choosing the number of topics
house prices, predicting with regression
- about / Predicting house prices with regression
- multidimensional regression / Multidimensional regression
- cross-validation, for regression / Cross-validation for regression

I

image processing
- about / Introducing image processing
- images, loading / Loading and displaying images
- images, displaying / Loading and displaying images
- thresholding / Thresholding
- Gaussian blurring / Gaussian blurring
- center, putting in focus / Putting the center in focus
- basic image classification / Basic image classification
- features, computing from images / Computing features from images
- custom features, writing / Writing your own features
- features, used for finding similar images / Using features to find similar images
- harder dataset, classifying / Classifying a harder dataset
improvement, classifier
- steps / Deciding how to improve
- bias-variance / Bias-variance and their tradeoff
- high bias, fixing / Fixing high bias
- high variance, fixing / Fixing high variance
- high bias / High bias or low bias
- high variance problem, hinting / High bias or low bias
initial challenge
- solving / Solving our initial challenge
- impression of noise example / Another look at noise
instance / Creating your first virtual machines
International Society for Music Information Retrieval (ISMIR) / Improving classification performance with Mel Frequency Cepstral Coefficients
inverse document frequency (IDF; part of TF-IDF) / Stop words on steroids
Iris dataset
- about / The Iris dataset
- features / The Iris dataset
- visualization / Visualization is a good first step
- classification model, building / Building our first classification model

J

jug
- working / Looking under the hood
- using, for data analysis / Using jug for data analysis
- online documentation / Reusing partial results
- running, on cloud machine / Running jug on our cloud machine
jug cleanup
- about / Reusing partial results
jug invalidate
- about / Reusing partial results
jug status --cache
- about / Reusing partial results

K

k-means
- about / K-means
Kaggle
- URL / Machine learning and Python – a dream team, What to do when you are stuck, Getting competitive

L

labels / Learning to classify classy answers
Laplace smoothing / Accounting for unseen words and other oddities
Lasso / L1 and L2 penalties
latent Dirichlet allocation (LDA)
- about / Latent Dirichlet allocation
- Wikipedia URL / Latent Dirichlet allocation
- topic model, building / Building a topic model
lift
- about / Association rule mining
linear discriminant analysis (LDA) / Sketching our roadmap
- about / Limitations of PCA and how LDA can help
local feature representations
- about / Local feature representations
logistic regression
- about / Using logistic regression
- using / Using logistic regression
- example / A bit of math with a small example
- applying, to post classification problem / Applying logistic regression to our post classification problem
LSF (Load Sharing Facility) / Using jug to break up your pipeline into tasks

M

machine learning
- about / Machine learning and Python – a dream team
- first tiny application / Our first (tiny) application of machine learning
machine learning algorithm
- about / What the book will teach you (and what it will not)
Machine Learning Toolkit (Milk)
- URL / All that was left out
matplotlib
- URL / Introduction to NumPy, SciPy, and matplotlib
- about / Introduction to NumPy, SciPy, and matplotlib
matshow() function / Using a confusion matrix to measure accuracy in multiclass problems
MDP toolkit
- URL / All that was left out
Mel Frequency Cepstrum (MFC)
- used, for improving classification performance / Improving classification performance with Mel Frequency Cepstral Coefficients
MetaOptimize
- URL / What to do when you are stuck, Question and answer sites
- about / Question and answer sites
MLComp
- URL / Getting test data to evaluate our ideas on
model, first tiny application
- selecting / Before building our first model…
- straight line model / Starting with a simple straight line
- complex model / Towards some advanced stuff
- data, viewing / Stepping back to go forward – another look at our data
- training / Training and testing
- testing / Training and testing
- model function, calculating / Answering our initial question
mpmath
- URL / Accounting for arithmetic underflows
multiclass classification
- about / Binary and multiclass classification
multidimensional regression
- about / Multidimensional regression
- using / Multidimensional regression
multidimensional scaling (MDS) / Sketching our roadmap
- about / Multidimensional scaling
MultinomialNB
- about / Creating our first classifier and tuning it
MultinomialNB classifier / Tuning the classifier's parameters
music
- analyzing / Looking at music
- decomposing, into sine wave components / Decomposing music into sine wave components
music data
- fetching / Fetching the music data
- wave format, converting into / Converting into a WAV format

N

Natural Language Toolkit (NLTK) / Stemming
- installing / Installing and using NLTK
- URL / Installing and using NLTK
- vectorizer, extending with / Extending the vectorizer with NLTK's stemmer
Naïve Bayes
- about / Sketching our roadmap
Naïve Bayes classifier
- about / Introducing the Naïve Bayes classifier
- Bayes' theorem / Getting to know the Bayes' theorem
- working / Being naïve
- using, to classify / Using Naïve Bayes to classify
- unseen words, accounting for / Accounting for unseen words and other oddities
- arithmetic underflows, accounting for / Accounting for arithmetic underflows
- GaussianNB / Creating our first classifier and tuning it
- MultinomialNB / Creating our first classifier and tuning it
- BernoulliNB / Creating our first classifier and tuning it
- problem, solving / Solving an easy problem first
- classes, using / Using all classes
- parameters, tuning / Tuning the classifier's parameters
nearest neighbor classifier
- about / Nearest neighbor classification
neighborhood approach, recommendations
- about / A neighborhood approach to recommendations
NumAllCaps
- about / Designing more features
NumExclams
- about / Designing more features
NumPy
- about / Introduction to NumPy, SciPy, and matplotlib
- examples / Chewing data efficiently with NumPy and intelligently with SciPy
- reference link, for examples / Chewing data efficiently with NumPy and intelligently with SciPy
- learning / Learning NumPy
- indexing / Indexing
- nonexisting values, handling / Handling nonexisting values
- runtime, comparing / Comparing the runtime

O

one-dimensional regression
- about / Predicting house prices with regression
online course, machine learning
- URL / Online courses
Otsu / Thresholding
overfitting
- about / Towards some advanced stuff
OwnerUserId / Preselection and processing of attributes

P

parameters, clustering
- tweaking / Tweaking the parameters
Part Of Speech (POS) / Sketching our roadmap
Pattern
- URL / All that was left out
PBS (Portable Batch System) / Using jug to break up your pipeline into tasks
penalized regression
- about / Penalized or regularized regression
- L1 penalties / L1 and L2 penalties
- L2 penalties / L1 and L2 penalties
- Lasso, using in scikit-learn / Using Lasso or ElasticNet in scikit-learn
- ElasticNet, using in scikit-learn / Using Lasso or ElasticNet in scikit-learn
- Lasso path, visualizing / Visualizing the Lasso path
- P greater than N scenarios / P-greater-than-N scenarios
- example, text documents / An example based on text documents
- hyperparameters, setting in principled way / Setting hyperparameters in a principled way
Penn Treebank Project
- URL / Determining the word types
POS column
- about / Successfully cheating using SentiWordNet
POS tag abbreviations / Determining the word types
PostTypeId attribute / Preselection and processing of attributes
pre-processing phase
- achievements / Our achievements and goals
- goals / Our achievements and goals
precision-recall (P/R) / An alternative way to measure classifier performance using receiver-operator characteristics
precision_recall_curve() function / Looking behind accuracy – precision and recall
predictions, rating with regression
- about / Rating predictions and recommendations
- dataset, splitting into training and testing / Splitting into training and testing
- training data, normalizing / Normalizing the training data
preprocessing
- about / Preprocessing – similarity measured as a similar number of common words
principal component analysis (PCA) / Sketching our roadmap
- about / About principal component analysis
- properties / About principal component analysis
- sketching / Sketching PCA
- applying / Applying PCA
- limitations / Limitations of PCA and how LDA can help
PyBrain
- URL / All that was left out
Python
- installing / Installing Python
- reference link / Installing Python
Python packages
- installing, on Amazon Linux / Installing Python packages on Amazon Linux

Q

Q&A sites
- MetaOptimize / What to do when you are stuck
- Cross Validated / What to do when you are stuck
- Stack Overflow / What to do when you are stuck
- TwoToReal / What to do when you are stuck
- Kaggle / What to do when you are stuck

R

Receiver-Operator Characteristic (ROC)
- used, for measuring classifier performance / An alternative way to measure classifier performance using receiver-operator characteristics
- about / An alternative way to measure classifier performance using receiver-operator characteristics
recommendations
- neighborhood approach / A neighborhood approach to recommendations
- regression approach / A regression approach to recommendations
- multiple methods, combining / Combining multiple methods
regression
- cross-validation / Cross-validation for regression
- about / L1 and L2 penalties
regression approach, recommendations
- about / A regression approach to recommendations
- issues / A regression approach to recommendations
resources, machine learning
- online courses / Online courses
- books / Books
- question and answer sites / Question and answer sites
- blogs / Blogs
- data sources / Data sources
- competition / Getting competitive
Ridge Regression / L1 and L2 penalties
roadmap
- sketching / Sketching our roadmap
root mean square error (RMSE)
- about / Predicting house prices with regression
- advantage / Predicting house prices with regression
roundness / Features and feature engineering
running status / Creating your first virtual machines

S

save() function / Increasing experimentation agility
scikit-learn classification
- about / Classifying with scikit-learn
- decision boundaries, examining / Looking at the decision boundaries
scikit-learn module
- about / Classifying with scikit-learn
SciPy
- about / Introduction to NumPy, SciPy, and matplotlib
- URL / Introduction to NumPy, SciPy, and matplotlib
- learning / Learning SciPy
- toolboxes / Learning SciPy
secret key
- about / Using Amazon Web Services
Securities and Exchange Commission (SEC) / An example based on text documents
Seeds dataset
- about / Learning about the Seeds dataset
- features / Learning about the Seeds dataset
sentiment analysis
- roadmap, sketching / Sketching our roadmap
- Twitter data, fetching / Fetching the Twitter data
- Naïve Bayes classifier / Introducing the Naïve Bayes classifier
- first classifier, creating / Creating our first classifier and tuning it
- tweets, cleaning / Cleaning tweets
SentiWordNet
- URL / Successfully cheating using SentiWordNet
similarity measuring
- about / Measuring the relatedness of posts
- bag-of-words approach / How to do it
SoX
- URL / Converting into a WAV format
sparse
- about / L1 and L2 penalties
sparsity / Building a topic model
specgram function / Looking at music
Speeded Up Robust Features (SURF)
- about / Local feature representations
stacked learning / Combining multiple methods
Stack Overflow
- URL / What to do when you are stuck
StarCluster
- used, for automating cluster generation / Automating the generation of clusters with StarCluster
- about / Automating the generation of clusters with StarCluster
- URL / Automating the generation of clusters with StarCluster
stemming
- about / Stemming

T

Talkbox SciKit
- URL / Improving classification performance with Mel Frequency Cepstral Coefficients
task
- about / An introduction to tasks in jug
testing accuracy / Evaluation – holding out data and cross-validation
TfidfVectorizer parameter / Tuning the classifier's parameters
thresholding
- about / Thresholding
TimeToAnswer / Engineering the features
Title attribute / Preselection and processing of attributes
toolboxes, SciPy
- cluster / Learning SciPy
- constants / Learning SciPy
- fftpack / Learning SciPy
- integrate / Learning SciPy
- interpolate / Learning SciPy
- io / Learning SciPy
- linalg / Learning SciPy
- ndimage / Learning SciPy
- odr / Learning SciPy
- optimize / Learning SciPy
- signal / Learning SciPy
- sparse / Learning SciPy
- spatial / Learning SciPy
- special / Learning SciPy
- stats / Learning SciPy
topics
- documents comparing by / Comparing documents by topics
- number of topics, selecting / Choosing the number of topics
training accuracy / Evaluation – holding out data and cross-validation
train_model() function
- about / Solving an easy problem first
transform(documents) method
- about / Our first estimator
tweets
- cleaning / Cleaning tweets
Twitter data
- fetching / Fetching the Twitter data
two levels of cross-validation / Setting hyperparameters in a principled way
TwoToReal
- URL / What to do when you are stuck, Question and answer sites

U

underfitting
- about / Stepping back to go forward – another look at our data

V

ViewCount / Preselection and processing of attributes
virtual machines, Amazon Web Services (AWS)
- creating / Creating your first virtual machines
- Python packages, installing on Amazon Linux / Installing Python packages on Amazon Linux
- jug, running on cloud machine / Running jug on our cloud machine
visual words / Local feature representations

W

Wikipedia dump
- URL / Modeling the whole of Wikipedia
word types
- about / Taking the word types into account
- determining / Determining the word types
- estimator / Our first estimator
- implementing / Putting everything together