Machine Learning with R - Second Edition

By: Brett Lantz

Overview of this book

Machine Learning with R Second Edition

Credits

About the Author

About the Reviewers

www.PacktPub.com

Preface

Introducing Machine Learning

The origins of machine learning

Uses and abuses of machine learning

How machines learn

Machine learning in practice

Machine learning with R

Managing and Understanding Data

R data structures

Managing data with R

Exploring and understanding data

Lazy Learning – Classification Using Nearest Neighbors

Understanding nearest neighbor classification

Example – diagnosing breast cancer with the k-NN algorithm

Probabilistic Learning – Classification Using Naive Bayes

Understanding Naive Bayes

Example – filtering mobile phone spam with the Naive Bayes algorithm

Divide and Conquer – Classification Using Decision Trees and Rules

Understanding decision trees

Example – identifying risky bank loans using C5.0 decision trees

Understanding classification rules

Example – identifying poisonous mushrooms with rule learners

Forecasting Numeric Data – Regression Methods

Understanding regression

Example – predicting medical expenses using linear regression

Understanding regression trees and model trees

Example – estimating the quality of wines with regression trees and model trees

Black Box Methods – Neural Networks and Support Vector Machines

Understanding neural networks

Example – Modeling the strength of concrete with ANNs

Understanding Support Vector Machines

Example – performing OCR with SVMs

Finding Patterns – Market Basket Analysis Using Association Rules

Understanding association rules

Example – identifying frequently purchased groceries with association rules

Finding Groups of Data – Clustering with k-means

Understanding clustering

Example – finding teen market segments using k-means clustering

Evaluating Model Performance

Measuring performance for classification

Estimating future performance

Improving Model Performance

Tuning stock models for better performance

Improving model performance with meta-learning

Specialized Machine Learning Topics

Working with proprietary files and databases

Working with online data and services

Working with domain-specific data

Improving the performance of R

Index

Index

A

abstraction / Abstraction
activation function / From biological to artificial neurons
- about / Activation functions
- threshold activation function / Activation functions
- unit step activation function / Activation functions
- sigmoid activation function / Activation functions
AdaBoost
- about / Boosting
AdaBoost.M1 algorithm / Boosting
adaptive boosting
- about / Boosting the accuracy of decision trees, Boosting
allocation function / Understanding ensembles
Apache Hadoop
- about / Parallel cloud computing with MapReduce and Hadoop
Application Programming Interfaces (APIs)
- about / Parsing JSON from web APIs
Apriori
- property / The Apriori algorithm for association rule learning
Apriori algorithm
- for association rule learning / The Apriori algorithm for association rule learning
- strengths / The Apriori algorithm for association rule learning
Apriori principle
- used, for building set of rules / Building a set of rules with the Apriori principle
Artificial Neural Network (ANN)
- about / Understanding neural networks
association rules
- about / Understanding association rules
- potential applications / Understanding association rules
- rule interest, measuring / Measuring rule interest – support and confidence
- set of rules, building with Apriori principle / Building a set of rules with the Apriori principle
- frequently purchased groceries, identifying with / Example – identifying frequently purchased groceries with association rules
automated parameter tuning
- caret package used for / Using caret for automated parameter tuning
- requisites / Using caret for automated parameter tuning
axon
- about / From biological to artificial neurons

B

backpropagation
- neural networks, training with / Training neural networks with backpropagation
- about / Training neural networks with backpropagation
bag-of-words / Step 2 – exploring and preparing the data
bagging
- about / Bagging
bank loans example, with C5.0 decision trees
- data, collecting / Step 1 – collecting data
- data, exploring / Step 2 – exploring and preparing the data
- data, preparing / Step 2 – exploring and preparing the data
- random training datasets, creating / Data preparation – creating random training and test datasets
- test datasets, creating / Data preparation – creating random training and test datasets
- model, training on data / Step 3 – training a model on the data
- model performance, evaluating / Step 4 – evaluating model performance
- model performance, improving / Step 5 – improving model performance
Bayesian methods
- basic concepts / Basic concepts of Bayesian methods
Bayesian methods, basic concepts
- probability / Understanding probability
- joint probability / Understanding joint probability
- conditional probability / Computing conditional probability with Bayes' theorem
Beowulf cluster
- about / Working in parallel with multicore and snow
betweenness centrality
- about / Analyzing and visualizing network data
bias / The case of linearly separable data
bias-variance tradeoff / Choosing an appropriate k
biglm package
- regression models, building / Building bigger regression models with biglm
bigmemory package
- massive matrices, using with / Using massive matrices with bigmemory
- URL / Using massive matrices with bigmemory
bigrf package
- random forests, building / Growing bigger and faster random forests with bigrf
- URL / Growing bigger and faster random forests with bigrf
bimodal / Measuring the central tendency – the mode
binning
- about / Using numeric features with Naive Bayes
bins
- about / Using numeric features with Naive Bayes
Bioconductor
- about / Analyzing bioinformatics data
- URL / Analyzing bioinformatics data
bioinformatics
- about / Analyzing bioinformatics data
bioinformatics data
- analyzing / Analyzing bioinformatics data
bivariate relationships
- about / Exploring relationships between variables
blind tasting experience example / The k-NN algorithm
blowby / Simple linear regression
body mass index (BMI) / Step 1 – collecting data
boosting
- about / Boosting
bootstrap aggregating
- about / Bagging
bootstrap sampling / Bootstrap sampling
box-and-whiskers plot / Visualizing numeric variables – boxplots
branches
- about / Understanding decision trees
breast cancer
- diagnosing, with k-NN algorithm / Example – diagnosing breast cancer with the k-NN algorithm
breast cancer example
- data, collecting / Step 1 – collecting data
- data, exploring / Step 2 – exploring and preparing the data
- data, preparing / Step 2 – exploring and preparing the data
- model, training on data / Step 3 – training a model on the data
- model performance, evaluating / Step 4 – evaluating model performance
- model performance, improving / Step 5 – improving model performance

C

C5.0 algorithm
- about / The C5.0 decision tree algorithm
- split, selecting / Choosing the best split
- decision tree, pruning / Pruning the decision tree
caret package
- using, for automated parameter tuning / Using caret for automated parameter tuning
- URL / Using caret for automated parameter tuning, Training and evaluating models in parallel with caret
- used, for evaluating models in parallel / Training and evaluating models in parallel with caret
categorical / Types of input data
categorical variables
- about / Exploring categorical variables
- central tendency, measuring / Measuring the central tendency – the mode
cell body / From biological to artificial neurons
centroid / Using distance to assign and update clusters
characteristics, neural networks
- activation function / From biological to artificial neurons
- network topology / From biological to artificial neurons
- training algorithm / From biological to artificial neurons
classification / Types of machine learning algorithms
classification and regression training (caret package) / Beyond accuracy – other measures of performance
Classification and Regression Tree (CART) algorithm / Understanding regression trees and model trees
classification performance
- measuring / Measuring performance for classification
classification prediction data
- working with / Working with classification prediction data in R
classification rules
- about / Understanding classification rules
- separate and conquer / Separate and conquer
- 1R algorithm / The 1R algorithm
- RIPPER algorithm / The RIPPER algorithm
- obtaining, from decision trees / Rules from decision trees
class imbalance problem / Measuring performance for classification
clustering / Types of machine learning algorithms
- about / Understanding clustering
- as machine learning task / Clustering as a machine learning task
clustering, k-means clustering algorithm
- about / The k-means clustering algorithm
- distance, used for assigning cluster / Using distance to assign and update clusters
- distance, used for updating cluster / Using distance to assign and update clusters
- appropriate number of clusters, selecting / Choosing the appropriate number of clusters
column-major order / Matrixes and arrays
combination function / Understanding ensembles
Compute Unified Device Architecture (CUDA)
- about / GPU computing
Comprehensive R Archive Network (CRAN)
- about / Machine learning with R
- URL / Machine learning with R
concrete strength, modeling with ANNs
- about / Example – Modeling the strength of concrete with ANNs
- data, collecting / Step 1 – collecting data
- data, preparing / Step 2 – exploring and preparing the data
- data, exploring / Step 2 – exploring and preparing the data
- model, training on data / Step 3 – training a model on the data
- model performance, evaluating / Step 4 – evaluating model performance
- model performance, improving / Step 5 – improving model performance
conditional probability
- about / Computing conditional probability with Bayes' theorem
confusion matrix
- about / A closer look at confusion matrices
- used, for measuring performance / Using confusion matrices to measure performance
control object / Customizing the tuning process
convex hull / The case of linearly separable data
corpus / Data preparation – cleaning and standardizing text data
correlation
- about / Correlations
CRAN
- about / Improving the performance of R
- URL / Improving the performance of R
CRAN task view
- URL / Analyzing bioinformatics data
CRAN Web Technologies
- URL / Working with online data and services
cross-validation / Cross-validation
CSV (Comma-Separated Values) file
- about / Importing and saving data from CSV files
CSV files
- data, importing from / Importing and saving data from CSV files
curl utility
- about / Downloading the complete text of web pages
cut points
- about / Using numeric features with Naive Bayes

D

data
- managing, with R / Managing data with R
- importing, from CSV files / Importing and saving data from CSV files
data.table package
- using / Making data frames faster with data.table
- URL / Making data frames faster with data.table
Database Management Systems (DBMSs)
- about / Querying data in SQL databases
databases
- about / Working with proprietary files and databases
- data, querying in SQL databases / Querying data in SQL databases
data dictionary
- about / Exploring the structure of data
data exploration
- about / Exploring and understanding data
data frame
- about / Data frames
data mining
- about / The origins of machine learning
data munging
- about / Working with proprietary files and databases
data preparation, breast cancer example
- training datasets, creating / Data preparation – creating training and test datasets
- test datasets, creating / Data preparation – creating training and test datasets
Data Source Name (DSN)
- about / Querying data in SQL databases
data storage / Data storage
data structures, R
- about / R data structures
- vector / Vectors
- factor / Factors
- lists / Lists
- data frame / Data frames
- matrix / Matrixes and arrays
- array / Matrixes and arrays
- saving / Saving, loading, and removing R data structures
- loading / Saving, loading, and removing R data structures
- removing / Saving, loading, and removing R data structures
- exploring / Exploring the structure of data
data table
- about / Making data frames faster with data.table
data wrangling
- about / Working with proprietary files and databases
decision nodes
- about / Understanding decision trees
decision tree
- potential uses / Understanding decision trees
- about / Understanding decision trees, Example – identifying risky bank loans using C5.0 decision trees
- divide and conquer / Divide and conquer
- pruning / Pruning the decision tree
- used, for identifying risky bank loans / Example – identifying risky bank loans using C5.0 decision trees
- accuracy, boosting / Boosting the accuracy of decision trees
decision tree forests
- about / Random forests
decision trees
- classification rules, obtaining from / Rules from decision trees
deep learning
- about / The direction of information travel
Deep Neural Network (DNN)
- about / The direction of information travel
delimiter
- about / Importing and saving data from CSV files
dendrites
- about / From biological to artificial neurons
dependent events / Understanding joint probability
dependent variable
- about / Understanding regression
descriptive model / Types of machine learning algorithms
disk-based data frames
- creating, with ff package / Creating disk-based data frames with ff
divide and conquer
- about / Divide and conquer
domain-specific data
- working with / Working with domain-specific data
- bioinformatics data, analyzing / Analyzing bioinformatics data
- network data, analyzing / Analyzing and visualizing network data
- network data, visualizing / Analyzing and visualizing network data
doParallel package
- using / Taking advantage of parallel with foreach and doParallel
dplyr package
- used, for generalizing tabular data structures / Generalizing tabular data structures with dplyr
- URL / Generalizing tabular data structures with dplyr
dummy coding / Preparing data for use with k-NN, Step 3 – training a model on the data
dummy variable / Examining relationships – two-way cross-tabulations, Step 3 – training a model on the data

E

early stopping
- about / Pruning the decision tree
edgelist
- about / Analyzing and visualizing network data
elements
- about / Vectors
embarrassingly parallel problems
- about / Learning faster with parallel computing
ensemble methods
- bagging / Bagging
- boosting / Boosting
- random forests / Random forests
ensembles
- about / Understanding ensembles
- advantages / Understanding ensembles
entropy
- about / Choosing the best split
epoch
- about / Training neural networks with backpropagation
- forward phase / Training neural networks with backpropagation
- backward phase / Training neural networks with backpropagation
erosion / Simple linear regression
Euclidean norm / The case of linearly separable data
evaluation / Evaluation

F

10-fold cross-validation (10-fold CV) / Cross-validation
F-measure / The F-measure
F-score / The F-measure
F1 score / The F-measure
factor
- about / Factors
feedforward networks
- about / The direction of information travel
ffbase project
- URL / Creating disk-based data frames with ff
ff package
- used, for creating disk-based data frames / Creating disk-based data frames with ff
- URL / Creating disk-based data frames with ff
five-number summary / Measuring spread – quartiles and the five-number summary
foreach package
- using / Taking advantage of parallel with foreach and doParallel
frequently purchased groceries
- identifying, with association rules / Example – identifying frequently purchased groceries with association rules
future performance
- estimating / Estimating future performance
future performance estimation
- holdout method / The holdout method
- cross-validation / Cross-validation
- bootstrap sampling / Bootstrap sampling

G

Gaussian RBF kernel / Using kernels for non-linear spaces
generalization / Generalization
Generalized Linear Models (GLM) / Understanding regression
glyph / Step 1 – collecting data
GPU
- about / GPU computing
- computing / GPU computing
- URL / GPU computing
gradient descent / Training neural networks with backpropagation
Graph Modeling Language (GML)
- about / Analyzing and visualizing network data
greedy learners / What makes trees and rules greedy?
grid
- about / Learning faster with parallel computing

H

Hadoop
- using / Parallel cloud computing with MapReduce and Hadoop
- URL / Parallel cloud computing with MapReduce and Hadoop
harmonic mean / The F-measure
header line
- about / Importing and saving data from CSV files
histograms / Visualizing numeric variables – histograms
holdout method / The holdout method, Cross-validation
httr package
- URL / Downloading the complete text of web pages
hyperplane / Understanding Support Vector Machines
Hypertext Markup Language (HTML)
- about / Downloading the complete text of web pages

I

igraph package
- about / Analyzing and visualizing network data
- URL / Analyzing and visualizing network data
imputation / Data preparation – imputing the missing values
Incremental Reduced Error Pruning (IREP) algorithm / The RIPPER algorithm
independent events / Understanding joint probability
independent variables
- about / Understanding regression
information gain / Choosing the best split
input data
- types / Types of input data
- matching, to algorithms / Matching input data to algorithms
input nodes / The number of layers
instance-based learning
- about / Why is the k-NN algorithm lazy?
intercept
- about / Understanding regression
Interquartile Range (IQR) / Measuring spread – quartiles and the five-number summary
itemset
- about / Understanding association rules
Iterative Dichotomiser 3 (ID3) / The C5.0 decision tree algorithm

J

joint probability / Understanding joint probability
JSON
- parsing, from web APIs / Parsing JSON from web APIs
- about / Parsing JSON from web APIs
- URL / Parsing JSON from web APIs
jsonlite package
- URL / Parsing JSON from web APIs

K

k-fold cross-validation (or k-fold CV) / Cross-validation
k-means++ / Using distance to assign and update clusters
k-means clustering algorithm
- about / The k-means clustering algorithm
k-NN algorithm
- about / The k-NN algorithm
- weaknesses / The k-NN algorithm
- similarity, measuring with distance / Measuring similarity with distance
- appropriate k, selecting / Choosing an appropriate k
- data, preparing / Preparing data for use with k-NN
- lazy learning algorithm / Why is the k-NN algorithm lazy?
- used, for diagnosing breast cancer / Example – diagnosing breast cancer with the k-NN algorithm
kernels
- using, for non-linear spaces / Using kernels for non-linear spaces
kernel trick / Using kernels for non-linear spaces
kernlab
- reference / Step 3 – training a model on the data

L

Laplace estimator
- about / The Laplace estimator
large datasets
- managing / Managing very large datasets
- tabular data structures, generalizing with dplyr / Generalizing tabular data structures with dplyr
- data.table package, using / Making data frames faster with data.table
- disk-based data frames, creating with ff package / Creating disk-based data frames with ff
- massive matrices, using with bigmemory package / Using massive matrices with bigmemory
latitude / Using kernels for non-linear spaces
layers
- about / The number of layers
lazy learning algorithms / Why is the k-NN algorithm lazy?
leaf nodes
- about / Understanding decision trees
learning rate / Training neural networks with backpropagation
leave-one-out method / Cross-validation
left-hand side (LHS) / Understanding association rules
levels / Types of machine learning algorithms
LIBSVM
- URL / Step 3 – training a model on the data
likelihood
- about / Computing conditional probability with Bayes' theorem
linear kernel / Using kernels for non-linear spaces
link function / Understanding regression
lists / Lists
loess curve / Visualizing relationships among features – the scatterplot matrix
logistic regression
- about / Understanding regression
longitude / Using kernels for non-linear spaces

M

machine learning
- origins / The origins of machine learning
- about / The origins of machine learning
- abuses / Uses and abuses of machine learning
- uses / Uses and abuses of machine learning
- successes / Machine learning successes
- limitations / The limits of machine learning
- ethics / Machine learning ethics
- process / How machines learn
- with R / Machine learning with R
- R packages, installing / Installing R packages
- R packages, loading / Loading and unloading R packages
- R packages, unloading / Loading and unloading R packages
machine learning, in practice
- about / Machine learning in practice
- data collection / Machine learning in practice
- data exploration and preparation / Machine learning in practice
- model training / Machine learning in practice
- model evaluation / Machine learning in practice
- model improvement / Machine learning in practice
- input data, types / Types of input data
- algorithms, types / Types of machine learning algorithms
- input data, matching to algorithms / Matching input data to algorithms
machine learning, process
- about / How machines learn
- data storage / How machines learn, Data storage
- abstraction / How machines learn, Abstraction
- generalization / How machines learn, Generalization
- evaluation / How machines learn, Evaluation
machine learning algorithms
- types / Types of machine learning algorithms
magrittr package
- about / Scraping data from web pages
- URL / Scraping data from web pages
MapReduce
- about / Parallel cloud computing with MapReduce and Hadoop
- map step / Parallel cloud computing with MapReduce and Hadoop
- reduce step / Parallel cloud computing with MapReduce and Hadoop
marginal likelihood
- about / Computing conditional probability with Bayes' theorem
market basket analysis example
- data, collecting / Step 1 – collecting data
- data, preparing / Step 2 – exploring and preparing the data
- data, exploring / Step 2 – exploring and preparing the data
- sparse matrix, creating for transaction data / Data preparation – creating a sparse matrix for transaction data
- item support, visualizing / Visualizing item support – item frequency plots
- transaction data, visualizing / Visualizing the transaction data – plotting the sparse matrix
- model, training on data / Step 3 – training a model on the data
- model performance, evaluating / Step 4 – evaluating model performance
- model performance, improving / Step 5 – improving model performance
- set of association rules, sorting / Sorting the set of association rules
- subsets of association rules, taking / Taking subsets of association rules
- association rules, saving to file / Saving association rules to a file or data frame
- association rules, saving to data frame / Saving association rules to a file or data frame
matrix
- about / Matrixes and arrays
matrix notation / Multiple linear regression
maximum margin hyperplane (MMH) / Classification with hyperplanes
mean / Measuring the central tendency – mean and median
mean absolute error (MAE) / Measuring performance with the mean absolute error
medical expenses, predicting with linear regression
- about / Example – predicting medical expenses using linear regression
- data, collecting / Step 1 – collecting data
- data, preparing / Step 2 – exploring and preparing the data
- data, exploring / Step 2 – exploring and preparing the data
- correlation matrix / Exploring relationships among features – the correlation matrix
- relationships, visualizing among features / Visualizing relationships among features – the scatterplot matrix
- scatterplot matrix / Visualizing relationships among features – the scatterplot matrix
- model, training on data / Step 3 – training a model on the data
- model performance, evaluating / Step 4 – evaluating model performance
- model performance, improving / Step 5 – improving model performance, Model specification – adding non-linear relationships, Transformation – converting a numeric variable to a binary indicator, Model specification – adding interaction effects, Putting it all together – an improved regression model
message-passing interface (MPI)
- about / Working in parallel with multicore and snow
meta-learners / Types of machine learning algorithms
meta-learning methods
- used, for improving model performance / Improving model performance with meta-learning
- about / Improving model performance with meta-learning
min-max normalization / Preparing data for use with k-NN
mobile phone spam
- filtering, with Naive Bayes algorithm / Example – filtering mobile phone spam with the Naive Bayes algorithm
mobile phone spam example
- data, collecting / Step 1 – collecting data
- data collection, URL / Step 1 – collecting data
- data, preparing / Step 2 – exploring and preparing the data
- data, exploring / Step 2 – exploring and preparing the data
- text data, cleaning / Data preparation – cleaning and standardizing text data
- text data, standardizing / Data preparation – cleaning and standardizing text data
- text documents, splitting into words / Data preparation – splitting text documents into words
- training datasets, creating / Data preparation – creating training and test datasets
- test datasets, creating / Data preparation – creating training and test datasets
- text data, visualizing / Visualizing text data – word clouds
- indicator features, creating for frequent words / Data preparation – creating indicator features for frequent words
- model, training on data / Step 3 – training a model on the data
- model performance, evaluating / Step 4 – evaluating model performance
- model performance, improving / Step 5 – improving model performance
model performance
- improving, with meta-learning / Improving model performance with meta-learning
model performance, breast cancer example
- z-score standardization / Transformation – z-score standardization
- alternative values of k, testing / Testing alternative values of k
model trees / Understanding regression trees and model trees
multicore package
- using / Working in parallel with multicore and snow
multilayer network
- about / The number of layers
Multilayer Perceptron (MLP)
- about / The direction of information travel
multimodal / Measuring the central tendency – the mode
multinomial logistic regression / Understanding regression
multiple linear regression / Understanding regression
- about / Multiple linear regression
- weaknesses / Multiple linear regression
multiple R-squared value (coefficient of determination) / Step 4 – evaluating model performance
multivariate relationships
- about / Exploring relationships between variables

N

Naive Bayes algorithm
- about / Understanding Naive Bayes, The Naive Bayes algorithm
- classification / Classification with Naive Bayes
- Laplace estimator / The Laplace estimator
- numeric features, using with / Using numeric features with Naive Bayes
- used, for filtering mobile phone spam / Example – filtering mobile phone spam with the Naive Bayes algorithm
nearest neighbor classification
- about / Understanding nearest neighbor classification
network analysis
- about / Analyzing and visualizing network data
network data
- analyzing / Analyzing and visualizing network data
- visualizing / Analyzing and visualizing network data
network topology
- about / Network topology
- layers / The number of layers
- direction of information travel / The direction of information travel
- number of nodes in each layer / The number of nodes in each layer
neural networks
- about / Understanding neural networks
- biological, to artificial neurons / From biological to artificial neurons
- characteristics / From biological to artificial neurons
- training, with backpropagation / Training neural networks with backpropagation
neurons
- about / Understanding neural networks
nodes / Understanding neural networks
nominal / Types of input data
nominal variables
- about / Factors
non-linear spaces
- kernels, using for / Using kernels for non-linear spaces
normal distribution / Understanding numeric data – uniform and normal distributions
numeric / Types of input data
numeric data
- about / Understanding numeric data – uniform and normal distributions
- normalizing / Transformation – normalizing numeric data
numeric features
- using, with Naive Bayes / Using numeric features with Naive Bayes
numeric prediction / Types of machine learning algorithms
numeric variables
- about / Exploring numeric variables
- central tendency, measuring / Measuring the central tendency – mean and median
- spread, measuring / Measuring spread – quartiles and the five-number summary, Measuring spread – variance and standard deviation
- visualizing / Visualizing numeric variables – boxplots, Visualizing numeric variables – histograms

O

OCR, performing with SVMs
- about / Example – performing OCR with SVMs
- data, collecting / Step 1 – collecting data
- data, exploring / Step 2 – exploring and preparing the data
- data, preparing / Step 2 – exploring and preparing the data
- model, training on data / Step 3 – training a model on the data
- model performance, evaluating / Step 4 – evaluating model performance
- model performance, improving / Step 5 – improving model performance
one-way table / Exploring categorical variables
online data
- working with / Working with online data and services
- parsing / Working with online data and services
- complete text of web pages, downloading / Downloading the complete text of web pages
- parsing, within web pages / Scraping data from web pages
online services
- working with / Working with online data and services
Open Database Connectivity (ODBC)
- about / Querying data in SQL databases
optimized learning algorithms
- deploying / Deploying optimized learning algorithms
- regression models, building with biglm package / Building bigger regression models with biglm
- random forests, building with bigrf package / Growing bigger and faster random forests with bigrf
- models in parallel, evaluating with caret package / Training and evaluating models in parallel with caret
ordinal / Types of input data
ordinary least squares estimation
- about / Ordinary least squares estimation
out-of-bag error rate / Training random forests
overfitting / Evaluation

P

parallel cloud computing
- with MapReduce / Parallel cloud computing with MapReduce and Hadoop
- with Hadoop / Parallel cloud computing with MapReduce and Hadoop
parallel computing
- about / Learning faster with parallel computing
- execution time, measuring / Measuring execution time
- with multicore package / Working in parallel with multicore and snow
- with snow package / Working in parallel with multicore and snow
- with foreach package / Taking advantage of parallel with foreach and doParallel
- with doParallel package / Taking advantage of parallel with foreach and doParallel
parameter tuning
- about / Tuning stock models for better performance
pattern discovery / Types of machine learning algorithms
Pearson's correlation coefficient / Correlations
performance
- measuring, confusion matrices used / Using confusion matrices to measure performance
performance measures
- about / Beyond accuracy – other measures of performance
- kappa statistic / The kappa statistic
- sensitivity / Sensitivity and specificity
- specificity / Sensitivity and specificity
- precision / Precision and recall
performance trade-offs
- visualizing / Visualizing performance trade-offs
poisonous mushrooms
- identifying, with rule learners / Example – identifying poisonous mushrooms with rule learners
poisonous mushrooms example, with rule learners
- data, collecting / Step 1 – collecting data
- data, exploring / Step 2 – exploring and preparing the data
- data, preparing / Step 2 – exploring and preparing the data
- model, training on data / Step 3 – training a model on the data
- model performance, evaluating / Step 4 – evaluating model performance
- model performance, improving / Step 5 – improving model performance
Poisson regression
- about / Understanding regression
polynomial kernel / Using kernels for non-linear spaces
positive predictive value / Precision and recall
posterior probability
- about / Computing conditional probability with Bayes' theorem
post-pruning
- about / Pruning the decision tree
pre-pruning
- about / Pruning the decision tree
precision / Precision and recall
predictive model / Types of machine learning algorithms
prior probability
- about / Computing conditional probability with Bayes' theorem
probability
- about / Understanding probability
proprietary files
- about / Working with proprietary files and databases
- Microsoft Excel files, reading / Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files
- Microsoft Excel files, writing / Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files
- SAS files, writing / Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files
- SAS files, reading / Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files
- SPSS files, reading / Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files
- SPSS files, writing / Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files
- Stata files, writing / Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files
- Stata files, reading / Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files
proprietary microarray
- using / Analyzing bioinformatics data
pure / Choosing the best split
purity / Choosing the best split

Q

quadratic optimization / The case of linearly separable data
quantiles / Measuring spread – quartiles and the five-number summary

R

1R algorithm / The 1R algorithm
R
- about / Machine learning with R
- packages, installing / Installing R packages
- packages, loading / Loading and unloading R packages
- packages, unloading / Loading and unloading R packages
- data structures / R data structures
- used, for managing data / Managing data with R
- working with classification prediction data / Working with classification prediction data in R
R, performance improvement
- about / Improving the performance of R
- large datasets, managing / Managing very large datasets
- parallel computing / Learning faster with parallel computing
- GPU, computing / GPU computing
- optimized learning algorithms, deploying / Deploying optimized learning algorithms
R-squared value / Step 4 – evaluating model performance
Radial Basis Function (RBF) network
- about / Activation functions
random forests
- about / Random forests
- URL / Random forests
- strengths / Random forests
- training / Training random forests
- performance, evaluating / Evaluating random forest performance
- building, with bigrf package / Growing bigger and faster random forests with bigrf
RCurl
- URL / Downloading the complete text of web pages
area under the ROC curve (AUC) / ROC curves
Receiver Operating Characteristic (ROC) curve
- about / ROC curves
- creating / ROC curves
recurrent network
- about / The direction of information travel
recursive partitioning
- about / Divide and conquer
regression
- about / Understanding regression
- simple linear regression / Simple linear regression
- ordinary least squares estimation / Ordinary least squares estimation
- correlation / Correlations
- multiple linear regression / Multiple linear regression
- adding, to trees / Adding regression to trees
regression analysis
- use cases / Understanding regression
regression equations
- about / Understanding regression
regression models
- building, with biglm package / Building bigger regression models with biglm
regression trees
- about / Understanding regression trees and model trees
relationships
- exploring, between variables / Exploring relationships between variables
- visualizing / Visualizing relationships – scatterplots
- examining / Examining relationships – two-way cross-tabulations
Repeated Incremental Pruning to Produce Error Reduction (RIPPER) algorithm / The RIPPER algorithm
residuals / Ordinary least squares estimation
resubstitution error / Estimating future performance
Revolution Analytics
- URL / Taking advantage of parallel with foreach and doParallel
RHadoop
- URL / Parallel cloud computing with MapReduce and Hadoop
RHIPE package
- URL / Parallel cloud computing with MapReduce and Hadoop
rio package
- URL / Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files
- about / Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files
RIPPER algorithm
- about / The RIPPER algorithm
risky bank loans
- identifying, C5.0 decision trees used / Example – identifying risky bank loans using C5.0 decision trees
rote learning
- about / Why is the k-NN algorithm lazy?
rpart.plot
- URL / Visualizing decision trees
rudimentary ANNs / Understanding neural networks
rvest package
- about / Scraping data from web pages

S

scatterplot
- about / Visualizing relationships – scatterplots
scatterplot matrix (SPLOM) / Visualizing relationships among features – the scatterplot matrix
Scoville scale / Preparing data for use with k-NN
segmentation analysis / Types of machine learning algorithms
semi-supervised learning / Clustering as a machine learning task
separate and conquer
- about / Separate and conquer
sigmoid kernel / Using kernels for non-linear spaces
simple linear regression / Understanding regression
- about / Simple linear regression
simple tuned model
- creating / Creating a simple tuned model
slack variable / The case of nonlinearly separable data
slope
- about / Understanding regression
slope-intercept form
- about / Understanding regression
SMS Spam Collection
- URL / Step 1 – collecting data
snowball
- URL / Data preparation – cleaning and standardizing text data
snow package
- using / Working in parallel with multicore and snow
- URL / Working in parallel with multicore and snow
social networking service (SNS) / Example – finding teen market segments using k-means clustering
sparse matrix / Data preparation – splitting text documents into words, Data preparation – creating a sparse matrix for transaction data
SQL databases
- data, querying in / Querying data in SQL databases
squashing functions / Activation functions
stacking
- about / Understanding ensembles
standard deviation
- about / Measuring spread – variance and standard deviation
standard deviation reduction (SDR) / Adding regression to trees
statistical hypothesis testing / Understanding regression
stock models
- tuning, for better performance / Tuning stock models for better performance
Structured Query Language (SQL)
- about / Querying data in SQL databases
subtree raising / Pruning the decision tree
subtree replacement / Pruning the decision tree
summary statistics / Exploring numeric variables
supervised learning / Types of machine learning algorithms
Support Vector Machine (SVM)
- about / Understanding Support Vector Machines
- applications / Understanding Support Vector Machines
- classifications, with hyperplanes / Classification with hyperplanes
- case of linearly separable data / The case of linearly separable data
- case of nonlinearly separable data / The case of nonlinearly separable data
- OCR, performing with / Example – performing OCR with SVMs
/ Bagging
support vectors / Classification with hyperplanes
SVMlight
- about / Step 3 – training a model on the data
- URL / Step 3 – training a model on the data
synapse
- about / From biological to artificial neurons

T

Tab-Separated Value (TSV)
- about / Importing and saving data from CSV files
tabular
- about / Importing and saving data from CSV files
tabular data structures
- generalizing, with dplyr package / Generalizing tabular data structures with dplyr
teen market segments search, with k-means clustering
- about / Example – finding teen market segments using k-means clustering
- data, collecting / Step 1 – collecting data
- data, exploring / Step 2 – exploring and preparing the data
- data, preparing / Step 2 – exploring and preparing the data, Data preparation – dummy coding missing values, Data preparation – imputing the missing values
- model, training on data / Step 3 – training a model on the data
- model performance, evaluating / Step 4 – evaluating model performance
- model performance, improving / Step 5 – improving model performance
terminal nodes / Understanding decision trees
threshold activation function / Activation functions
training / Abstraction
trees
- regression, adding to / Adding regression to trees
tree structure
- about / Understanding decision trees
tuning process
- customizing / Customizing the tuning process
two-way cross-tabulation
- about / Examining relationships – two-way cross-tabulations

U

UCI Machine Learning Data Repository
- URL / Step 1 – collecting data, Step 1 – collecting data, Step 1 – collecting data
- about / Step 1 – collecting data
unimodal / Measuring the central tendency – the mode
unit of analysis / Types of input data
unit of observation / Types of input data
unit step activation function / Activation functions
univariate statistics
- about / Exploring relationships between variables
universal function approximator / The number of nodes in each layer
unsupervised learning / Types of machine learning algorithms

V

vector
- about / Vectors
vector types
- types / Vectors
Voronoi diagram / Using distance to assign and update clusters

W

web pages
- complete text, downloading / Downloading the complete text of web pages
- data, parsing / Scraping data from web pages
- XML documents, parsing / Parsing XML documents
- JSON, parsing from web APIs / Parsing JSON from web APIs
web scraping
- about / Scraping data from web pages
wine quality estimation, with regression trees
- about / Example – estimating the quality of wines with regression trees and model trees
- data, collecting / Step 1 – collecting data
- data, preparing / Step 2 – exploring and preparing the data
- data, exploring / Step 2 – exploring and preparing the data
- model, training on data / Step 3 – training a model on the data
- decision trees, visualizing / Visualizing decision trees
- model performance, evaluating / Step 4 – evaluating model performance
- performance, measuring with mean absolute error / Measuring performance with the mean absolute error
- model performance, improving / Step 5 – improving model performance
word cloud
- about / Visualizing text data – word clouds
wordcloud package
- URL / Visualizing text data – word clouds

X

xml2 GitHub
- URL / Parsing XML documents
XML documents
- parsing / Parsing XML documents
XML package
- about / Parsing XML documents
- URL / Parsing XML documents

Z

z-score / Preparing data for use with k-NN
z-score standardization / Preparing data for use with k-NN, Transformation – z-score standardization
ZeroR / The 1R algorithm