Machine Learning with R - Third Edition

By: Brett Lantz

Overview of this book

Machine learning, at its core, is concerned with transforming data into actionable knowledge. R offers a powerful set of machine learning methods to quickly and easily gain insight from your data. Machine Learning with R, Third Edition provides a hands-on, readable guide to applying machine learning to real-world problems. Whether you are an experienced R user or new to the language, Brett Lantz teaches you everything you need to uncover key insights, make new predictions, and visualize your findings. This third edition updates the classic R data science book to R 3.6 with newer and better libraries, advice on ethical and bias issues in machine learning, and an introduction to deep learning. Find powerful new insights in your data; discover machine learning with R.
Table of Contents (18 chapters)
Machine Learning with R - Third Edition
Contributors
Preface
Other Books You May Enjoy
Leave a review - let other readers know what you think
Index

Index

A

  • activation function / From biological to artificial neurons, Activation functions
  • AdaBoost / Boosting
  • AdaBoost.M1 algorithm / Boosting
  • adaptive boosting / Boosting the accuracy of decision trees, Boosting
  • adversarial learning / Types of machine learning algorithms
  • algorithms
    • input data, matching to / Matching input data to algorithms
  • allocation function
    • about / Understanding ensembles
  • Amazon Web Services (AWS) / Step 5 – improving model performance
  • ANNs, used for modeling concrete strength
    • about / Example – modeling the strength of concrete with ANNs
    • data collection / Step 1 – collecting data
    • data exploration / Step 2 – exploring and preparing the data
    • data preparation / Step 2 – exploring and preparing the data
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
  • antecedent
    • about / Understanding classification rules
  • Apache Hadoop
    • about / Parallel cloud computing with MapReduce and Hadoop
  • Apache Spark
    • parallel cloud computing / Parallel cloud computing with Apache Spark
  • application programming interfaces (API) / Parsing JSON from web APIs
  • Apriori / The Apriori algorithm for association rule learning
  • Apriori algorithm
    • for association rule learning / The Apriori algorithm for association rule learning
    • strengths / The Apriori algorithm for association rule learning
    • weaknesses / The Apriori algorithm for association rule learning
  • Apriori principle
    • set of rules, building / Building a set of rules with the Apriori principle
  • Apriori property / The Apriori algorithm for association rule learning
  • area under the ROC curve (AUC) / Visualizing performance tradeoffs with ROC curves
  • arrays
    • about / Matrices and arrays
  • artificial neural network (ANN)
    • about / Understanding neural networks
  • artificial neurons
    • about / From biological to artificial neurons
  • association rules
    • about / Understanding association rules
    • left-hand side (LHS) / Understanding association rules
    • right-hand side (RHS) / Understanding association rules
    • applications / Understanding association rules
    • rule interest, measuring / Measuring rule interest – support and confidence
  • automated parameter tuning
    • caret, using for / Using caret for automated parameter tuning
  • axis-parallel splits / Divide and conquer
  • axon / From biological to artificial neurons

B

  • 0.632 bootstrap / Bootstrap sampling
  • backpropagation
    • neural networks, training / Training neural networks with backpropagation
    • about / Training neural networks with backpropagation
  • bag-of-words / Step 2 – exploring and preparing the data
  • bagging / Bagging
  • Bayes' theorem
    • conditional probability, computing / Computing conditional probability with Bayes' theorem
  • Bayesian classifiers
    • uses / Understanding Naive Bayes
  • Bayesian methods
    • about / Understanding Naive Bayes
    • concepts / Basic concepts of Bayesian methods
  • Beowulf cluster
    • about / Working in parallel with multicore and snow
  • betweenness centrality / Analyzing and visualizing network data
  • bias-variance tradeoff / Choosing an appropriate k
  • big data / The origins of machine learning
  • biglm
    • bigger regression models, building / Building bigger regression models with biglm
  • bigmemory package
    • massive matrices, using with / Using massive matrices with bigmemory
    • reference / Using massive matrices with bigmemory
  • bigrf
    • massive random forests, growing / Growing massive random forests with bigrf
    • reference / Growing massive random forests with bigrf
  • bimodal / Measuring the central tendency – the mode
  • binning / Using numeric features with Naive Bayes
  • bins / Using numeric features with Naive Bayes
  • Bioconductor project
    • reference / Analyzing bioinformatics data
  • bioinformatics data
    • analyzing / Analyzing bioinformatics data
  • biological neurons
    • about / From biological to artificial neurons
  • bits / Choosing the best split
  • bivariate relationships / Exploring relationships between variables
  • body mass index (BMI) / Step 1 – collecting data
  • boosting / Boosting
  • bootstrap aggregating / Bagging
  • bootstrap sampling / Bootstrap sampling
  • box-and-whisker plot / Visualizing numeric variables – boxplots
  • boxplot
    • visualizing / Visualizing numeric variables – boxplots
  • branches
    • about / Understanding decision trees
  • breast cancer, diagnosing with the k-NN algorithm
    • about / Example – diagnosing breast cancer with the k-NN algorithm
    • data collection / Step 1 – collecting data
    • data exploration / Step 2 – exploring and preparing the data
    • data preparation / Step 2 – exploring and preparing the data
    • numeric data, normalizing / Transformation – normalizing numeric data
    • training dataset, creating / Data preparation – creating training and test datasets
    • test dataset, creating / Data preparation – creating training and test datasets
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
    • z-score standardization / Transformation – z-score standardization
    • alternative values of k, testing / Testing alternative values of k

C

  • C5.0 decision tree algorithm
    • about / The C5.0 decision tree algorithm
    • strengths / The C5.0 decision tree algorithm
    • weaknesses / The C5.0 decision tree algorithm
    • best split, selecting / Choosing the best split
  • C5.0 decision trees, used for identifying risky bank loans
    • about / Example – identifying risky bank loans using C5.0 decision trees
    • data collection / Step 1 – collecting data
    • data exploration / Step 2 – exploring and preparing the data
    • data preparation / Step 2 – exploring and preparing the data
    • training dataset, creating / Data preparation – creating random training and test datasets
    • test dataset, creating / Data preparation – creating random training and test datasets
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
  • caret package / Beyond accuracy – other measures of performance
    • using, for automated parameter tuning / Using caret for automated parameter tuning
    • models, training in parallel / Training and evaluating models in parallel with caret
    • models, evaluating in parallel / Training and evaluating models in parallel with caret
  • categorical data / Types of input data
  • categorical variables
    • exploring / Exploring categorical variables
  • cell body / From biological to artificial neurons
  • central processing unit (CPU) / Data storage
  • central tendency
    • measuring / Measuring the central tendency – mean and median, Measuring the central tendency – the mode
  • chi-squared statistic / Choosing the best split
  • class-conditional independence / Classification with Naive Bayes
  • classification / Types of machine learning algorithms
    • with Naive Bayes / Classification with Naive Bayes
    • performance, measuring for / Measuring performance for classification
  • classification, with hyperplanes
    • about / Classification with hyperplanes
    • case of linearly separable data / The case of linearly separable data
    • case of non-linearly separable data / The case of nonlinearly separable data
  • classification and regression tree (CART)
    • about / Understanding regression trees and model trees
  • classification rules
    • about / Understanding classification rules
    • separate and conquer / Separate and conquer
    • 1R algorithm / The 1R algorithm
    • RIPPER algorithm / The RIPPER algorithm
  • classifier
    • predictions / Understanding a classifier's predictions
  • class imbalance problem / Measuring performance for classification
  • clustering / Types of machine learning algorithms
    • about / Understanding clustering
    • as machine learning task / Clustering as a machine learning task
  • clusters
    • about / Understanding clustering
  • Cohen's kappa coefficient / The kappa statistic
  • combination function / Understanding ensembles
  • comma-separated values (CSV) / Importing and saving data from CSV files
  • complement / Understanding probability
  • complete text of web pages
    • downloading / Downloading the complete text of web pages
  • Compute Unified Device Architecture (CUDA)
    • about / GPU computing
  • Comprehensive R Archive Network (CRAN)
    • reference / Machine learning with R
  • conditional probability
    • computing, with Bayes' theorem / Computing conditional probability with Bayes' theorem
    • about / Computing conditional probability with Bayes' theorem
  • confusion matrix / Making some mistakes cost more than others
    • about / Measuring performance for classification, A closer look at confusion matrices
    • used, for measuring performance / Using confusion matrices to measure performance
  • consequent
    • about / Understanding classification rules
  • contingency table / Examining relationships – two-way cross-tabulations
  • control object / Customizing the tuning process
  • convex hull / The case of linearly separable data
  • corpus / Data preparation – cleaning and standardizing text data
  • correlation / Visualizing relationships – scatterplots
    • about / Correlations
  • correlation ellipse / Visualizing relationships among features – the scatterplot matrix
  • correlation matrix / Exploring relationships among features – the correlation matrix
  • cost matrix / Making some mistakes cost more than others
  • covariance function / Ordinary least squares estimation
  • covering algorithms
    • about / Separate and conquer
  • CRAN task view, for clustering
    • reference / The k-means clustering algorithm
  • CRAN Web Technologies and Services task view
    • reference / Working with online data and services
  • cross-validation / Cross-validation
  • crosstab / Examining relationships – two-way cross-tabulations
  • CSV files
    • data, importing from / Importing and saving data from CSV files
    • data, saving from / Importing and saving data from CSV files
  • Cubist algorithm / Step 5 – improving model performance
  • cut points / Using numeric features with Naive Bayes

D

  • data
    • managing, with R / Managing data with R
    • importing, from CSV files / Importing and saving data from CSV files
    • saving, from CSV files / Importing and saving data from CSV files
    • exploring / Exploring and understanding data
    • structure / Exploring the structure of data
    • querying, in SQL databases / Querying data in SQL databases
    • parsing, within web pages / Parsing the data within web pages
  • database backend
    • using, with dplyr / Using a database backend with dplyr
  • database connections
    • managing / The tidy approach to managing database connections
  • database management system (DBMS) / Querying data in SQL databases
  • data frames
    • about / Data frames
  • data mining
    • about / The origins of machine learning
  • data munging / Managing and preparing real-world data
  • data preparation
    • speeding up, with dplyr / Speeding and simplifying data preparation with dplyr
    • simplifying, with dplyr / Speeding and simplifying data preparation with dplyr
  • data source name (DSN) / The tidy approach to managing database connections
  • data structures, R
    • about / R data structures
    • vectors / Vectors
    • factors / Factors
    • lists / Lists
    • data frames / Data frames
    • matrices / Matrices and arrays
    • arrays / Matrices and arrays
    • saving / Saving, loading, and removing R data structures
    • loading / Saving, loading, and removing R data structures
    • removing / Saving, loading, and removing R data structures
  • data table
    • used, for making data frames faster / Making data frames faster with data.table
    • reference / Making data frames faster with data.table
  • data wrangling / Managing and preparing real-world data
  • deciles / Measuring spread – quartiles and the five-number summary
  • decision nodes
    • about / Understanding decision trees
  • decision tree
    • pruning / Pruning the decision tree
  • decision tree algorithms
    • benefits / Understanding decision trees
  • decision tree forests / Random forests
  • decision trees
    • about / Understanding decision trees
    • divide and conquer approach / Divide and conquer
    • accuracy, boosting of / Boosting the accuracy of decision trees
    • rules / Rules from decision trees
  • deep learning / The direction of information travel
    • with Keras / An interface for deep learning with Keras
  • deep neural network (DNN) / The direction of information travel
  • delimiter / Importing and saving data from CSV files
  • dendrites / From biological to artificial neurons
  • dependencies / Installing R packages
  • dependent events / Understanding joint probability
  • dependent variable / Visualizing relationships – scatterplots
    • about / Understanding regression
  • descriptive model / Types of machine learning algorithms
  • disk-based data frames
    • creating, with ff package / Creating disk-based data frames with ff
  • distance function / Measuring similarity with distance
  • divide and conquer approach
    • about / Divide and conquer
  • document-term matrix (DTM) / Data preparation – splitting text documents into words
  • domain-specific data
    • working with / Working with domain-specific data
  • doParallel package
    • about / Taking advantage of parallel with foreach and doParallel
  • dot product / Using kernels for nonlinear spaces
  • dplyr package
    • data preparation, speeding up / Speeding and simplifying data preparation with dplyr
    • data preparation, simplifying / Speeding and simplifying data preparation with dplyr
    • database backend, using with / Using a database backend with dplyr
  • dummy coding / Preparing data for use with k-NN

E

  • early stopping / Pruning the decision tree
  • edgelist / Analyzing and visualizing network data
  • edges / Analyzing and visualizing network data
  • elbow method / Choosing the appropriate number of clusters
  • elbow point / Choosing the appropriate number of clusters
  • ensemble methods
    • about / Understanding ensembles
    • bagging / Bagging
    • boosting / Boosting
    • random forests / Random forests
  • ensembles / Types of machine learning algorithms
    • about / Understanding ensembles
    • performance advantages / Understanding ensembles
  • entropy / Choosing the best split
  • epoch
    • about / Training neural networks with backpropagation
  • epoch, backpropagation algorithm
    • forward phase / Training neural networks with backpropagation
    • backward phase / Training neural networks with backpropagation
  • error rate / Using confusion matrices to measure performance
  • Euclidean distance / Measuring similarity with distance
  • Euclidean norm / The case of linearly separable data
  • event
    • about / Basic concepts of Bayesian methods
  • exhaustive event / Understanding probability
  • exploding gradient problem / Step 5 – improving model performance
  • external data files
    • reading / Reading and writing to external data files
    • writing to / Reading and writing to external data files

F

  • F-measure / The F-measure
  • F-score / The F-measure
  • factors
    • about / Factors
  • feedback network / The direction of information travel
  • feedforward networks / The direction of information travel
  • ffbase project
    • reference / Creating disk-based data frames with ff
  • ff package
    • disk-based data frames, creating / Creating disk-based data frames with ff
    • reference / Creating disk-based data frames with ff
  • five-number summary / Measuring spread – quartiles and the five-number summary
  • folds / Cross-validation
  • foreach package
    • about / Taking advantage of parallel with foreach and doParallel
  • frequency table / Computing conditional probability with Bayes' theorem
  • frequently purchased groceries, identifying with association rules
    • about / Example – identifying frequently purchased groceries with association rules
    • data collection / Step 1 – collecting data
    • data preparation / Step 2 – exploring and preparing the data
    • data exploration / Step 2 – exploring and preparing the data
    • sparse matrix, creating for transaction data / Data preparation – creating a sparse matrix for transaction data
    • item support, visualizing / Visualizing item support – item frequency plots
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
    • set of association rules, sorting / Sorting the set of association rules
    • subsets, taking of association rules / Taking subsets of association rules
    • association rules, saving to file/data frame / Saving association rules to a file or data frame
  • future performance
    • estimating / Estimating future performance

G

  • gain ratio / Choosing the best split
  • Gaussian RBF kernel / Using kernels for nonlinear spaces
  • General Data Protection Regulation (GDPR) / Machine learning ethics
  • generalized linear models (GLM)
    • about / Understanding regression
  • Gini index / Choosing the best split
  • glyph / Step 1 – collecting data
  • Google bombing / Machine learning ethics
  • GPU computing
    • about / GPU computing
  • gradient descent
    • about / Training neural networks with backpropagation
  • Graph Modeling Language (GML) / Analyzing and visualizing network data
  • greedy learners / What makes trees and rules greedy?

H

  • H2O Flow
    • about / A faster machine learning computing engine with H2O
  • H2O project
    • about / A faster machine learning computing engine with H2O
  • Hadoop
    • parallel cloud computing / Parallel cloud computing with MapReduce and Hadoop
  • harmonic mean / The F-measure
  • heuristics / Generalization
  • hidden layers / The number of layers
  • histograms
    • visualizing / Visualizing numeric variables – histograms
  • holdout method / The holdout method
  • httr
    • reference / Downloading the complete text of web pages
  • hyperplane / Understanding support vector machines
    • using, in classification / Classification with hyperplanes
  • Hypertext Markup Language (HTML) / Downloading the complete text of web pages
  • hypothesis testing / Understanding regression

I

  • igraph package
    • reference / Analyzing and visualizing network data
  • image processing / Example – performing OCR with SVMs
  • imputation / Data preparation – imputing the missing values
  • Incremental Reduced Error Pruning (IREP) algorithm / The RIPPER algorithm
  • independent events / Understanding joint probability
  • independent variables
    • about / Understanding regression
  • information gain / Choosing the best split
  • input data
    • matching, to algorithms / Matching input data to algorithms
  • instance-based learning / Why is the k-NN algorithm lazy?
  • intercept / Understanding regression
  • interquartile range (IQR) / Measuring spread – quartiles and the five-number summary
  • Interrater Reliability (irr) package / The kappa statistic
  • intersection / Understanding joint probability
  • item frequency plots / Visualizing item support – item frequency plots
  • itemset
    • about / Understanding association rules
  • Iterative Dichotomiser 3 (ID3) algorithm / The C5.0 decision tree algorithm

J

  • J48 / The C5.0 decision tree algorithm
  • Java
    • download link / Installing R packages
  • JavaScript Object Notation (JSON) / Parsing JSON from web APIs
  • joint probability
    • about / Understanding joint probability
  • JSON
    • parsing, from web APIs / Parsing JSON from web APIs
    • reference / Parsing JSON from web APIs
  • jsonlite package
    • reference / Parsing JSON from web APIs

K

  • k-fold cross-validation (k-fold CV) / Cross-validation
  • k-means++ algorithm / Using distance to assign and update clusters
  • k-means algorithm
    • about / The k-means clustering algorithm
    • strengths / The k-means clustering algorithm
    • weaknesses / The k-means clustering algorithm
    • distance, used for assigning clusters / Using distance to assign and update clusters
    • distance, used for updating clusters / Using distance to assign and update clusters
    • appropriate number of clusters, selecting / Choosing the appropriate number of clusters
  • k-means clustering, used for finding teen market segments
    • about / Finding teen market segments using k-means clustering
    • data collection / Step 1 – collecting data
    • data exploration / Step 2 – exploring and preparing the data
    • data preparation / Step 2 – exploring and preparing the data
    • missing values, dummy coding / Data preparation – dummy coding missing values
    • missing values, imputing / Data preparation – imputing the missing values
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
  • k-nearest neighbors (k-NN) algorithm / The k-means clustering algorithm
  • k-NN algorithm
    • strengths / The k-NN algorithm
    • weaknesses / The k-NN algorithm
    • about / The k-NN algorithm
    • example / The k-NN algorithm
    • similarity, measuring with distance / Measuring similarity with distance
    • appropriate k, selecting / Choosing an appropriate k
    • data, preparing for usage with / Preparing data for use with k-NN
    • lazy learning algorithm / Why is the k-NN algorithm lazy?
  • kappa statistic / The kappa statistic
  • Keras
    • reference / An interface for deep learning with Keras
    • deep learning / An interface for deep learning with Keras
  • kernels
    • using, for non-linear spaces / Using kernels for nonlinear spaces
  • kernel trick / Using kernels for nonlinear spaces
  • kernlab
    • reference / Step 3 – training a model on the data
  • knowledge representation / Abstraction

L

  • Laplace estimator / The Laplace estimator
  • leaf nodes
    • about / Understanding decision trees
  • learning rate
    • about / Training neural networks with backpropagation
  • leave-one-out method / Cross-validation
  • levels / Types of machine learning algorithms
  • libstemmer library / Data preparation – cleaning and standardizing text data
  • LIBSVM
    • reference / Step 3 – training a model on the data
  • likelihood / Computing conditional probability with Bayes' theorem
  • likelihood table / Computing conditional probability with Bayes' theorem
  • linear kernel / Using kernels for nonlinear spaces
  • link function
    • about / Understanding regression
  • links / Analyzing and visualizing network data
  • lists
    • about / Lists
  • LOESS curve / Visualizing relationships among features – the scatterplot matrix
  • logistic regression
    • about / Understanding regression

M

  • machine learning
    • origins / The origins of machine learning
    • successes / Uses and abuses of machine learning, Machine learning successes
    • limits / The limits of machine learning
    • ethics / Machine learning ethics
    • about / How machines learn
    • data storage / How machines learn, Data storage
    • abstraction / How machines learn, Abstraction
    • generalization / How machines learn, Generalization
    • evaluation / How machines learn, Evaluation
    • working / Machine learning in practice
    • data collection / Machine learning in practice
    • data exploration / Machine learning in practice
    • data preparation / Machine learning in practice
    • model training / Machine learning in practice
    • model evaluation / Machine learning in practice
    • model improvement / Machine learning in practice
    • input data / Types of input data
    • with R / Machine learning with R
  • machine learning algorithms
    • types / Types of machine learning algorithms
  • magrittr package
    • reference / Speeding and simplifying data preparation with dplyr
  • Manhattan distance / Measuring similarity with distance
  • MapReduce
    • about / Parallel cloud computing with MapReduce and Hadoop
    • map step / Parallel cloud computing with MapReduce and Hadoop
    • reduce step / Parallel cloud computing with MapReduce and Hadoop
    • parallel cloud computing / Parallel cloud computing with MapReduce and Hadoop
  • marginal likelihood / Computing conditional probability with Bayes' theorem
  • market basket analysis / Types of machine learning algorithms
  • massive matrices
    • using, with bigmemory package / Using massive matrices with bigmemory
  • matrix
    • about / Matrices and arrays
  • matrix format data / Types of input data
  • matrix inverse / Multiple linear regression
  • matrix notation / Multiple linear regression
  • maximum margin hyperplane (MMH) / Classification with hyperplanes
  • mean / Measuring the central tendency – mean and median
  • mean absolute error (MAE) / Measuring performance with the mean absolute error
  • median / Measuring the central tendency – mean and median
  • medical expenses, predicting with linear regression
    • about / Example – predicting medical expenses using linear regression
    • data collection / Step 1 – collecting data
    • data preparation / Step 2 – exploring and preparing the data
    • data exploration / Step 2 – exploring and preparing the data
    • relationships, exploring among features / Exploring relationships among features – the correlation matrix
    • relationships, visualizing among features / Visualizing relationships among features – the scatterplot matrix
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
    • model specification / Model specification – adding nonlinear relationships
    • non-linear relationships, adding / Model specification – adding nonlinear relationships
    • numeric variable, converting to binary indicator / Transformation – converting a numeric variable to a binary indicator
    • transformation / Transformation – converting a numeric variable to a binary indicator
    • interaction effects, adding / Model specification – adding interaction effects
    • improved regression model / Putting it all together – an improved regression model
    • predictions, making with regression model / Making predictions with a regression model
  • message passing interface (MPI)
    • about / Working in parallel with multicore and snow
  • meta-learners / Types of machine learning algorithms
  • meta-learning
    • model performance, improving with / Improving model performance with meta-learning
  • microarray / Analyzing bioinformatics data
  • Microsoft Azure / Step 5 – improving model performance
  • Microsoft Excel files
    • importing, with rio / Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
  • min-max normalization / Preparing data for use with k-NN
  • mobile phone spam, filtering with the Naive Bayes algorithm
    • about / Example – filtering mobile phone spam with the Naive Bayes algorithm
    • data collection / Step 1 – collecting data
    • data exploration / Step 2 – exploring and preparing the data
    • data preparation / Step 2 – exploring and preparing the data
    • text data, cleaning / Data preparation – cleaning and standardizing text data
    • text data, standardizing / Data preparation – cleaning and standardizing text data
    • text documents, splitting into words / Data preparation – splitting text documents into words
    • training dataset, creating / Data preparation – creating training and test datasets
    • test dataset, creating / Data preparation – creating training and test datasets
    • text data, visualizing / Visualizing text data – word clouds
    • indicator features, creating for frequent words / Data preparation – creating indicator features for frequent words
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
  • model performance
    • improving, with meta-learning / Improving model performance with meta-learning
  • model trees
    • about / Understanding regression trees and model trees
  • multicore package
    • about / Working in parallel with multicore and snow
  • multilayer network / The number of layers
  • multilayer perceptron (MLP) / The direction of information travel
  • multimodal / Measuring the central tendency – the mode
  • multinomial logistic regression
    • about / Understanding regression
  • multiple linear regression
    • about / Understanding regression, Multiple linear regression
    • strengths / Multiple linear regression
    • weaknesses / Multiple linear regression
  • multiple regression
    • about / Understanding regression
  • multivariate relationships / Exploring relationships between variables
  • mutually exclusive event / Understanding probability

N

  • Naive Bayes
    • about / Understanding Naive Bayes
    • using, in classification / Classification with Naive Bayes
    • numeric features, using with / Using numeric features with Naive Bayes
  • Naive Bayes algorithm
    • about / The Naive Bayes algorithm
    • strengths / The Naive Bayes algorithm
    • weaknesses / The Naive Bayes algorithm
  • nearest neighbor classification
    • about / Understanding nearest neighbor classification
    • k-NN algorithm / The k-NN algorithm
  • negative class predictions / A closer look at confusion matrices
  • network analysis / Analyzing and visualizing network data
  • network data
    • analyzing / Analyzing and visualizing network data
    • visualizing / Analyzing and visualizing network data
  • network topology / From biological to artificial neurons
    • about / Network topology
    • number of layers / The number of layers
    • direction of information travel / The direction of information travel
    • number of nodes in each layer / The number of nodes in each layer
  • neural networks
    • characteristics / From biological to artificial neurons
    • training, with backpropagation / Training neural networks with backpropagation
  • neurons
    • about / Understanding neural networks
  • nodes
    • about / Understanding neural networks, Analyzing and visualizing network data
  • No Free Lunch theorem
    • reference / Evaluation
  • nominal data / Types of input data
  • nonlinear spaces
    • kernels, using for / Using kernels for nonlinear spaces
  • non-parametric learning methods / Why is the k-NN algorithm lazy?
  • normal distribution / Understanding numeric data – uniform and normal distributions
  • numeric data / Types of input data
  • numeric features
    • using, with Naive Bayes / Using numeric features with Naive Bayes
  • numeric prediction / Types of machine learning algorithms
  • numeric variables
    • exploring / Exploring numeric variables

O

  • OCR, performing with SVMs
    • about / Example – performing OCR with SVMs
    • data collection / Step 1 – collecting data
    • data exploration / Step 2 – exploring and preparing the data
    • data preparation / Step 2 – exploring and preparing the data
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
    • SVM kernel function, modifying / Changing the SVM kernel function
    • best SVM cost parameter, identifying / Identifying the best SVM cost parameter
  • one-way table / Exploring categorical variables
  • one-hot encoding / Preparing data for use with k-NN
  • online data
    • working with / Working with online data and services
  • online services
    • working with / Working with online data and services
  • Open Database Connectivity (ODBC) / The tidy approach to managing database connections
  • optical character recognition (OCR) / Example – performing OCR with SVMs
  • optimized learning algorithms
    • deploying / Deploying optimized learning algorithms
  • ordinal / Types of input data
  • ordinary least squares (OLS)
    • about / Ordinary least squares estimation
  • out-of-bag error rate / Training random forests
  • overfitting / Evaluation

P

  • parallel cloud computing
    • with MapReduce / Parallel cloud computing with MapReduce and Hadoop
    • with Hadoop / Parallel cloud computing with MapReduce and Hadoop
    • with Apache Spark / Parallel cloud computing with Apache Spark
  • parallel computing
    • about / Learning faster with parallel computing
    • execution time, measuring / Measuring execution time
  • parallel package
    • about / Working in parallel with multicore and snow
  • parameter estimates
    • about / Simple linear regression
  • parameter tuning / Tuning stock models for better performance
  • pattern discovery / Types of machine learning algorithms
  • Pearson's chi-squared test for independence / Examining relationships – two-way cross-tabulations
  • Pearson correlation coefficient
    • about / Correlations
  • percentiles / Measuring spread – quartiles and the five-number summary
  • performance
    • measuring, for classification / Measuring performance for classification
    • measuring, confusion matrix used / Using confusion matrices to measure performance
  • performance measures / Beyond accuracy – other measures of performance
  • performance tradeoffs
    • visualizing, with ROC curves / Visualizing performance tradeoffs with ROC curves
  • pipe operator
    • about / Speeding and simplifying data preparation with dplyr
  • poisonous mushrooms, identifying with rule learners
    • about / Example – identifying poisonous mushrooms with rule learners
    • data collection / Step 1 – collecting data
    • data exploration / Step 2 – exploring and preparing the data
    • data preparation / Step 2 – exploring and preparing the data
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
  • Poisson regression / Understanding regression
  • polynomial kernel / Using kernels for nonlinear spaces
  • positive class predictions / A closer look at confusion matrices
  • positive predictive value / Precision and recall
  • post-pruning / Pruning the decision tree
  • posterior probability / Computing conditional probability with Bayes' theorem
  • pre-pruning / Pruning the decision tree
  • precision / Precision and recall
  • prediction accuracy / Using confusion matrices to measure performance
  • predictive model / Types of machine learning algorithms
  • prior probability / Computing conditional probability with Bayes' theorem
  • probability
    • about / Understanding probability
    • joint probability / Understanding joint probability
  • pROC
    • reference / Visualizing performance tradeoffs with ROC curves
  • pseudorandom number generator / Data preparation – creating random training and test datasets
  • pure / Choosing the best split
  • purity / Choosing the best split

Q

  • quadratic optimization / The case of linearly separable data
  • quantiles / Measuring spread – quartiles and the five-number summary
  • quartiles / Measuring spread – quartiles and the five-number summary
  • quintiles / Measuring spread – quartiles and the five-number summary

R

  • 1R algorithm
    • about / The 1R algorithm
    • strengths / The 1R algorithm
    • weaknesses / The 1R algorithm
  • R
    • data structures / R data structures
    • data, managing / Managing data with R
  • radial basis function (RBF) / Activation functions
  • random-access memory (RAM) / Data storage
  • random forest models
    • strengths / Random forests
    • weaknesses / Random forests
  • random forest performance
    • evaluating, in simulated competition / Evaluating random forest performance in a simulated competition
  • random forests / Random forests
    • training / Training random forests
  • random sample / Data preparation – creating random training and test datasets
  • range / Measuring spread – quartiles and the five-number summary
  • ranger
    • random forests, growing faster / Growing random forests faster with ranger
  • RCurl package
    • reference / Downloading the complete text of web pages
  • readr package
    • tidy tables, importing with / Importing tidy tables with readr
  • real-world data
    • managing / Managing and preparing real-world data
    • preparing / Managing and preparing real-world data
  • recall / Precision and recall
  • receiver operating characteristic (ROC) curve / Visualizing performance tradeoffs with ROC curves
  • rectifier / Step 5 – improving model performance
  • rectified linear unit (ReLU) / Step 5 – improving model performance
  • recurrent network / The direction of information travel
  • recursive partitioning
    • about / Divide and conquer
  • regression
    • about / Understanding regression
    • simple linear regression / Simple linear regression
    • multiple linear regression / Multiple linear regression
    • adding, to trees / Adding regression to trees
  • regression analysis / Understanding regression
  • regression trees
    • about / Understanding regression trees and model trees
    • strengths / Adding regression to trees
    • weaknesses / Adding regression to trees
  • reinforcement learning / Types of machine learning algorithms
  • relationships
    • exploring, between variables / Exploring relationships between variables
    • visualizing / Visualizing relationships – scatterplots
  • repeated holdout / The holdout method
  • repeated k-fold CV / Cross-validation
  • residuals / Ordinary least squares estimation
  • resubstitution error / Estimating future performance
  • RHadoop project
    • reference / Parallel cloud computing with MapReduce and Hadoop
  • rio package
    • Microsoft Excel files, importing / Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
    • SAS files, importing / Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
    • SPSS files, importing / Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
    • Stata files, importing / Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
    • reference / Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
  • RIPPER algorithm
    • about / The RIPPER algorithm
    • strengths / The RIPPER algorithm
    • weaknesses / The RIPPER algorithm
  • ROC curves
    • performance tradeoffs, visualizing with / Visualizing performance tradeoffs with ROC curves
  • root node
    • about / Understanding decision trees
  • rote learning / Why is the k-NN algorithm lazy?
  • R packages
    • installing / Installing R packages
    • loading / Loading and unloading R packages
    • unloading / Loading and unloading R packages
  • R performance, improving
    • about / Improving the performance of R
    • large datasets, managing / Managing very large datasets
    • parallel computing, using / Learning faster with parallel computing
    • optimized learning algorithms, deploying / Deploying optimized learning algorithms
    • GPU computing / GPU computing
  • RStudio
    • installing / Installing RStudio
    • reference / Installing RStudio
  • rule learner / What makes trees and rules greedy?
  • rules
    • greedy approach / What makes trees and rules greedy?
  • RWeka / Installing R packages

S

  • sample SMS ham / Step 1 – collecting data
  • sample SMS spam / Step 1 – collecting data
  • SAS files
    • importing, with rio / Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
  • scatterplot matrix / Visualizing relationships among features – the scatterplot matrix
  • scatterplots
    • visualizing / Visualizing relationships – scatterplots
  • segmentation analysis / Types of machine learning algorithms
  • semi-supervised learning
    • about / Clustering as a machine learning task
  • sensitivity / Sensitivity and specificity
  • separate and conquer
    • about / Separate and conquer
  • short message service (SMS) / Example – filtering mobile phone spam with the Naive Bayes algorithm
  • sigmoid activation function / Activation functions
  • sigmoid kernel / Using kernels for nonlinear spaces
  • simple linear regression
    • about / Understanding regression, Simple linear regression
  • single-layer network / The number of layers
  • slack variable / The case of nonlinearly separable data
  • slope / Understanding regression
  • slope-intercept form
    • about / Understanding regression
  • SmoothReLU / Step 5 – improving model performance
  • SMS Spam Collection
    • reference / Step 1 – collecting data
  • SnowballC package
    • reference / Data preparation – cleaning and standardizing text data
  • snow package
    • about / Working in parallel with multicore and snow
  • social networking service (SNS) / Finding teen market segments using k-means clustering
  • softplus / Step 5 – improving model performance
  • Sparkling Water
    • about / A faster machine learning computing engine with H2O
  • sparse matrix / Data preparation – splitting text documents into words
    • plotting / Visualizing the transaction data – plotting the sparse matrix
  • specificity / Sensitivity and specificity
  • spread
    • measuring / Measuring spread – quartiles and the five-number summary
  • SPSS files
    • importing, with rio / Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
  • SQL connectivity
    • with RODBC / A traditional approach to SQL connectivity with RODBC
  • SQL databases
    • data, querying in / Querying data in SQL databases
  • squashing functions / Activation functions
  • stacking
    • about / Understanding ensembles
  • standard deviation / Measuring spread – variance and standard deviation
  • standard deviation reduction (SDR) / Adding regression to trees
  • Stata files
    • importing, with rio / Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
  • statistical hypothesis testing / Understanding regression
  • stock models
    • tuning, for better performance / Tuning stock models for better performance
  • stop words / Data preparation – cleaning and standardizing text data
  • stratified random sampling / The holdout method
  • strong rules / Measuring rule interest – support and confidence
  • structured data / Types of input data
  • Structured Query Language (SQL) / Querying data in SQL databases
  • subtree raising / Pruning the decision tree
  • subtree replacement / Pruning the decision tree
  • success rate / Using confusion matrices to measure performance
  • summary statistics / Exploring numeric variables
  • sum of squared errors (SSE) / Step 3 – training a model on the data, Ordinary least squares estimation
  • supervised learning / Types of machine learning algorithms
  • support vector machine (SVM)
    • about / Understanding support vector machines
    • applications / Understanding support vector machines
  • support vectors / Classification with hyperplanes
  • SVMlight
    • reference / Step 3 – training a model on the data
  • SVMs, with nonlinear kernels
    • strengths / Using kernels for nonlinear spaces
    • weaknesses / Using kernels for nonlinear spaces
  • synapse / From biological to artificial neurons

T

  • tab-separated values (TSV) / Importing and saving data from CSV files
  • tabular data structures
    • generalizing, with tibble package / Generalizing tabular data structures with tibble
  • TensorFlow
    • reference / Flexible numeric computing and machine learning with TensorFlow
    • flexible numeric computing / Flexible numeric computing and machine learning with TensorFlow
    • machine learning / Flexible numeric computing and machine learning with TensorFlow
  • tensors / Flexible numeric computing and machine learning with TensorFlow
  • term-document matrix (TDM) / Data preparation – splitting text documents into words
  • terminal nodes
    • about / Understanding decision trees
  • tertiles / Measuring spread – quartiles and the five-number summary
  • test dataset / Evaluation
  • threshold activation function / Activation functions
  • tibble package
    • tabular data structures, generalizing with / Generalizing tabular data structures with tibble
  • tidy tables
    • importing, with readr package / Importing tidy tables with readr
  • tidyverse packages
    • using / Making data "tidy" with the tidyverse packages
    • reference / Making data "tidy" with the tidyverse packages
  • tm package / Data preparation – cleaning and standardizing text data
  • tokenization / Data preparation – splitting text documents into words
  • training / Abstraction
  • training algorithm / From biological to artificial neurons
  • training dataset / Evaluation
  • trees
    • greedy approach / What makes trees and rules greedy?
    • regression, adding to / Adding regression to trees
  • tree structure
    • about / Understanding decision trees
  • trials
    • about / Basic concepts of Bayesian methods
  • true negative rate / Sensitivity and specificity
  • true positive rate / Sensitivity and specificity
  • tuned model
    • creating / Creating a simple tuned model
  • tuning process
    • customizing / Customizing the tuning process
  • Turing test
    • about / Understanding neural networks
    • reference / Understanding neural networks
  • two-way cross-tabulation / Examining relationships – two-way cross-tabulations

U

  • uniform distribution / Understanding numeric data – uniform and normal distributions
  • Uniform Resource Locator (URL) / Working with online data and services
  • unimodal / Measuring the central tendency – the mode
  • unit of analysis / Types of input data
  • unit of observation / Types of input data
  • unit step activation function / Activation functions
  • univariate statistics / Exploring relationships between variables
  • universal function approximator / The number of nodes in each layer
  • unstructured data / Types of input data
  • unsupervised classification
    • about / Clustering as a machine learning task
  • unsupervised learning / Types of machine learning algorithms

V

  • validation dataset / The holdout method
  • vanishing gradient problem / Step 5 – improving model performance
  • variables
    • relationships, exploring between / Exploring relationships between variables
  • variance / Measuring spread – variance and standard deviation
  • vcd package / The kappa statistic
  • vectors
    • about / Vectors
  • Venn diagram / Understanding joint probability
  • Visualizing Categorical Data / The kappa statistic
  • Voronoi diagram / Using distance to assign and update clusters

W

  • web APIs
    • JSON, parsing from / Parsing JSON from web APIs
  • web pages
    • data, parsing within / Parsing the data within web pages
  • weighted voting process / Choosing an appropriate k
  • Weka
    • reference / Installing R packages
  • wine quality estimation, with regression trees/model trees
    • about / Example – estimating the quality of wines with regression trees and model trees
    • data collection / Step 1 – collecting data
    • data preparation / Step 2 – exploring and preparing the data
    • data exploration / Step 2 – exploring and preparing the data
    • model, training on data / Step 3 – training a model on the data
    • decision trees, visualizing / Visualizing decision trees
    • model performance, evaluating / Step 4 – evaluating model performance
    • performance, measuring with mean absolute error / Measuring performance with the mean absolute error
    • model performance, improving / Step 5 – improving model performance
  • word cloud / Visualizing text data – word clouds
  • wordcloud package
    • reference / Visualizing text data – word clouds

X

  • xml2 homepage
    • reference / Parsing XML documents
  • XML documents
    • parsing / Parsing XML documents
  • XML package
    • reference / Parsing XML documents

Z

  • z-score / Preparing data for use with k-NN
  • z-score standardization / Preparing data for use with k-NN
  • ZeroR
    • about / The 1R algorithm