Machine Learning with R - Third Edition

By: Brett Lantz

Overview of this book

Machine learning, at its core, is concerned with transforming data into actionable knowledge. R offers a powerful set of machine learning methods to quickly and easily gain insight from your data. Machine Learning with R, Third Edition provides a hands-on, readable guide to applying machine learning to real-world problems. Whether you are an experienced R user or new to the language, Brett Lantz teaches you everything you need to uncover key insights, make new predictions, and visualize your findings. This third edition updates the classic R data science book to R 3.6 with newer and better libraries, advice on ethical and bias issues in machine learning, and an introduction to deep learning. Find powerful new insights in your data; discover machine learning with R.
Table of Contents (18 chapters)
Machine Learning with R - Third Edition
Contributors
Preface
Other Books You May Enjoy
Leave a review - let other readers know what you think
Index

Index

A

  • activation function / From biological to artificial neurons, Activation functions
  • AdaBoost / Boosting
  • AdaBoost.M1 algorithm / Boosting
  • adaptive boosting / Boosting the accuracy of decision trees, Boosting
  • adversarial learning / Types of machine learning algorithms
  • algorithms
    • input data, matching to / Matching input data to algorithms
  • allocation function
    • about / Understanding ensembles
  • Amazon Web Services (AWS) / Step 5 – improving model performance
  • ANNs, used for modeling concrete strength
    • about / Example – modeling the strength of concrete with ANNs
    • data collection / Step 1 – collecting data
    • data exploration / Step 2 – exploring and preparing the data
    • data preparation / Step 2 – exploring and preparing the data
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
  • antecedent
    • about / Understanding classification rules
  • Apache Hadoop
    • about / Parallel cloud computing with MapReduce and Hadoop
  • Apache Spark
    • parallel cloud computing / Parallel cloud computing with Apache Spark
  • application programming interfaces (API) / Parsing JSON from web APIs
  • Apriori / The Apriori algorithm for association rule learning
  • Apriori algorithm
    • for association rule learning / The Apriori algorithm for association rule learning
    • strengths / The Apriori algorithm for association rule learning
    • weaknesses / The Apriori algorithm for association rule learning
  • Apriori principle
    • set of rules, building / Building a set of rules with the Apriori principle
  • Apriori property / The Apriori algorithm for association rule learning
  • area under the ROC curve (AUC) / Visualizing performance tradeoffs with ROC curves
  • arrays
    • about / Matrices and arrays
  • artificial neural network (ANN)
    • about / Understanding neural networks
  • artificial neurons
    • about / From biological to artificial neurons
  • association rules
    • about / Understanding association rules
    • left-hand side (LHS) / Understanding association rules
    • right-hand side (RHS) / Understanding association rules
    • applications / Understanding association rules
    • rule interest, measuring / Measuring rule interest – support and confidence
  • automated parameter tuning
    • caret, using for / Using caret for automated parameter tuning
  • axis-parallel splits / Divide and conquer
  • axon / From biological to artificial neurons

B

  • 0.632 bootstrap / Bootstrap sampling
  • backpropagation
    • neural networks, training / Training neural networks with backpropagation
    • about / Training neural networks with backpropagation
  • bag-of-words / Step 2 – exploring and preparing the data
  • bagging / Bagging
  • Bayes' theorem
    • conditional probability, computing / Computing conditional probability with Bayes' theorem
  • Bayesian classifiers
    • uses / Understanding Naive Bayes
  • Bayesian methods
    • about / Understanding Naive Bayes
    • concepts / Basic concepts of Bayesian methods
  • Beowulf cluster
    • about / Working in parallel with multicore and snow
  • betweenness centrality / Analyzing and visualizing network data
  • bias-variance tradeoff / Choosing an appropriate k
  • big data / The origins of machine learning
  • biglm
    • bigger regression models, building / Building bigger regression models with biglm
  • bigmemory package
    • massive matrices, using with / Using massive matrices with bigmemory
    • reference / Using massive matrices with bigmemory
  • bigrf
    • massive random forests, growing / Growing massive random forests with bigrf
    • reference / Growing massive random forests with bigrf
  • bimodal / Measuring the central tendency – the mode
  • binning / Using numeric features with Naive Bayes
  • bins / Using numeric features with Naive Bayes
  • Bioconductor project
    • reference / Analyzing bioinformatics data
  • bioinformatics data
    • analyzing / Analyzing bioinformatics data
  • biological neurons
    • about / From biological to artificial neurons
  • bits / Choosing the best split
  • bivariate relationships / Exploring relationships between variables
  • body mass index (BMI) / Step 1 – collecting data
  • boosting / Boosting
  • bootstrap aggregating / Bagging
  • bootstrap sampling / Bootstrap sampling
  • box-and-whisker plot / Visualizing numeric variables – boxplots
  • boxplot
    • visualizing / Visualizing numeric variables – boxplots
  • branches
    • about / Understanding decision trees
  • breast cancer, diagnosing with the k-NN algorithm
    • about / Example – diagnosing breast cancer with the k-NN algorithm
    • data collection / Step 1 – collecting data
    • data exploration / Step 2 – exploring and preparing the data
    • data preparation / Step 2 – exploring and preparing the data
    • numeric data, normalizing / Transformation – normalizing numeric data
    • training dataset, creating / Data preparation – creating training and test datasets
    • test dataset, creating / Data preparation – creating training and test datasets
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
    • z-score standardization / Transformation – z-score standardization
    • alternative values of k, testing / Testing alternative values of k

C

  • C5.0 decision tree algorithm
    • about / The C5.0 decision tree algorithm
    • strengths / The C5.0 decision tree algorithm
    • weaknesses / The C5.0 decision tree algorithm
    • best split, selecting / Choosing the best split
  • C5.0 decision trees, used for identifying risky bank loans
    • about / Example – identifying risky bank loans using C5.0 decision trees
    • data collection / Step 1 – collecting data
    • data exploration / Step 2 – exploring and preparing the data
    • data preparation / Step 2 – exploring and preparing the data
    • training dataset, creating / Data preparation – creating random training and test datasets
    • test dataset, creating / Data preparation – creating random training and test datasets
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
  • caret package / Beyond accuracy – other measures of performance
    • using, for automated parameter tuning / Using caret for automated parameter tuning
    • models, training in parallel / Training and evaluating models in parallel with caret
    • models, evaluating in parallel / Training and evaluating models in parallel with caret
  • categorical data / Types of input data
  • categorical variables
    • exploring / Exploring categorical variables
  • cell body / From biological to artificial neurons
  • central processing unit (CPU) / Data storage
  • central tendency
    • measuring / Measuring the central tendency – mean and median, Measuring the central tendency – the mode
  • chi-squared statistic / Choosing the best split
  • class-conditional independence / Classification with Naive Bayes
  • classification / Types of machine learning algorithms
    • with Naive Bayes / Classification with Naive Bayes
    • performance, measuring for / Measuring performance for classification
  • classification, with hyperplanes
    • about / Classification with hyperplanes
    • case of linearly separable data / The case of linearly separable data
    • case of non-linearly separable data / The case of nonlinearly separable data
  • classification and regression tree (CART)
    • about / Understanding regression trees and model trees
  • classification rules
    • about / Understanding classification rules
    • separate and conquer / Separate and conquer
    • 1R algorithm / The 1R algorithm
    • RIPPER algorithm / The RIPPER algorithm
  • classifier
    • predictions / Understanding a classifier's predictions
  • class imbalance problem / Measuring performance for classification
  • clustering / Types of machine learning algorithms
    • about / Understanding clustering
    • as machine learning task / Clustering as a machine learning task
  • clusters
    • about / Understanding clustering
  • Cohen's kappa coefficient / The kappa statistic
  • combination function / Understanding ensembles
  • comma-separated values (CSV) / Importing and saving data from CSV files
  • complement / Understanding probability
  • complete text of web pages
    • downloading / Downloading the complete text of web pages
  • Compute Unified Device Architecture (CUDA)
    • about / GPU computing
  • Comprehensive R Archive Network (CRAN)
    • reference / Machine learning with R
  • conditional probability
    • computing, with Bayes' theorem / Computing conditional probability with Bayes' theorem
    • about / Computing conditional probability with Bayes' theorem
  • confusion matrix / Making some mistakes cost more than others
    • about / Measuring performance for classification, A closer look at confusion matrices
    • used, for measuring performance / Using confusion matrices to measure performance
  • consequent
    • about / Understanding classification rules
  • contingency table / Examining relationships – two-way cross-tabulations
  • control object / Customizing the tuning process
  • convex hull / The case of linearly separable data
  • corpus / Data preparation – cleaning and standardizing text data
  • correlation / Visualizing relationships – scatterplots
    • about / Correlations
  • correlation ellipse / Visualizing relationships among features – the scatterplot matrix
  • correlation matrix / Exploring relationships among features – the correlation matrix
  • cost matrix / Making some mistakes cost more than others
  • covariance function / Ordinary least squares estimation
  • covering algorithms
    • about / Separate and conquer
  • CRAN task view, for clustering
    • reference / The k-means clustering algorithm
  • CRAN Web Technologies and Services task view
    • reference / Working with online data and services
  • cross-validation / Cross-validation
  • crosstab / Examining relationships – two-way cross-tabulations
  • CSV files
    • data, importing from / Importing and saving data from CSV files
    • data, saving from / Importing and saving data from CSV files
  • Cubist algorithm / Step 5 – improving model performance
  • cut points / Using numeric features with Naive Bayes

D

  • data
    • managing, with R / Managing data with R
    • importing, from CSV files / Importing and saving data from CSV files
    • saving, from CSV files / Importing and saving data from CSV files
    • exploring / Exploring and understanding data
    • structure / Exploring the structure of data
    • querying, in SQL databases / Querying data in SQL databases
    • parsing, within web pages / Parsing the data within web pages
  • database backend
    • using, with dplyr / Using a database backend with dplyr
  • database connections
    • managing / The tidy approach to managing database connections
  • database management system (DBMS) / Querying data in SQL databases
  • data frames
    • about / Data frames
  • data mining
    • about / The origins of machine learning
  • data munging / Managing and preparing real-world data
  • data preparation
    • speeding up, with dplyr / Speeding and simplifying data preparation with dplyr
    • simplifying, with dplyr / Speeding and simplifying data preparation with dplyr
  • data source name (DSN) / The tidy approach to managing database connections
  • data structures, R
    • about / R data structures
    • vectors / Vectors
    • factors / Factors
    • lists / Lists
    • data frames / Data frames
    • matrices / Matrices and arrays
    • arrays / Matrices and arrays
    • saving / Saving, loading, and removing R data structures
    • loading / Saving, loading, and removing R data structures
    • removing / Saving, loading, and removing R data structures
  • data table
    • used, for making data frames faster / Making data frames faster with data.table
    • reference / Making data frames faster with data.table
  • data wrangling / Managing and preparing real-world data
  • deciles / Measuring spread – quartiles and the five-number summary
  • decision nodes
    • about / Understanding decision trees
  • decision tree
    • pruning / Pruning the decision tree
  • decision tree algorithms
    • benefits / Understanding decision trees
  • decision tree forests / Random forests
  • decision trees
    • about / Understanding decision trees
    • divide and conquer approach / Divide and conquer
    • accuracy, boosting of / Boosting the accuracy of decision trees
    • rules / Rules from decision trees
  • deep learning / The direction of information travel
    • with Keras / An interface for deep learning with Keras
  • deep neural network (DNN) / The direction of information travel
  • delimiter / Importing and saving data from CSV files
  • dendrites / From biological to artificial neurons
  • dependencies / Installing R packages
  • dependent events / Understanding joint probability
  • dependent variable / Visualizing relationships – scatterplots
    • about / Understanding regression
  • descriptive model / Types of machine learning algorithms
  • disk-based data frames
    • creating, with ff package / Creating disk-based data frames with ff
  • distance function / Measuring similarity with distance
  • divide and conquer approach
    • about / Divide and conquer
  • document-term matrix (DTM) / Data preparation – splitting text documents into words
  • domain-specific data
    • working with / Working with domain-specific data
  • doParallel package
    • about / Taking advantage of parallel with foreach and doParallel
  • dot product / Using kernels for nonlinear spaces
  • dplyr package
    • data preparation, speeding up / Speeding and simplifying data preparation with dplyr
    • data preparation, simplifying / Speeding and simplifying data preparation with dplyr
    • database backend, using with / Using a database backend with dplyr
  • dummy coding / Preparing data for use with k-NN

E

  • early stopping / Pruning the decision tree
  • edgelist / Analyzing and visualizing network data
  • edges / Analyzing and visualizing network data
  • elbow method / Choosing the appropriate number of clusters
  • elbow point / Choosing the appropriate number of clusters
  • ensemble methods
    • about / Understanding ensembles
    • bagging / Bagging
    • boosting / Boosting
    • random forests / Random forests
  • ensembles / Types of machine learning algorithms
    • about / Understanding ensembles
    • performance advantages / Understanding ensembles
  • entropy / Choosing the best split
  • epoch
    • about / Training neural networks with backpropagation
  • epoch, backpropagation algorithm
    • forward phase / Training neural networks with backpropagation
    • backward phase / Training neural networks with backpropagation
  • error rate / Using confusion matrices to measure performance
  • Euclidean distance / Measuring similarity with distance
  • Euclidean norm / The case of linearly separable data
  • event
    • about / Basic concepts of Bayesian methods
  • exhaustive event / Understanding probability
  • exploding gradient problem / Step 5 – improving model performance
  • external data files
    • reading / Reading and writing to external data files
    • writing to / Reading and writing to external data files

F

  • F-measure / The F-measure
  • F-score / The F-measure
  • factors
    • about / Factors
  • feedback network / The direction of information travel
  • feedforward networks / The direction of information travel
  • ffbase project
    • reference / Creating disk-based data frames with ff
  • ff package
    • disk-based data frames, creating / Creating disk-based data frames with ff
    • reference / Creating disk-based data frames with ff
  • five-number summary / Measuring spread – quartiles and the five-number summary
  • folds / Cross-validation
  • foreach package
    • about / Taking advantage of parallel with foreach and doParallel
  • frequency table / Computing conditional probability with Bayes' theorem
  • frequently purchased groceries, identifying with association rules
    • about / Example – identifying frequently purchased groceries with association rules
    • data collection / Step 1 – collecting data
    • data preparation / Step 2 – exploring and preparing the data
    • data exploration / Step 2 – exploring and preparing the data
    • sparse matrix, creating for transaction data / Data preparation – creating a sparse matrix for transaction data
    • item support, visualizing / Visualizing item support – item frequency plots
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
    • set of association rules, sorting / Sorting the set of association rules
    • subsets, taking of association rules / Taking subsets of association rules
    • association rules, saving to file/data frame / Saving association rules to a file or data frame
  • future performance
    • estimating / Estimating future performance

G

  • gain ratio / Choosing the best split
  • Gaussian RBF kernel / Using kernels for nonlinear spaces
  • General Data Protection Regulation (GDPR) / Machine learning ethics
  • generalized linear models (GLM)
    • about / Understanding regression
  • Gini index / Choosing the best split
  • glyph / Step 1 – collecting data
  • Google bombing / Machine learning ethics
  • GPU computing
    • about / GPU computing
  • gradient descent
    • about / Training neural networks with backpropagation
  • Graph Modeling Language (GML) / Analyzing and visualizing network data
  • greedy learners / What makes trees and rules greedy?

H

  • H2O Flow
    • about / A faster machine learning computing engine with H2O
  • H2O project
    • about / A faster machine learning computing engine with H2O
  • Hadoop
    • parallel cloud computing / Parallel cloud computing with MapReduce and Hadoop
  • harmonic mean / The F-measure
  • heuristics / Generalization
  • hidden layers / The number of layers
  • histograms
    • visualizing / Visualizing numeric variables – histograms
  • holdout method / The holdout method
  • httr
    • reference / Downloading the complete text of web pages
  • hyperplane / Understanding support vector machines
    • using, in classification / Classification with hyperplanes
  • Hypertext Markup Language (HTML) / Downloading the complete text of web pages
  • hypothesis testing / Understanding regression

I

  • igraph package
    • reference / Analyzing and visualizing network data
  • image processing / Example – performing OCR with SVMs
  • imputation / Data preparation – imputing the missing values
  • Incremental Reduced Error Pruning (IREP) algorithm / The RIPPER algorithm
  • independent events / Understanding joint probability
  • independent variables
    • about / Understanding regression
  • information gain / Choosing the best split
  • input data
    • matching, to algorithms / Matching input data to algorithms
  • instance-based learning / Why is the k-NN algorithm lazy?
  • intercept / Understanding regression
  • interquartile range (IQR) / Measuring spread – quartiles and the five-number summary
  • Interrater Reliability (irr) package / The kappa statistic
  • intersection / Understanding joint probability
  • item frequency plots / Visualizing item support – item frequency plots
  • itemset
    • about / Understanding association rules
  • Iterative Dichotomiser 3 (ID3) algorithm / The C5.0 decision tree algorithm

J

  • J48 / The C5.0 decision tree algorithm
  • Java
    • download link / Installing R packages
  • JavaScript Object Notation (JSON) / Parsing JSON from web APIs
  • joint probability
    • about / Understanding joint probability
  • JSON
    • parsing, from web APIs / Parsing JSON from web APIs
    • reference / Parsing JSON from web APIs
  • jsonlite package
    • reference / Parsing JSON from web APIs

K

  • k-fold cross-validation (k-fold CV) / Cross-validation
  • k-means++ algorithm / Using distance to assign and update clusters
  • k-means algorithm
    • about / The k-means clustering algorithm
    • strengths / The k-means clustering algorithm
    • weaknesses / The k-means clustering algorithm
    • distance, used for assigning clusters / Using distance to assign and update clusters
    • distance, used for updating clusters / Using distance to assign and update clusters
    • appropriate number of clusters, selecting / Choosing the appropriate number of clusters
  • k-means clustering, used for finding teen market segments
    • about / Finding teen market segments using k-means clustering
    • data collection / Step 1 – collecting data
    • data exploration / Step 2 – exploring and preparing the data
    • data preparation / Step 2 – exploring and preparing the data
    • missing values, dummy coding / Data preparation – dummy coding missing values
    • missing values, imputing / Data preparation – imputing the missing values
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
  • k-nearest neighbors (k-NN) algorithm / The k-means clustering algorithm
  • k-NN algorithm
    • strengths / The k-NN algorithm
    • weaknesses / The k-NN algorithm
    • about / The k-NN algorithm
    • example / The k-NN algorithm
    • similarity, measuring with distance / Measuring similarity with distance
    • appropriate k, selecting / Choosing an appropriate k
    • data, preparing for usage with / Preparing data for use with k-NN
    • lazy learning algorithm / Why is the k-NN algorithm lazy?
  • kappa statistic / The kappa statistic
  • Keras
    • reference / An interface for deep learning with Keras
    • deep learning / An interface for deep learning with Keras
  • kernels
    • using, for non-linear spaces / Using kernels for nonlinear spaces
  • kernel trick / Using kernels for nonlinear spaces
  • kernlab
    • reference / Step 3 – training a model on the data
  • knowledge representation / Abstraction

L

  • Laplace estimator / The Laplace estimator
  • leaf nodes
    • about / Understanding decision trees
  • learning rate
    • about / Training neural networks with backpropagation
  • leave-one-out method / Cross-validation
  • levels / Types of machine learning algorithms
  • libstemmer library / Data preparation – cleaning and standardizing text data
  • LIBSVM
    • reference / Step 3 – training a model on the data
  • likelihood / Computing conditional probability with Bayes' theorem
  • likelihood table / Computing conditional probability with Bayes' theorem
  • linear kernel / Using kernels for nonlinear spaces
  • link function
    • about / Understanding regression
  • links / Analyzing and visualizing network data
  • lists
    • about / Lists
  • LOESS curve / Visualizing relationships among features – the scatterplot matrix
  • logistic regression
    • about / Understanding regression

M

  • machine learning
    • origins / The origins of machine learning
    • successes / Uses and abuses of machine learning, Machine learning successes
    • limits / The limits of machine learning
    • ethics / Machine learning ethics
    • about / How machines learn
    • data storage / How machines learn, Data storage
    • abstraction / How machines learn, Abstraction
    • generalization / How machines learn, Generalization
    • evaluation / How machines learn, Evaluation
    • working / Machine learning in practice
    • data collection / Machine learning in practice
    • data exploration / Machine learning in practice
    • data preparation / Machine learning in practice
    • model training / Machine learning in practice
    • model evaluation / Machine learning in practice
    • model improvement / Machine learning in practice
    • input data / Types of input data
    • with R / Machine learning with R
  • machine learning algorithms
    • types / Types of machine learning algorithms
  • magrittr package
    • reference / Speeding and simplifying data preparation with dplyr
  • Manhattan distance / Measuring similarity with distance
  • MapReduce
    • about / Parallel cloud computing with MapReduce and Hadoop
    • map step / Parallel cloud computing with MapReduce and Hadoop
    • reduce step / Parallel cloud computing with MapReduce and Hadoop
    • parallel cloud computing / Parallel cloud computing with MapReduce and Hadoop
  • marginal likelihood / Computing conditional probability with Bayes' theorem
  • market basket analysis / Types of machine learning algorithms
  • massive matrices
    • using, with bigmemory package / Using massive matrices with bigmemory
  • matrix
    • about / Matrices and arrays
  • matrix format data / Types of input data
  • matrix inverse / Multiple linear regression
  • matrix notation / Multiple linear regression
  • maximum margin hyperplane (MMH) / Classification with hyperplanes
  • mean / Measuring the central tendency – mean and median
  • mean absolute error (MAE) / Measuring performance with the mean absolute error
  • median / Measuring the central tendency – mean and median
  • medical expenses, predicting with linear regression
    • about / Example – predicting medical expenses using linear regression
    • data collection / Step 1 – collecting data
    • data preparation / Step 2 – exploring and preparing the data
    • data exploration / Step 2 – exploring and preparing the data
    • relationships, exploring among features / Exploring relationships among features – the correlation matrix
    • relationships, visualizing among features / Visualizing relationships among features – the scatterplot matrix
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
    • model specification / Model specification – adding nonlinear relationships
    • non-linear relationships, adding / Model specification – adding nonlinear relationships
    • numeric variable, converting to binary indicator / Transformation – converting a numeric variable to a binary indicator
    • transformation / Transformation – converting a numeric variable to a binary indicator
    • interaction effects, adding / Model specification – adding interaction effects
    • improved regression model / Putting it all together – an improved regression model
    • predictions, making with regression model / Making predictions with a regression model
  • message passing interface (MPI)
    • about / Working in parallel with multicore and snow
  • meta-learners / Types of machine learning algorithms
  • meta-learning
    • model performance, improving with / Improving model performance with meta-learning
  • microarray / Analyzing bioinformatics data
  • Microsoft Azure / Step 5 – improving model performance
  • Microsoft Excel files
    • importing, with rio / Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
  • min-max normalization / Preparing data for use with k-NN
  • mobile phone spam, filtering with the Naive Bayes algorithm
    • about / Example – filtering mobile phone spam with the Naive Bayes algorithm
    • data collection / Step 1 – collecting data
    • data exploration / Step 2 – exploring and preparing the data
    • data preparation / Step 2 – exploring and preparing the data
    • text data, cleaning / Data preparation – cleaning and standardizing text data
    • text data, standardizing / Data preparation – cleaning and standardizing text data
    • text documents, splitting into words / Data preparation – splitting text documents into words
    • training dataset, creating / Data preparation – creating training and test datasets
    • test dataset, creating / Data preparation – creating training and test datasets
    • text data, visualizing / Visualizing text data – word clouds
    • indicator features, creating for frequent words / Data preparation – creating indicator features for frequent words
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
  • model performance
    • improving, with meta-learning / Improving model performance with meta-learning
  • model trees
    • about / Understanding regression trees and model trees
  • multicore package
    • about / Working in parallel with multicore and snow
  • multilayer network / The number of layers
  • multilayer perceptron (MLP) / The direction of information travel
  • multimodal / Measuring the central tendency – the mode
  • multinomial logistic regression
    • about / Understanding regression
  • multiple linear regression
    • about / Understanding regression, Multiple linear regression
    • strengths / Multiple linear regression
    • weaknesses / Multiple linear regression
  • multiple regression
    • about / Understanding regression
  • multivariate relationships / Exploring relationships between variables
  • mutually exclusive event / Understanding probability

N

  • Naive Bayes
    • about / Understanding Naive Bayes
    • using, in classification / Classification with Naive Bayes
    • numeric features, using with / Using numeric features with Naive Bayes
  • Naive Bayes algorithm
    • about / The Naive Bayes algorithm
    • strengths / The Naive Bayes algorithm
    • weaknesses / The Naive Bayes algorithm
  • nearest neighbor classification
    • about / Understanding nearest neighbor classification
    • k-NN algorithm / The k-NN algorithm
  • negative class predictions / A closer look at confusion matrices
  • network analysis / Analyzing and visualizing network data
  • network data
    • analyzing / Analyzing and visualizing network data
    • visualizing / Analyzing and visualizing network data
  • network topology / From biological to artificial neurons
    • about / Network topology
    • number of layers / The number of layers
    • direction of information travel / The direction of information travel
    • number of nodes in each layer / The number of nodes in each layer
  • neural networks
    • characteristics / From biological to artificial neurons
    • training, with backpropagation / Training neural networks with backpropagation
  • neurons
    • about / Understanding neural networks
  • nodes
    • about / Understanding neural networks, Analyzing and visualizing network data
  • No Free Lunch theorem
    • reference / Evaluation
  • nominal data / Types of input data
  • nonlinear spaces
    • kernels, using for / Using kernels for nonlinear spaces
  • non-parametric learning methods / Why is the k-NN algorithm lazy?
  • normal distribution / Understanding numeric data – uniform and normal distributions
  • numeric data / Types of input data
  • numeric features
    • using, with Naive Bayes / Using numeric features with Naive Bayes
  • numeric prediction / Types of machine learning algorithms
  • numeric variables
    • exploring / Exploring numeric variables

O

  • OCR, performing with SVMs
    • about / Example – performing OCR with SVMs
    • data collection / Step 1 – collecting data
    • data exploration / Step 2 – exploring and preparing the data
    • data preparation / Step 2 – exploring and preparing the data
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
    • SVM kernel function, modifying / Changing the SVM kernel function
    • best SVM cost parameter, identifying / Identifying the best SVM cost parameter
  • one-way table / Exploring categorical variables
  • one-hot encoding / Preparing data for use with k-NN
  • online data
    • working with / Working with online data and services
  • online services
    • working with / Working with online data and services
  • Open Database Connectivity (ODBC) / The tidy approach to managing database connections
  • optical character recognition (OCR) / Example – performing OCR with SVMs
  • optimized learning algorithms
    • deploying / Deploying optimized learning algorithms
  • ordinal / Types of input data
  • ordinary least squares (OLS)
    • about / Ordinary least squares estimation
  • out-of-bag error rate / Training random forests
  • overfitting / Evaluation

P

  • parallel cloud computing
    • with MapReduce / Parallel cloud computing with MapReduce and Hadoop
    • with Hadoop / Parallel cloud computing with MapReduce and Hadoop
    • with Apache Spark / Parallel cloud computing with Apache Spark
  • parallel computing
    • about / Learning faster with parallel computing
    • execution time, measuring / Measuring execution time
  • parallel package
    • about / Working in parallel with multicore and snow
  • parameter estimates
    • about / Simple linear regression
  • parameter tuning / Tuning stock models for better performance
  • pattern discovery / Types of machine learning algorithms
  • Pearson's chi-squared test for independence / Examining relationships – two-way cross-tabulations
  • Pearson correlation coefficient
    • about / Correlations
  • percentiles / Measuring spread – quartiles and the five-number summary
  • performance
    • measuring, for classification / Measuring performance for classification
    • measuring, confusion matrix used / Using confusion matrices to measure performance
  • performance measures / Beyond accuracy – other measures of performance
  • performance tradeoffs
    • visualizing, with ROC curves / Visualizing performance tradeoffs with ROC curves
  • pipe operator
    • about / Speeding and simplifying data preparation with dplyr
  • poisonous mushrooms, identifying with rule learners
    • about / Example – identifying poisonous mushrooms with rule learners
    • data collection / Step 1 – collecting data
    • data exploration / Step 2 – exploring and preparing the data
    • data preparation / Step 2 – exploring and preparing the data
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
  • Poisson regression / Understanding regression
  • polynomial kernel / Using kernels for nonlinear spaces
  • positive class predictions / A closer look at confusion matrices
  • positive predictive value / Precision and recall
  • post-pruning / Pruning the decision tree
  • posterior probability / Computing conditional probability with Bayes' theorem
  • pre-pruning / Pruning the decision tree
  • precision / Precision and recall
  • prediction accuracy / Using confusion matrices to measure performance
  • predictive model / Types of machine learning algorithms
  • prior probability / Computing conditional probability with Bayes' theorem
  • probability
    • about / Understanding probability
    • joint probability / Understanding joint probability
  • pROC
    • reference / Visualizing performance tradeoffs with ROC curves
  • pseudorandom number generator / Data preparation – creating random training and test datasets
  • pure / Choosing the best split
  • purity / Choosing the best split

Q

  • quadratic optimization / The case of linearly separable data
  • quantiles / Measuring spread – quartiles and the five-number summary
  • quartiles / Measuring spread – quartiles and the five-number summary
  • quintiles / Measuring spread – quartiles and the five-number summary

R

  • 1R algorithm
    • about / The 1R algorithm
    • strengths / The 1R algorithm
    • weaknesses / The 1R algorithm
  • R
    • data structures / R data structures
    • data, managing / Managing data with R
  • radial basis function (RBF) / Activation functions
  • random-access memory (RAM) / Data storage
  • random forest models
    • strengths / Random forests
    • weaknesses / Random forests
  • random forest performance
    • evaluating, in simulated competition / Evaluating random forest performance in a simulated competition
  • random forests / Random forests
    • training / Training random forests
  • random sample / Data preparation – creating random training and test datasets
  • range / Measuring spread – quartiles and the five-number summary
  • ranger
    • random forests, growing faster / Growing random forests faster with ranger
  • RCurl package
    • reference / Downloading the complete text of web pages
  • readr package
    • tidy tables, importing with / Importing tidy tables with readr
  • real-world data
    • managing / Managing and preparing real-world data
    • preparing / Managing and preparing real-world data
  • recall / Precision and recall
  • receiver operating characteristic (ROC) curve / Visualizing performance tradeoffs with ROC curves
  • rectifier / Step 5 – improving model performance
  • rectified linear unit (ReLU) / Step 5 – improving model performance
  • recurrent network / The direction of information travel
  • recursive partitioning
    • about / Divide and conquer
  • regression
    • about / Understanding regression
    • simple linear regression / Simple linear regression
    • multiple linear regression / Multiple linear regression
    • adding, to trees / Adding regression to trees
  • regression analysis / Understanding regression
  • regression trees
    • about / Understanding regression trees and model trees
    • strengths / Adding regression to trees
    • weaknesses / Adding regression to trees
  • reinforcement learning / Types of machine learning algorithms
  • relationships
    • exploring, between variables / Exploring relationships between variables
    • visualizing / Visualizing relationships – scatterplots
  • repeated holdout / The holdout method
  • repeated k-fold CV / Cross-validation
  • residuals / Ordinary least squares estimation
  • resubstitution error / Estimating future performance
  • RHadoop project
    • reference / Parallel cloud computing with MapReduce and Hadoop
  • rio package
    • Microsoft Excel files, importing / Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
    • SAS files, importing / Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
    • SPSS files, importing / Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
    • Stata files, importing / Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
    • reference / Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
  • RIPPER algorithm
    • about / The RIPPER algorithm
    • strengths / The RIPPER algorithm
    • weaknesses / The RIPPER algorithm
  • ROC curves
    • performance tradeoffs, visualizing with / Visualizing performance tradeoffs with ROC curves
  • root node
    • about / Understanding decision trees
  • rote learning / Why is the k-NN algorithm lazy?
  • R packages
    • installing / Installing R packages
    • loading / Loading and unloading R packages
    • unloading / Loading and unloading R packages
  • R performance, improving
    • about / Improving the performance of R
    • large datasets, managing / Managing very large datasets
    • parallel computing, using / Learning faster with parallel computing
    • optimized learning algorithms, deploying / Deploying optimized learning algorithms
    • GPU computing / GPU computing
  • RStudio
    • installing / Installing RStudio
    • reference / Installing RStudio
  • rule learner / What makes trees and rules greedy?
  • rules
    • greedy approach / What makes trees and rules greedy?
  • RWeka / Installing R packages

S

  • sample SMS ham / Step 1 – collecting data
  • sample SMS spam / Step 1 – collecting data
  • SAS files
    • importing, with rio / Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
  • scatterplot matrix / Visualizing relationships among features – the scatterplot matrix
  • scatterplots
    • visualizing / Visualizing relationships – scatterplots
  • segmentation analysis / Types of machine learning algorithms
  • semi-supervised learning
    • about / Clustering as a machine learning task
  • sensitivity / Sensitivity and specificity
  • separate and conquer
    • about / Separate and conquer
  • short message service (SMS) / Example – filtering mobile phone spam with the Naive Bayes algorithm
  • sigmoid activation function / Activation functions
  • sigmoid kernel / Using kernels for nonlinear spaces
  • simple linear regression
    • about / Understanding regression, Simple linear regression
  • single-layer network / The number of layers
  • slack variable / The case of nonlinearly separable data
  • slope / Understanding regression
  • slope-intercept form
    • about / Understanding regression
  • SmoothReLU / Step 5 – improving model performance
  • SMS Spam Collection
    • reference / Step 1 – collecting data
  • SnowballC package
    • reference / Data preparation – cleaning and standardizing text data
  • snow package
    • about / Working in parallel with multicore and snow
  • social networking service (SNS) / Finding teen market segments using k-means clustering
  • softplus / Step 5 – improving model performance
  • Sparkling Water
    • about / A faster machine learning computing engine with H2O
  • sparse matrix / Data preparation – splitting text documents into words
    • plotting / Visualizing the transaction data – plotting the sparse matrix
  • specificity / Sensitivity and specificity
  • spread
    • measuring / Measuring spread – quartiles and the five-number summary
  • SPSS files
    • importing, with rio / Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
  • SQL connectivity
    • with RODBC / A traditional approach to SQL connectivity with RODBC
  • SQL databases
    • data, querying in / Querying data in SQL databases
  • squashing functions / Activation functions
  • stacking
    • about / Understanding ensembles
  • standard deviation / Measuring spread – variance and standard deviation
  • standard deviation reduction (SDR) / Adding regression to trees
  • Stata files
    • importing, with rio / Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
  • statistical hypothesis testing / Understanding regression
  • stock models
    • tuning, for better performance / Tuning stock models for better performance
  • stop words / Data preparation – cleaning and standardizing text data
  • stratified random sampling / The holdout method
  • strong rules / Measuring rule interest – support and confidence
  • structured data / Types of input data
  • Structured Query Language (SQL) / Querying data in SQL databases
  • subtree raising / Pruning the decision tree
  • subtree replacement / Pruning the decision tree
  • success rate / Using confusion matrices to measure performance
  • summary statistics / Exploring numeric variables
  • sum of squared errors (SSE) / Step 3 – training a model on the data, Ordinary least squares estimation
  • supervised learning / Types of machine learning algorithms
  • support vector machine (SVM)
    • about / Understanding support vector machines
    • applications / Understanding support vector machines
  • support vectors / Classification with hyperplanes
  • SVMlight
    • reference / Step 3 – training a model on the data
  • SVMs, with nonlinear kernels
    • strengths / Using kernels for nonlinear spaces
    • weaknesses / Using kernels for nonlinear spaces
  • synapse / From biological to artificial neurons

T

  • tab-separated values (TSV) / Importing and saving data from CSV files
  • tabular data structures
    • generalizing, with tibble package / Generalizing tabular data structures with tibble
  • TensorFlow
    • reference / Flexible numeric computing and machine learning with TensorFlow
    • flexible numeric computing / Flexible numeric computing and machine learning with TensorFlow
    • machine learning / Flexible numeric computing and machine learning with TensorFlow
  • tensors / Flexible numeric computing and machine learning with TensorFlow
  • term-document matrix (TDM) / Data preparation – splitting text documents into words
  • terminal nodes
    • about / Understanding decision trees
  • tertiles / Measuring spread – quartiles and the five-number summary
  • test dataset / Evaluation
  • threshold activation function / Activation functions
  • tibble package
    • tabular data structures, generalizing with / Generalizing tabular data structures with tibble
  • tidy tables
    • importing, with readr package / Importing tidy tables with readr
  • tidyverse packages
    • using / Making data "tidy" with the tidyverse packages
    • reference / Making data "tidy" with the tidyverse packages
  • tm package / Data preparation – cleaning and standardizing text data
  • tokenization / Data preparation – splitting text documents into words
  • training / Abstraction
  • training algorithm / From biological to artificial neurons
  • training dataset / Evaluation
  • trees
    • greedy approach / What makes trees and rules greedy?
    • regression, adding to / Adding regression to trees
  • tree structure
    • about / Understanding decision trees
  • trials
    • about / Basic concepts of Bayesian methods
  • true negative rate / Sensitivity and specificity
  • true positive rate / Sensitivity and specificity
  • tuned model
    • creating / Creating a simple tuned model
  • tuning process
    • customizing / Customizing the tuning process
  • Turing test
    • about / Understanding neural networks
    • reference / Understanding neural networks
  • two-way cross-tabulation / Examining relationships – two-way cross-tabulations

U

  • uniform distribution / Understanding numeric data – uniform and normal distributions
  • Uniform Resource Locator (URL) / Working with online data and services
  • unimodal / Measuring the central tendency – the mode
  • unit of analysis / Types of input data
  • unit of observation / Types of input data
  • unit step activation function / Activation functions
  • univariate statistics / Exploring relationships between variables
  • universal function approximator / The number of nodes in each layer
  • unstructured data / Types of input data
  • unsupervised classification
    • about / Clustering as a machine learning task
  • unsupervised learning / Types of machine learning algorithms

V

  • validation dataset / The holdout method
  • vanishing gradient problem / Step 5 – improving model performance
  • variables
    • relationships, exploring between / Exploring relationships between variables
  • variance / Measuring spread – variance and standard deviation
  • vcd package / The kappa statistic
  • vectors
    • about / Vectors
  • Venn diagram / Understanding joint probability
  • Visualizing Categorical Data / The kappa statistic
  • Voronoi diagram / Using distance to assign and update clusters

W

  • web APIs
    • JSON, parsing from / Parsing JSON from web APIs
  • web pages
    • data, parsing within / Parsing the data within web pages
  • weighted voting process / Choosing an appropriate k
  • Weka
    • reference / Installing R packages
  • wine quality estimation, with regression trees/model trees
    • about / Example – estimating the quality of wines with regression trees and model trees
    • data collection / Step 1 – collecting data
    • data preparation / Step 2 – exploring and preparing the data
    • data exploration / Step 2 – exploring and preparing the data
    • model, training on data / Step 3 – training a model on the data
    • decision trees, visualizing / Visualizing decision trees
    • model performance, evaluating / Step 4 – evaluating model performance
    • performance, measuring with mean absolute error / Measuring performance with the mean absolute error
    • model performance, improving / Step 5 – improving model performance
  • word cloud / Visualizing text data – word clouds
  • wordcloud package
    • reference / Visualizing text data – word clouds

X

  • xml2 homepage
    • reference / Parsing XML documents
  • XML documents
    • parsing / Parsing XML documents
  • XML package
    • reference / Parsing XML documents

Z

  • z-score / Preparing data for use with k-NN
  • z-score standardization / Preparing data for use with k-NN
  • ZeroR
    • about / The 1R algorithm