Data Analysis with R, Second Edition

Overview of this book

Frequently the tool of choice for academics, R has spread deep into the private sector and can be found in the production pipelines at some of the most advanced and successful enterprises. The power and domain-specificity of R allow the user to express complex analytics easily, quickly, and succinctly. Starting with the basics of R and statistical reasoning, this book dives into advanced predictive analytics, showing how to apply those techniques to real-world data through real-world examples. Packed with engaging problems and exercises, this book begins with a review of R and its syntax with packages like Rcpp, ggplot2, and dplyr. From there, get to grips with the fundamentals of applied statistics and build on this knowledge to perform sophisticated and powerful analytics. Solve the difficulties of performing data analysis in practice and find solutions for working with messy data and large data, communicating results, and facilitating reproducibility. This book is engineered to be an invaluable resource through many stages of anyone’s career as a data analyst.

Title Page

Copyright and Credits

Packt Upsell

Contributors

Preface

RefresheR

Navigating the basics

Getting help in R

Loading data into R

Working with packages

The Shape of Data

Univariate data

Frequency distributions

Central tendency

Populations, samples, and estimation

Probability distributions

Visualization methods

Describing Relationships

Multivariate data

Relationships between a categorical and continuous variable

Relationships between two categorical variables

The relationship between two continuous variables

Visualization methods

Probability

Basic probability

A tale of two interpretations

Sampling from distributions

The normal distribution

Using Data To Reason About The World

Estimating means

The sampling distribution

Interval estimation

Smaller samples

Testing Hypotheses

The null hypothesis significance testing framework

Testing the mean of one sample

Testing two means

Testing more than two means

Testing independence of proportions

What if my assumptions are unfounded?

Bayesian Methods

The big idea behind Bayesian analysis

Choosing a prior

Who cares about coin flips

Enter MCMC – stage left

Using JAGS and runjags

Fitting distributions the Bayesian way

The Bayesian independent samples t-test

The Bootstrap

What's... uhhh... the deal with the bootstrap?

Performing the bootstrap in R (more elegantly)

Confidence intervals

A one-sample test of means

Bootstrapping statistics other than the mean

Busting bootstrap myths

Predicting Continuous Variables

Simple linear regression

Simple linear regression with a binary predictor

Multiple regression

Regression with a non-binary predictor

Kitchen sink regression

The bias-variance trade-off

Linear regression diagnostics

Advanced topics

Predicting Categorical Variables

k-Nearest neighbors

Logistic regression

Choosing a classifier

Predicting Changes with Time

What is a time series?

What is forecasting?

Creating and plotting time series

Components of time series

Time series decomposition

Autocorrelation

ETS and the state space model

Interventions for improvement

What we didn't cover

Citations for the climate change data

Sources of Data

Relational databases

Other data formats

Online repositories

Dealing with Missing Data

Analysis with missing data

Visualizing missing data

Types of missing data

Unsophisticated methods for dealing with missing data

So how does mice come up with the imputed values?

Dealing with Messy Data

Checking unsanitized data

Regular expressions

Other tools for messy data

Dealing with Large Data

Wait to optimize

Using a bigger and faster machine

Be smart about your code

Using optimized packages

Using another R implementation

Using parallelization

Being smarter about your code

Working with Popular R Packages

The data.table package

Using dplyr and tidyr to manipulate data

Functional programming as a main tidyverse principle

Reshaping data with tidyr

Reproducibility and Best Practices

Version control

Communicating results

Other Books You May Enjoy

Leave a review - let other readers know what you think

Index

Index

A

Accelerate / Using another R implementation
Akaike Information Criterion (AIC) / Striking a balance
Allighate / Basic probability
ANCOVA (Analysis of Covariance)
- about / Testing more than two means, Generalized Linear Model (GLM)
- assumptions / Assumptions of ANOVA
anonymous functions / Functions
Application Programming Interface (API) / Using JSON
arguments / Arithmetic and assignment
arithmetic operations / Arithmetic and assignment
assertions
- chaining / Chaining assertions
assignment operations / Arithmetic and assignment
ATLAS / Using another R implementation
autocorrelation / Autocorrelation
automated forecast model / Interventions for improvement
automatic multithreading / Using another R implementation
available-case analysis / Pairwise deletion
average deviation / Spread

B

bagging / Random forests
bandwidth / Probability distributions
base-rate fallacy / Basic probability
batch mode / Navigating the basics
Bayes' Theorem / Basic probability
Bayesian analogue
- distributions, fitting / Fitting distributions the Bayesian way
- independent samples t-test / The Bayesian independent samples t-test
Bayesian analysis / The big idea behind Bayesian analysis
Bayesian linear regression / Advanced topics
bell curve / Central tendency
Bernoulli distribution / Sampling from distributions
bias-corrected and accelerated confidence interval (BCa) / Confidence intervals
bias-variance trade-off / The bias-variance trade-off
binomial distribution / The binomial distribution
birthday problem / An example of (some) substance
Bonferroni correction / Testing more than two means
Booleans / Logicals and characters
bootstrap
- about / What's... uhhh... the deal with the bootstrap?
- performing, in R / Performing the bootstrap in R (more elegantly)
- myths / Busting bootstrap myths
bootstrap distribution / What's... uhhh... the deal with the bootstrap?
bootstrapping statistics / Bootstrapping statistics other than the mean
bootstrap replications / Performing the bootstrap in R (more elegantly)
Box / Version control
by argument
- used, for grouping / Using the by argument for grouping
by reference semantics / The i in DT [i, j, by], What in the world are by reference semantics?
by value semantics / What in the world are by reference semantics?

C

categorical, and continuous variable
- relationships / Relationships between a categorical and continuous variable
categorical variables
- relationships / Relationships between two categorical variables
central limit theorem / The sampling distribution
central tendency / Central tendency
checkpoint / Package version management
classifiers
- selecting / Choosing a classifier
- vertical decision boundary / The vertical decision boundary
- diagonal decision boundary / The diagonal decision boundary
- crescent decision boundary / The crescent decision boundary
- circular decision boundary / The circular decision boundary
code performance
- about / Be smart about your code
- memory allocation / Allocation of memory
- vectorization / Vectorization
coin flips / Who cares about coin flips
columns
- selecting / Selecting and renaming columns
- renaming / Selecting and renaming columns
- computing / Computing on columns
comments / Arithmetic and assignment
communicating results / Communicating results
complete case analysis / Complete case analysis
Comprehensive R Archive Network (CRAN) / Working with packages
confidence intervals / Confidence intervals
confusion matrices / Confusion matrices
continuous variables
- relationship / The relationship between two continuous variables
copy-on-modify / What in the world are by reference semantics?
copy-on-write / What in the world are by reference semantics?
correlation coefficients / Correlation coefficients
covariance / Covariance
CRAN Task Views
- reference / Other data formats
cross-tabulation / Relationships between two categorical variables
cross-validation / Cross-validation
CVS / Version control

D

data
- loading, into R / Loading data into R
- messy situations / Other tools for messy data
- OpenRefine / OpenRefine
- fuzzy matching / Fuzzy matching
data.table package
- about / The data.table package
- i argument, in DT [i, j, by] / The i in DT [i, j, by]
- by reference semantics / What in the world are by reference semantics?
- j argument, in DT[i, j, by] / The j in DT[i, j, by]
- i and j arguments, using / Using both i and j
- by argument, used for grouping / Using the by argument for grouping
- data tables, joining / Joining data tables
- data, pivoting / Reshaping, melting, and pivoting data
- data, reshaping / Reshaping, melting, and pivoting data
- data, melting / Reshaping, melting, and pivoting data
data formats / Other data formats
data manipulation
- with dplyr / Using dplyr and tidyr to manipulate data
- with tidyr / Using dplyr and tidyr to manipulate data
data normalization / Regex for data normalization
data points
- checking / Checking for outliers, entry errors, or unlikely data points
data type of column
- checking / Checking the data type of a column
decision trees / Decision trees
degrees of freedom / Populations, samples, and estimation
directional hypothesis / One and two-tailed tests
discrete numeric variable / Univariate data
DOM (Document Object Model) / XML
double exponential smoothing / Double exponential smoothing
dplyr
- used, for data manipulation / Using dplyr and tidyr to manipulate data
- data, loading / Loading data for use in dplyr
- grouping / Grouping in dplyr
- data, joining / Joining data

E

Emacs Speaks Statistics (ESS) / R scripting
ensemble learning / Random forests
entry errors
- checking / Checking for outliers, entry errors, or unlikely data points
Error, Trend, and Seasonal (ETS)
- state space model / ETS and the state space model
errors, NHST / Errors in NHST
estimate / Populations, samples, and estimation
Expectation Maximization (EM) method / Stochastic regression imputation

F

FastR / Using another R implementation
flow of control construct / Flow of control
forecasting
- about / What is forecasting?
- uncertainty / Uncertainty
- difficulties / Difficulties in forecasting
forking / Getting started with parallel R
frequency distributions / Frequency distributions
functional programming
- about / Functional programming as a main tidyverse principle
- data, loading in dplyr / Loading data for use in dplyr
- rows, manipulating / Manipulating rows
- columns, selecting / Selecting and renaming columns
- columns, renaming / Selecting and renaming columns
- columns, computing / Computing on columns
- grouping, in dplyr / Grouping in dplyr
- data, joining in dplyr / Joining data
functions / Functions
fuzzy matching / Fuzzy matching

G

Gaussian distribution / Central tendency
Gaussian white noise / White noise
Generalized Additive Models (GAMs) / Advanced topics
Generalized Linear Model (GLM) / Generalized Linear Model (GLM)
Git / Version control

H

H0 (null hypothesis) / The null hypothesis significance testing framework
H1 (alternative hypothesis) / The null hypothesis significance testing framework
Holm-Bonferroni correction / Testing more than two means
hot deck imputation / Hot deck imputation
hyperplane / Multiple regression

I

imperative programming / Functional programming as a main tidyverse principle
independence of proportions
- testing / Testing independence of proportions
independent samples t-test
- assumptions / Assumptions of the independent samples t-test
indexing / Subsetting
Integrated Development Environment (IDE) / R scripting
Intel Math Kernel Library (MKL) / Using another R implementation
interaction terms / Advanced topics
interpretations / A tale of two interpretations
interval estimation
- about / Interval estimation
- qnorm function, using / How did we get 1.96?
Iteratively Re-Weighted Least Squares (IWLS) / A word of warning

J

Jaccard index / Using JSON
JavaScript Object Notation (JSON) / Using JSON
joint distribution / Enter MCMC – stage left
Just Another Gibbs Sampler (JAGS)
- using / Using JAGS and runjags

K

k-fold cross validation / Cross-validation
k-Nearest neighbors
- about / k-Nearest neighbors
- using, in R / Using k-NN in R
- limitations / Limitations of k-NN
kernel density estimation / Probability distributions
kitchen sink regression / Kitchen sink regression

L

lambda functions / Functions
LaTeX / Communicating results
left-tailed distribution / Central tendency
linear models / Linear models
linear regression diagnostics
- about / Linear regression diagnostics
- second Anscombe relationship / Second Anscombe relationship
- third Anscombe relationship / Third Anscombe relationship
- fourth Anscombe relationship / Fourth Anscombe relationship
list-wise deletion / Complete case analysis
logistic regression
- about / Logistic regression
- using, in R / Using logistic regression in R

M

Mann-Whitney U test / What if my assumptions are unfounded?
Markov chain Monte Carlo (MCMC) / Enter MCMC – stage left
mathematical operators
- arithmetic / Arithmetic and assignment
- assignments / Arithmetic and assignment
- logical / Logicals and characters
- characters / Logicals and characters
matrices / Matrices
Maximum Likelihood Estimate (MLE) / The big idea behind Bayesian analysis, Logistic regression
mean height
- estimating / Estimating means
mean of one sample
- testing / Testing the mean of one sample
Mean Squared Error (MSE) / Simple linear regression
mean substitution / Mean substitution
Mercurial / Version control
methods, for missing data
- complete case analysis / Complete case analysis
- pairwise deletion / Pairwise deletion
- mean substitution / Mean substitution
- hot deck imputation / Hot deck imputation
- regression imputation / Regression imputation
- stochastic regression imputation / Stochastic regression imputation
- multiple imputation / Multiple imputation
mice
- imputed values, obtaining / So how does mice come up with the imputed values?
- methods of imputation, using / Methods of imputation
- multiple imputation, using / Multiple imputation in practice
- reference / Multiple imputation in practice
Missing At Random (MAR) / Types of missing data
Missing Completely At Random (MCAR) / Types of missing data
missing data
- analysis / Analysis with missing data
- visualizing / Visualizing missing data
- Missing Completely At Random (MCAR) / Types of missing data
- Missing At Random (MAR) / Types of missing data
- Missing Not At Random (MNAR) / Types of missing data
- dataset, assumption / So which one is it?
Missing Not At Random (MNAR) / Types of missing data
Monte Carlo case resampling / What have we left out?
Monte Carlo simulation / Who cares about coin flips
multiple correlations
- comparing / Comparing multiple correlations
multiple imputation / Multiple imputation
multiple means
- testing / Testing more than two means
multiple regression / Multiple regression
multivariate data / Multivariate data
MusicBrainz
- reference / XML

N

negatively skewed distribution / Central tendency
non-linear modeling / Advanced topics
normal distribution
- about / The normal distribution
- three-sigma rule / The three-sigma rule and using z-tables
- z-tables, using / The three-sigma rule and using z-tables
Not a Number (NaN) / Arithmetic and assignment
null hypothesis / The null hypothesis significance testing framework
Null Hypothesis Significance Testing (NHST)
- about / The null hypothesis significance testing framework
- one-tailed test / One and two-tailed tests
- two-tailed tests / One and two-tailed tests
- errors / Errors in NHST
- warning, about significance / A warning about significance
- p-values / A warning about p-values

O

one-sample test
- of means / A one-sample test of means
one sample t-test
- about / Testing the mean of one sample
- assumptions / Assumptions of the one sample t-test
online repositories / Online repositories
OpenBLAS / Using another R implementation
OpenRefine / OpenRefine
optimization / Wait to optimize
optimized packages
- using / Using optimized packages
ordinal variable / Multiple imputation in practice
Out-Of-Bag (OOB) error rate / Random forests
out-of-bounds data
- checking / Checking for out-of-bounds data
outliers
- checking / Checking for outliers, entry errors, or unlikely data points

P

p-values / A warning about p-values
packages
- working with / Working with packages
package version management / Package version management
packrat / Package version management
pairwise deletion / Pairwise deletion
parallelization
- using / Using parallelization
- in R / Getting started with parallel R
- example / An example of (some) substance
parameters / Parameters
polymorphism / Loading data into R
population / Populations, samples, and estimation
positively skewed distribution / Central tendency
pqR / Using another R implementation
predictive mean matching / Methods of imputation
prior
- about / Basic probability
- selecting / Choosing a prior
probability / Basic probability
probability density function (PDF) / Probability distributions
probability distribution
- about / Probability distributions
- sampling / Sampling from distributions
- parameters / Parameters
- binomial distribution / The binomial distribution
probability mass function (PMF) / Probability distributions

Q

QQ-plot (quantile-quantile plot) / What if my assumptions are unfounded?
quantile / How did we get 1.96?

R

R
- help, obtaining / Getting help in R
- data, loading into / Loading data into R
- bootstrap, performing / Performing the bootstrap in R (more elegantly)
- k-Nearest neighbors, using / Using k-NN in R
- logistic regression, using / Using logistic regression in R
random forests / Random forests
Rcpp
- using / Using Rcpp
Rcpp FAQ
- reference / Using Rcpp
regression
- about / Correlation coefficients
- with non-binary predictor / Regression with a non-binary predictor
regression imputation / Regression imputation
regular expressions
- about / Regular expressions, What are regular expressions?, Getting started
- for data normalization / Regex for data normalization
- normalization / More normalization
regularization / Advanced topics
relational databases / Relational databases
Renjin / Using another R implementation
REPL (Read-Evaluate-Print-Loop) / Navigating the basics
Residual Sum of Squares (RSS) / Simple linear regression
Revolution R Enterprise / Using another R implementation
Revolution R Open / Using another R implementation
right-tailed distribution / Central tendency
Root Mean Squared Error (RMSE) / Simple linear regression
rows
- manipulating / Manipulating rows
R projects / R projects
R scripts
- about / R scripting
- executing / Running R scripts
- example / An example script
- reproducibility / Scripting and reproducibility
- scripting / Scripting and reproducibility
RStudio
- about / RStudio
- reference / RStudio
Rtools
- reference / Using Rcpp
runjags
- using / Using JAGS and runjags

S

sampling distribution / The sampling distribution
sampling with replacement / What's... uhhh... the deal with the bootstrap?
sanity test / Multiple imputation in practice
simple exponential smoothing
- for forecasting / Simple exponential smoothing for forecasting
simple linear regression
- about / Simple linear regression
- with binary predictor / Simple linear regression with a binary predictor
- warning / A word of warning
Simpson's Paradox / Relationships between two categorical variables
smaller sample / Smaller samples
smoothing
- about / Smoothing
- accuracy assessment / Accuracy assessment
- double exponential smoothing / Double exponential smoothing
- triple exponential smoothing / Triple exponential smoothing
Spearman's rho / Correlation coefficients
spread operation / Spread
standard deviation / Spread
standard error / The sampling distribution
standard evaluation / Selecting and renaming columns
standardization / The three-sigma rule and using z-tables
stepwise regression / Striking a balance
stochastic regression imputation / Stochastic regression imputation
strings / Logicals and characters
Student's t-distribution / Smaller samples
subscript operator / Subsetting
subsetting / Subsetting
Subversion / Version control

T

TeamDrive / Version control
three-sigma rule / The three-sigma rule and using z-tables
tidyr
- used, for data manipulation / Using dplyr and tidyr to manipulate data
- data, reshaping / Reshaping data with tidyr
Tidy Tools Manifesto
- reference / Using dplyr and tidyr to manipulate data
tidyverse
- about / Using dplyr and tidyr to manipulate data
- reference / Using dplyr and tidyr to manipulate data
- functional programming / Functional programming as a main tidyverse principle
time series
- about / What is a time series?
- creating / Creating and plotting time series
- plotting / Creating and plotting time series
- components / Components of time series
time series decomposition / Time series decomposition
trend line / Correlation coefficients
triple exponential smoothing / Triple exponential smoothing
Tukey's variation / Relationships between a categorical and continuous variable
two-fold cross validation / Cross-validation
two means
- testing / Testing two means

U

unexpected categories
- checking / Checking for unexpected categories
univariate data / Univariate data
unsanitized data
- checking / Checking unsanitized data
- out-of-bounds data, checking / Checking for out-of-bounds data
- data type of column, checking / Checking the data type of a column
- unexpected categories, checking / Checking for unexpected categories
- outliers, checking / Checking for outliers, entry errors, or unlikely data points
- data points, checking / Checking for outliers, entry errors, or unlikely data points
- entry errors, checking / Checking for outliers, entry errors, or unlikely data points
- assertions, chaining / Chaining assertions

V

Variance Inflation Factor (VIF) / Fourth Anscombe relationship
VCD (Visualizing Categorical Data) / Two categorical variables
vectorization / Vectorization
vectorized functions / Vectorized functions
vectors
- about / Vectors
- subsetting / Subsetting
- advanced subsetting / Advanced subsetting
- recycling / Recycling
version control
- about / Version control
- package version management / Package version management
visualization methods
- about / Visualization methods, Visualization methods
- categorical, and continuous variables / Categorical and continuous variables
- two categorical variables / Two categorical variables
- two continuous variables / Two continuous variables
- multiple continuous variables / More than two continuous variables

W

Web Technologies Task View
- reference / Other data formats
white noise / White noise

X

XML
- using / XML
XPath
- reference / XML
- using / XML

Z

z-tables
- using / The three-sigma rule and using z-tables