Data Analysis with R, Second Edition

Overview of this book

Frequently the tool of choice for academics, R has spread deep into the private sector and can be found in the production pipelines at some of the most advanced and successful enterprises. The power and domain-specificity of R allow the user to express complex analytics easily, quickly, and succinctly. Starting with the basics of R and statistical reasoning, this book dives into advanced predictive analytics, showing how to apply those techniques to real-world data through real-world examples. Packed with engaging problems and exercises, this book begins with a review of R and its syntax with packages like Rcpp, ggplot2, and dplyr. From there, get to grips with the fundamentals of applied statistics and build on this knowledge to perform sophisticated and powerful analytics. Solve the difficulties relating to performing data analysis in practice and find solutions to working with messy data, large data, communicating results, and facilitating reproducibility. This book is engineered to be an invaluable resource through many stages of anyone’s career as a data analyst.

Navigating the basics

In the interactive R interpreter, any line starting with a > character denotes R asking for input. (If you see a + prompt, it means that you didn't finish typing a statement at the prompt and R is asking you to provide the rest of the expression.) Striking the return key will send your input to R to be evaluated. R's response is then spit back at you in the line immediately following your input, after which R asks for more input. This is called a REPL (read-evaluate-print loop). It is also possible for R to read a batch of commands saved in a file (unsurprisingly called batch mode), but we'll be using the interactive mode for most of the book.
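As a small illustration of running commands from a file, the sketch below writes two commands to a temporary file and evaluates them with source(). The file contents and the variable name total are made up for this example.

```r
# Write a tiny script to a temporary file (contents are illustrative only)
script_path <- tempfile(fileext = ".R")
writeLines(c("total <- 2 + 2", "print(total)"), script_path)

# Evaluate every line of the file, in order, in the current session
source(script_path)   # prints [1] 4

# From a system shell, a saved script can be run in batch mode with:
#   Rscript path/to/script.R
```

Inside a sourced file, results are not printed automatically the way they are at the prompt, which is why the script calls print() explicitly.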

As you might imagine, R supports all the familiar mathematical operators, as do most other languages.

Arithmetic and assignment

Check out the following example:

  > 2 + 2 
  [1] 4 
 
  > 9 / 3 
  [1] 3 
 
  > 5 %% 2    # modulus operator (remainder of 5 divided by 2) 
  [1] 1

Anything that occurs after the octothorpe or pound sign, # (or hashtag, for you young'uns), is ignored by the R interpreter up to the end of the line. Such text is called a comment, and it is useful for documenting code in natural language.

In a multi-operation arithmetic expression, R will follow the standard order of operations from math. In order to override this natural order, you have to use parentheses flanking the sub-expression that you'd like to be performed first:

   > 3 + 2 - 10 ^ 2        # ^ is the exponent operator 
   [1] -95 
   > 3 + (2 - 10) ^ 2 
   [1] 67
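Two precedence details are worth checking for yourself, because they surprise many newcomers: exponentiation binds more tightly than the unary minus, and chained exponents are evaluated from right to left. A brief sketch:

```r
# Exponentiation is applied before the unary minus
-2 ^ 2        # -4, read as -(2 ^ 2)
(-2) ^ 2      # 4

# Chained exponents group from right to left
2 ^ 3 ^ 2     # 512, read as 2 ^ (3 ^ 2)
(2 ^ 3) ^ 2   # 64
```

When in doubt, adding parentheses costs nothing and makes the intended order explicit.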

In practice, almost all compound expressions are split up, with intermediate values assigned to variables. Using a variable in a later expression is just like substituting the value that was assigned to it. The (primary) assignment operator is <-:

   > # assignments follow the form VARIABLE <- VALUE 
   > var <- 10 
   > var 
   [1] 10 
   > var ^ 2 
   [1] 100 
   > VAR / 2             # variable names are case-sensitive  
   Error: object 'VAR' not found

Notice that the first and second lines in the preceding code snippet didn't have any output to display, so R just immediately asked for more input. The first line is a comment, and the second is an assignment; an assignment returns its value invisibly, so nothing is printed. Its only job is to give a value to a variable or change the existing value of a variable. Generally, operations and functions on variables in R don't change the value of the variable. Instead, they return the result of the operation. If you want to change a variable to the result of an operation using that variable, you have to reassign that variable as follows:

   > var               # var is 10 
   [1] 10 
   > var ^ 2 
   [1] 100 
   > var               # var is still 10 
   [1] 10 
   > var <- var ^ 2    # nothing is printed
   > var               # var is now 100 
   [1] 100

Be aware that variable names may contain numbers, underscores, and periods; the last of these trips up a lot of people who are familiar with other programming languages that disallow periods in variable names. The only further restrictions are that a name must start with a letter (or with a period that is not followed by a digit), and that it must not be one of the reserved words in R, such as TRUE, Inf, and so on.
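One way to check a candidate name is base R's make.names(), which converts a string into a syntactically valid name and leaves already-valid names unchanged. The names used below are invented for illustration.

```r
# Periods and underscores are both allowed inside a name
my.var   <- 1
my_var_2 <- 2

# make.names() returns a valid name, so an unchanged result
# means the candidate was already acceptable
make.names("my.var")   # "my.var"  (already valid)
make.names("2fast")    # "X2fast"  (a name cannot start with a digit)
make.names("TRUE")     # "TRUE."   (reserved words are altered)
```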

Although the arithmetic operators that we've seen thus far are functions in their own right, most functions in R take the form function_name(value(s) supplied to the function). The values supplied to the function are called the arguments of that function:

   > cos(3.14159)      # cosine function 
   [1] -1 
   > cos(pi)           # pi is a constant that R provides 
   [1] -1 
   > acos(-1)          # arccosine function 
   [1] 3.141593
   > # the result of one function can be used as an argument to another
   > acos(cos(pi)) + 10
   [1] 13.14159

If you paid attention in math class, you'll know that the cosine of pi is -1 and that arccosine is the inverse function of cosine.
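Many functions take more than one argument, and arguments can be supplied either by position or by name. The following sketch uses two base R functions, round() and log(), to show both styles.

```r
# Arguments can be matched by position...
round(3.14159, 2)            # 3.14
# ...or by name, which is often easier to read
round(3.14159, digits = 2)   # 3.14

# log() has an optional base argument; the default is the natural log
log(100, base = 10)          # 2
log(exp(1))                  # 1
```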

There are hundreds of such useful functions defined in base R, only a handful of which we will see in this book. Two sections from now, we will be building our very own functions.

Before we move on from arithmetic, it will serve us well to visit some of the odd values that may result from certain operations:

   > 1 / 0 
   [1] Inf 
   > 0 / 0 
   [1] NaN

It is common during practical usage of R to accidentally divide by zero. As you can see, this undefined operation yields an infinite value in R. Dividing zero by zero yields the value NaN, which stands for Not a Number.
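Because these values can slip into a calculation unnoticed, it helps to know how to detect them. A short sketch using base R's is.finite(), is.infinite(), and is.nan():

```r
-1 / 0              # -Inf: infinities carry a sign
Inf - Inf           # NaN: this difference is undefined

# Helper functions for detecting these values
is.finite(1 / 0)    # FALSE
is.infinite(1 / 0)  # TRUE
is.nan(0 / 0)       # TRUE

# NaN is not equal to anything, itself included, so test for it
# with is.nan() rather than ==
NaN == NaN          # NA
```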

Logicals and characters

So far, we've only been dealing with numerics, but there are other atomic data types in R:

   > foo <- TRUE        # foo is of the logical data type 
   > class(foo)         # class() tells us the type 
   [1] "logical" 
   > bar <- "hi!"       # bar is of the character data type 
   > class(bar) 
   [1] "character"

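The logical data type (also called Boolean) can hold the values TRUE or FALSE. The shorthands T and F are also available, but they are ordinary variables that can be reassigned, so the full names are safer. The familiar operators from Boolean algebra are defined for these types: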

   > foo 
   [1] TRUE 
   > foo && TRUE                 # boolean and 
   [1] TRUE 
   > foo && FALSE 
   [1] FALSE 
   > foo || FALSE                # boolean or 
   [1] TRUE 
   > !foo                        # negation operator 
   [1] FALSE

In a Boolean expression with a logical value and a number, any number that is not 0 is interpreted as TRUE:

   > foo && 1 
   [1] TRUE 
   > foo && 2 
   [1] TRUE 
   > foo && 0 
   [1] FALSE
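The same conversion can be made explicit with as.logical(), and it also runs in the other direction when a logical value is used in arithmetic. A minimal sketch:

```r
as.logical(0)      # FALSE
as.logical(2)      # TRUE
as.logical(-0.5)   # TRUE: any non-zero number counts as TRUE

# In arithmetic, TRUE acts as 1 and FALSE as 0
TRUE + TRUE        # 2
as.numeric(FALSE)  # 0
```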

Additionally, there are functions and operators that return logical values such as the following:

   > 4 < 2           # less than operator 
   [1] FALSE 
   > 4 >= 4          # greater than or equal to 
   [1] TRUE 
   > 3 == 3          # equality operator 
   [1] TRUE 
   > 3 != 2          # inequality operator 
   [1] TRUE
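One caveat with the equality operator: decimal fractions are stored with a tiny rounding error, so == can give a surprising answer for non-integer arithmetic. The sketch below shows the problem and one common workaround, all.equal(), which compares within a small tolerance.

```r
0.1 + 0.2 == 0.3                    # FALSE, because of rounding error
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal within a small tolerance
```

Wrapping the call in isTRUE() matters, because all.equal() returns a description of the difference rather than FALSE when the values differ.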

Just as there are functions in R that are only defined to work on the numeric and logical data types, there are other functions that are designed to work only with the character data type, also known as strings:

   > lang.domain <- "statistics" 
   > lang.domain <- toupper(lang.domain) 
   > print(lang.domain) 
   [1] "STATISTICS" 
   > # retrieves substring from first character to fourth character 
   > substr(lang.domain, 1, 4)           
   [1] "STAT" 
   > gsub("I", "1", lang.domain)  # replaces every "I" with "1"
   [1] "STAT1ST1CS" 
   > # combines character strings 
   > paste("R does", lang.domain, "!!!") 
   [1] "R does STATISTICS !!!"
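A few more base R string functions round out the picture. In particular, the space before the exclamation marks in the last output comes from paste(), which separates its arguments with a space unless told otherwise. A short sketch:

```r
lang.domain <- "STATISTICS"

nchar(lang.domain)      # 10
tolower(lang.domain)    # "statistics"

# paste() inserts a space between its arguments by default;
# sep changes that, and paste0() is shorthand for sep = ""
paste("R does", lang.domain, "!!!", sep = "")   # "R doesSTATISTICS!!!"
paste0("R does ", lang.domain, "!!!")           # "R does STATISTICS!!!"

# sub() replaces only the first match, whereas gsub() replaces them all
sub("I", "1", lang.domain)                      # "STAT1STICS"
```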

Flow of control

The last topic in this section will be flow of control constructs.

The most basic flow of control construct is the if statement. The argument to an if statement (what goes between the parentheses) is an expression that returns a logical value. The block of code following the if statement gets executed only if the expression yields TRUE:

   > if(2 + 2 == 4)
   +    print("very good")
   [1] "very good"
   > if(2 + 2 == 5)
   +    print("all hail to the thief")

It is possible to execute more than one statement if an if condition is triggered; you just have to use curly brackets ({}) to contain the statements:

   > if((4/2==2) && (2*2==4)){ 
   +    print("four divided by two is two...") 
   +    print("and two times two is four") 
   + } 
   [1] "four divided by two is two..."
   [1] "and two times two is four"

It is also possible to specify a block of code that will get executed if the if conditional is FALSE:

   > closing.time <- TRUE 
   > if(closing.time){ 
   +    print("you don't have to go home") 
   +    print("but you can't stay here") 
   + } else{ 
   +    print("you can stay here!") 
   + } 
   [1] "you don't have to go home"
   [1] "but you can't stay here"
   > if(!closing.time){
   +    print("you don't have to go home")
   +    print("but you can't stay here")
   + } else{
   +    print("you can stay here!")
   + }
   [1] "you can stay here!"
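When there are more than two cases, the conditions can be chained with else if. The following is a minimal sketch; the variable temperature and the cut-off values are hypothetical, chosen only for illustration.

```r
temperature <- 15   # hypothetical value, for illustration only

if (temperature > 25) {
  label <- "hot"
} else if (temperature > 10) {
  label <- "mild"
} else {
  label <- "cold"
}
print(label)        # [1] "mild"
```

The conditions are tested from top to bottom, and only the first block whose condition is TRUE is run. Note that each else sits on the same line as the preceding closing bracket; at the prompt, R would otherwise treat the if statement as finished before it ever saw the else.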

There are other flow of control constructs (like while and for), but we won't be directly using them much in this text.
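For completeness, here is a brief sketch of both constructs. It uses 1:5, which is R's shorthand for the integers 1 through 5; the variable names are arbitrary.

```r
# A for loop runs its body once for each value in a sequence
total <- 0
for (i in 1:5) {
  total <- total + i
}
total               # 15

# A while loop repeats for as long as its condition is TRUE
n <- 1
while (n < 100) {
  n <- n * 2
}
n                   # 128
```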
