Data Analysis with R, Second Edition

Data Analysis with R, Second Edition - Second Edition

Overview of this book

Frequently the tool of choice for academics, R has spread deep into the private sector and can be found in the production pipelines at some of the most advanced and successful enterprises. The power and domain-specificity of R allows the user to express complex analytics easily, quickly, and succinctly. Starting with the basics of R and statistical reasoning, this book dives into advanced predictive analytics, showing how to apply those techniques to real-world data though with real-world examples. Packed with engaging problems and exercises, this book begins with a review of R and its syntax with packages like Rcpp, ggplot2, and dplyr. From there, get to grips with the fundamentals of applied statistics and build on this knowledge to perform sophisticated and powerful analytics. Solve the difficulties relating to performing data analysis in practice and find solutions to working with messy data, large data, communicating results, and facilitating reproducibility. This book is engineered to be an invaluable resource through many stages of anyone’s career as a data analyst.

Title Page

Packt Upsell

Contributors

Preface

Free Chapter

RefresheR

Navigating the basics

Vectors

Working with packages

Exercises

Summary

The Shape of Data

Univariate data

Frequency distributions

Central tendency

Spread

Populations, samples, and estimation

Probability distributions

Visualization methods

Exercises

Summary

Describing Relationships

Multivariate data

Relationships between a categorical and continuous variable

Relationships between two categorical variables

The relationship between two continuous variables

Visualization methods

Exercises

Summary

Probability

Basic probability

A tale of two interpretations

Sampling from distributions

The normal distribution

Exercises

Summary

Using Data To Reason About The World

Estimating means

The sampling distribution

Interval estimation

Smaller samples

Exercises

Summary

Testing Hypotheses

The null hypothesis significance testing framework

Testing the mean of one sample

Testing two means

Testing more than two means

Testing independence of proportions

What if my assumptions are unfounded?

Exercises

Summary

Bayesian Methods

The big idea behind Bayesian analysis

Choosing a prior

Who cares about coin flips

Enter MCMC – stage left

Using JAGS and runjags

Fitting distributions the Bayesian way

The Bayesian independent samples t-test

Exercises

Summary

The Bootstrap

What's... uhhh... the deal with the bootstrap?

Performing the bootstrap in R (more elegantly)

Confidence intervals

A one-sample test of means

Bootstrapping statistics other than the mean

Busting bootstrap myths

Exercises

Summary

Predicting Continuous Variables

Linear models

Simple linear regression

Simple linear regression with a binary predictor

Multiple regression

Regression with a non-binary predictor

Kitchen sink regression

The bias-variance trade-off

Linear regression diagnostics

Advanced topics

Exercises

Summary

Predicting Categorical Variables

Choosing a classifier

Exercises

Summary

Predicting Changes with Time

What is a time series?

What is forecasting?

Creating and plotting time series

Components of time series

Time series decomposition

White noise

Autocorrelation

Smoothing

ETS and the state space model

Interventions for improvement

What we didn't cover

Citations for the climate change data

Exercises

Summary

Sources of Data

XML

Summary

Dealing with Missing Data

Analysis with missing data

Visualizing missing data

Types of missing data

Unsophisticated methods for dealing with missing data

So how does mice come up with the imputed values?

Exercises

Summary

Dealing with Messy Data

Checking unsanitized data

Regular expressions

Other tools for messy data

Exercises

Summary

Dealing with Large Data

Wait to optimize

Using a bigger and faster machine

Be smart about your code

Using optimized packages

Using another R implementation

Using parallelization

Using Rcpp

Being smarter about your code

Exercises

Summary

Working with Popular R Packages

The data.table package

Using dplyr and tidyr to manipulate data

Functional programming as a main tidyverse principle

Reshaping data with tidyr

Exercises

Summary

Reproducibility and Best Practices

R scripting

R projects

Version control

Communicating results

Exercises

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Index

Customer Reviews

5 star

4 star

3 star

2 star

1 star

Loading data into R

Thus far, we've only been entering data directly into the interactive R console. For any dataset of non-trivial size, this is, obviously, an intractable solution. Fortunately for us, R has a robust suite of functions to read data directly from external files.

Go ahead and create a file on your hard disk called favorites.txt that looks like this:

flavor,number 
pistachio,6 
mint chocolate chip,7 
vanilla,5
chocolate,10 
strawberry,2 
neopolitan,4

This data represents the number of students in a class that prefer a particular flavor of soy ice cream. We can read the file into a variable called favs as follows:

 > favs <- read.table("favorites.txt", sep=",", header=TRUE)

If you get an error that there is no such file or directory, give R the full path name to your dataset or, alternatively, run the following command:

 > favs <- read.table(file.choose(), sep=",", header=TRUE)

The preceding command brings up an open file dialog to let you navigate to the file that you've just created.

The sep=","argument tells R that each data element in a row is separated by a comma. Other common data formats have values separated by tabs and pipes ("|"). The value of sep should then be "\t" and "|", respectively.

The header=TRUEargument tells R that the first row of the file should be interpreted as the names of the columns. Remember, you can enter ?read.table at the console to learn more about these options.

Reading from files in this comma-separated values format (usually with the .csv file extension) is so common that R has a more specific function just for it. The preceding data import expression can be best written simply as follows:

 > favs <- read.csv("favorites.txt")

Now, we have all the data in the file held in a variable of the data.frame class. A data frame can be thought of as a rectangular array of data that you might see in a spreadsheet application. In this way, a data frame can also be thought of as a matrix; indeed, we can use matrix-style indexing to extract elements from it. A data frame differs from a matrix, though, in that a data frame may have columns of differing types. For example, whereas a matrix would only allow one of these types, the dataset that we just loaded contains character data in its first column and numeric data in its second column.

Let's check out what we have using the head() command, which will show us the first few lines of a data frame:

  > head(favs) 
                 flavor number 
  1           pistachio      6 
  2 mint chocolate chip      7 
  3             vanilla      5 
  4           chocolate     10 
  5          strawberry      2 
  6          neopolitan      4 
 
  > class(favs) 
  [1] "data.frame" 
  > class(favs$flavor) 
  [1] "factor" 
  > class(favs$number) 
  [1] "numeric"

I lied, okay! So what?! Technically, flavor is a factor data type, not a character type.

We haven't seen factors yet, but the idea behind them is really simple. Essentially, factors are codings for categorical variables, which are variables that take on one of a finite number of categories--think {"high", "medium", and "low"} or {"control", "experimental"}.

Though factors are extremely useful in statistical modeling in R, the fact that R, by default, automatically interprets a column from the data read from disk as a type factor if it contains characters is something that trips up novices and seasoned useRs alike. Due to this, we will primarily prevent this behavior manually by adding the stringsAsFactors optional keyword argument to the read.* commands:

  > favs <- read.csv("favorites.txt", stringsAsFactors=FALSE) 
  > class(favs$flavor) 
  [1] "character"

Much better, for now! If you'd like to make this behavior the new default, read the ?options manual page. We can always convert to factors later on if we need to!

If you haven't noticed already, I've snuck a new operator on you--$, the extract operator. This is the most popular way to extract attributes (or columns) from a data frame. You can also use double square brackets ([[ and ]]) to do this.

These are both in addition to the canonical matrix indexing option. The following three statements are thus, in this context, functionally identical:

  > favs$flavor 
  [1] "pistachio"           "mint chocolate chip" "vanilla"             
  [4] "chocolate"           "strawberry"          "neopolitan"          
  > favs[["flavor"]] 
  [1] "pistachio"           "mint chocolate chip" "vanilla"             
  [4] "chocolate"           "strawberry"          "neopolitan"          
  > favs[,1] 
  [1] "pistachio"           "mint chocolate chip" "vanilla"             
  [4] "chocolate"           "strawberry"          "neopolitan"

Note

Notice how R has now printed another number in square brackets--besides [1]--along with our output. This is to show us that chocolate is the fourth element of the vector that was returned from the extraction.

You can use the names() function to get a list of the columns available in a data frame. You can even reassign names using the same:

  > names(favs) 
  [1] "flavor" "number" 
  > names(favs)[1] <- "flav" 
  > names(favs) 
  [1] "flav"   "number"

Lastly, we can get a compact display of the structure of a data frame using the str() function on it:

  > str(favs) 
  'data.frame': 6 obs. of  2 variables: 
   $ flav  : chr  "pistachio" "mint chocolate chip" "vanilla"   
   "chocolate" ... 
   $ number: num  6 7 5 10 2 4

Actually, you can use this function on any R structure--the property of functions that change their behavior based on the type of input is called polymorphism.

Data Analysis with R, Second Edition - Second Edition

Data Analysis with R, Second Edition - Second Edition

Overview of this book

Related Content you might be interested in

Current Title:

Data Analysis with R, Second Edition - Second Edition

Advanced Analytics with R and Tableau

Machine Learning with R Cookbook

The Statistics and Machine Learning with R Workshop

Loading data into R

Note