Data Analysis with R, Second Edition

Overview of this book

Frequently the tool of choice for academics, R has spread deep into the private sector and can be found in the production pipelines at some of the most advanced and successful enterprises. The power and domain-specificity of R allow the user to express complex analytics easily, quickly, and succinctly. Starting with the basics of R and statistical reasoning, this book dives into advanced predictive analytics, showing how to apply those techniques to real-world data through real-world examples. Packed with engaging problems and exercises, this book begins with a review of R and its syntax with packages like Rcpp, ggplot2, and dplyr. From there, get to grips with the fundamentals of applied statistics and build on this knowledge to perform sophisticated and powerful analytics. Solve the difficulties relating to performing data analysis in practice and find solutions to working with messy data, large data, communicating results, and facilitating reproducibility. This book is engineered to be an invaluable resource through many stages of anyone’s career as a data analyst.

Title Page

Packt Upsell

Contributors

Preface

RefresheR

Navigating the basics

Vectors

Working with packages

Exercises

Summary

The Shape of Data

Univariate data

Frequency distributions

Central tendency

Spread

Populations, samples, and estimation

Probability distributions

Visualization methods

Exercises

Summary

Describing Relationships

Multivariate data

Relationships between a categorical and continuous variable

Relationships between two categorical variables

The relationship between two continuous variables

Visualization methods

Exercises

Summary

Probability

Basic probability

A tale of two interpretations

Sampling from distributions

The normal distribution

Exercises

Summary

Using Data To Reason About The World

Estimating means

The sampling distribution

Interval estimation

Smaller samples

Exercises

Summary

Testing Hypotheses

The null hypothesis significance testing framework

Testing the mean of one sample

Testing two means

Testing more than two means

Testing independence of proportions

What if my assumptions are unfounded?

Exercises

Summary

Bayesian Methods

The big idea behind Bayesian analysis

Choosing a prior

Who cares about coin flips

Enter MCMC – stage left

Using JAGS and runjags

Fitting distributions the Bayesian way

The Bayesian independent samples t-test

Exercises

Summary

The Bootstrap

What's... uhhh... the deal with the bootstrap?

Performing the bootstrap in R (more elegantly)

Confidence intervals

A one-sample test of means

Bootstrapping statistics other than the mean

Busting bootstrap myths

Exercises

Summary

Predicting Continuous Variables

Linear models

Simple linear regression

Simple linear regression with a binary predictor

Multiple regression

Regression with a non-binary predictor

Kitchen sink regression

The bias-variance trade-off

Linear regression diagnostics

Advanced topics

Exercises

Summary

Predicting Categorical Variables

Choosing a classifier

Exercises

Summary

Predicting Changes with Time

What is a time series?

What is forecasting?

Creating and plotting time series

Components of time series

Time series decomposition

White noise

Autocorrelation

Smoothing

ETS and the state space model

Interventions for improvement

What we didn't cover

Citations for the climate change data

Exercises

Summary

Sources of Data

XML

Summary

Dealing with Missing Data

Analysis with missing data

Visualizing missing data

Types of missing data

Unsophisticated methods for dealing with missing data

So how does mice come up with the imputed values?

Exercises

Summary

Dealing with Messy Data

Checking unsanitized data

Regular expressions

Other tools for messy data

Exercises

Summary

Dealing with Large Data

Wait to optimize

Using a bigger and faster machine

Be smart about your code

Using optimized packages

Using another R implementation

Using parallelization

Using Rcpp

Being smarter about your code

Exercises

Summary

Working with Popular R Packages

The data.table package

Using dplyr and tidyr to manipulate data

Functional programming as a main tidyverse principle

Reshaping data with tidyr

Exercises

Summary

Reproducibility and Best Practices

R scripting

R projects

Version control

Communicating results

Exercises

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Index

Preface

I'm going to shoot it to you straight. There are a lot of books about data analysis and the R programming language. I'll take it for granted that you already know why it's extremely helpful and fruitful to learn R and data analysis (if not, why are you reading this preface?!) but allow me to make a case for choosing this book to guide you in your journey.

For one, this subject didn't come naturally to me. There are those with an innate talent for grasping the intricacies of statistics the first time it is taught to them; I don't think I'm one of them. I kept at it because I love science and research, and I knew that data analysis was necessary, not because it immediately made sense to me. Today, I love the subject in and of itself rather than instrumentally, but this came only after months of heartache. Eventually, as I consumed resource after resource, the pieces of the puzzle started to come together. After this, I started tutoring interested friends in the subject and saw them trip over the same obstacles that I had to learn to climb. I think that coming from this background gives me a unique perspective on the plight of the statistics student, and it allows me to reach them in a way that others may not be able to. By the way, don't let the fact that statistics used to baffle me scare you; I have it on fairly good authority that I know what I'm talking about today.

Secondly, this book was born of the frustration that most statistics texts tend to be written in the driest manner possible. In contrast, I adopt a light-hearted, buoyant approach, but without becoming agonizingly flippant.

Third, this book includes a lot of material that I wished were covered in more of the resources I used when I was learning data analysis in R. For example, the entire last unit specifically covers topics that present enormous challenges to R analysts when they first go out to apply their knowledge to imperfect real-world data.

Lastly, I thought long and hard about how to lay out this book and which order of topics was optimal. And when I say "long and hard," I mean I wrote a library and designed algorithms to do this. The order in which I present the topics in this book was very carefully considered to (a) build on top of each other, (b) follow a reasonable level of difficulty progression allowing for periodic chapters of relatively simpler material (psychologists call this intermittent reinforcement), (c) group highly related topics together, and (d) minimize the number of topics that require knowledge of yet unlearned topics (this is, unfortunately, common in statistics). If you're interested, I've detailed this procedure in a blog post that you can read at http://bit.ly/teach-stats.

The point is that the book you're holding is a very special one—one that I poured my soul into. Nevertheless, data analysis can be a notoriously difficult subject, and there may be times where nothing seems to make sense. During these times, remember that many others (including myself) have felt stuck too. Persevere... the reward is great. And remember, if a blockhead like me can do it, you can too. Go you!

Who this book is for

Whether you are learning data analysis for the first time or you want to deepen the understanding you already have, this book will prove an invaluable resource. If you are looking for a book to bring you all the way through the fundamentals to the application of advanced and effective analytics methodologies—and if you have some prior programming experience and a mathematical background—then this is for you.

What this book covers

Chapter 1, RefresheR, reviews the aspects of R that subsequent chapters will assume knowledge of. Here, we learn the basics of R syntax, learn of R's major data structures, write functions, load data, and install packages.

Chapter 2, The Shape of Data, discusses univariate data. We learn about different data types, how to describe univariate data, and how to visualize the shape of this data.

Chapter 3, Describing Relationships, covers multivariate data. In particular, we learn about the three main classes of bivariate relationships and learn how to describe them.

Chapter 4, Probability, kicks off a new unit by laying its foundations. We learn about basic probability theory, Bayes' theorem, and probability distributions.

Chapter 5, Using Data to Reason about the World, discusses sampling and estimation theory. Through examples, we learn of the central limit theorem, point estimation, and confidence intervals.

Chapter 6, Testing Hypotheses, introduces the subject of Null Hypothesis Significance Testing (NHST). We learn of many popular hypothesis tests and their non-parametric alternatives. Perhaps most importantly, we gain a thorough understanding of the misconceptions and gotchas of NHST.

Chapter 7, Bayesian Methods, presents an alternative to NHST based on a more intuitive view of probability. We learn the advantages and drawbacks of this approach too.

Chapter 8, The Bootstrap, details another approach to NHST by using a technique called resampling. We learn of its advantages and shortcomings. In addition, this chapter serves as a great reinforcement of the material in chapters 5 and 6.

Chapter 9, Predicting Continuous Variables, kicks off our new unit on predictive analytics and thoroughly discusses linear regression. Before the chapter's conclusion, we learn all about the technique, when to use it, and what traps to look out for.

Chapter 10, Predicting Categorical Variables, introduces four of the most popular classification techniques. By using all four on the same examples, we gain an appreciation for what makes each technique shine.

Chapter 11, Predicting Changes with Time, closes our unit of predictive analytics by introducing the topics of time series analysis and forecasting. This ends with a firm foundation on one of the premier methods of time series forecasting.

Chapter 12, Sources of Data, begins the final unit detailing data analysis in the real world. This chapter is all about how to use different data sources in R. In particular, we learn how to interface with databases, and request and load JSON and XML via an engaging example.

Chapter 13, Dealing with Missing Data, details what missing data is, how to identify types of missing data, some not-so-great methods for dealing with them, and two principled methods for handling them.

Chapter 14, Dealing with Messy Data, introduces some of the snags of working with less-than-perfect data in practice. This includes checking for unexpected input, wielding regex, and verifying data veracity with assertr.

Chapter 15, Dealing with Large Data, discusses some of the techniques that can be used to cope with data sets larger than what can be handled swiftly without a little planning. The key components of this chapter are on parallelization and Rcpp.

Chapter 16, Working with Popular R Packages, acknowledges that we’ve already wielded a lot of popular packages in this unit, but this chapter fills in some of the gaps and introduces some of the most modern packages that make speed and ease of use a priority.

Chapter 17, Reproducibility and Best Practices, closes with the extremely important (but often ignored) topic of how to use R like a professional. This includes learning about tooling, organization, and reproducibility.
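To give a flavor of the resampling idea mentioned for Chapter 8, here is a minimal sketch of a percentile bootstrap interval for a mean. It is not taken from the book: the choice of the built-in airquality data, the 2,000 resamples, the seed, and the variable names are all assumptions made for illustration.

```r
# Illustrative sketch only: a percentile bootstrap interval for a mean.
# The dataset, seed, and number of resamples are assumptions, not the book's code.
set.seed(1)                      # arbitrary seed so the sketch is repeatable
temps <- airquality$Temp         # built-in dataset; this column has no missing values

# Resample the data with replacement and record the mean of each resample
boot_means <- replicate(2000, mean(sample(temps, replace = TRUE)))

# Take the middle 95% of the resampled means as an interval estimate
ci <- quantile(boot_means, c(0.025, 0.975))
print(ci)
```

The interval is read directly off the resampled means, which is what makes the approach attractive when a formula for the sampling distribution is not at hand.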

To get the most out of this book

All code in this book has been written against the latest version of R (3.4.3 at the time of writing). As a matter of good practice, you should keep your R version up to date, but most, if not all, code should work with any reasonably recent version of R. Some of the R packages we will be installing will require more recent versions though. For the other software that this book uses, instructions will be furnished pro re nata. If you want to get a head start, however, install RStudio, JAGS, and a C++ compiler (or Rtools if you use Windows).
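If you would like to check your setup before starting, something along these lines may help. It is only a sketch: the package list is an assumption assembled from package names mentioned elsewhere on this page, not an official requirements list, and the variable names are made up for illustration.

```r
# Illustrative sketch only: compare the running R version with the one the
# book was written against, and see which packages are already available.
book_version <- "3.4.3"                        # version stated in the text above
version_ok <- getRversion() >= book_version    # TRUE if R is at least that recent

# Assumed list: packages named on this page, not an official requirements list
pkgs <- c("Rcpp", "ggplot2", "dplyr", "tidyr",
          "data.table", "runjags", "mice", "assertr")

# requireNamespace() returns TRUE or FALSE without attaching the package
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

message("R ", as.character(getRversion()),
        if (version_ok) " is at least " else " is older than ",
        book_version)
if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)   # uncomment to install from CRAN
}
```

Note that runjags also needs the separate JAGS program installed on your system, and Rcpp needs a working C++ compiler; this check does not cover either.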

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Data-Analysis-with-R-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

A block of code is set as follows:

# don't worry about memorizing this 
temp.density <- density(airquality$Temp) 
pdf <- approxfun(temp.density$x, temp.density$y, rule=2) 
integrate(pdf, 80, 90)

When we wish to draw your attention to a particular part of a code block or output, the relevant lines or items are set in bold:

table(mtcars$carb) / length(mtcars$carb) 
   
      1       2       3       4       6       8  
0.21875 0.31250 0.09375 0.31250 0.03125 0.03125

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."

Note

Warnings or important notes appear like this.

Note

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.
