Data Analysis with R, Second Edition

Overview of this book

Frequently the tool of choice for academics, R has spread deep into the private sector and can be found in the production pipelines at some of the most advanced and successful enterprises. The power and domain-specificity of R allow the user to express complex analytics easily, quickly, and succinctly. Starting with the basics of R and statistical reasoning, this book dives into advanced predictive analytics, showing how to apply those techniques to real-world data through real-world examples. Packed with engaging problems and exercises, this book begins with a review of R and its syntax with packages like Rcpp, ggplot2, and dplyr. From there, get to grips with the fundamentals of applied statistics and build on this knowledge to perform sophisticated and powerful analytics. Solve the difficulties relating to performing data analysis in practice and find solutions to working with messy data, large data, communicating results, and facilitating reproducibility. This book is engineered to be an invaluable resource through many stages of anyone’s career as a data analyst.

I'm going to shoot it to you straight. There are a lot of books about data analysis and the R programming language. I'll take it for granted that you already know why it's extremely helpful and fruitful to learn R and data analysis (if not, why are you reading this preface?!) but allow me to make a case for choosing this book to guide you in your journey.

For one, this subject didn't come naturally to me. There are those with an innate talent for grasping the intricacies of statistics the first time it is taught to them; I don't think I'm one of them. I kept at it because I love science and research, and I knew that data analysis was necessary, not because it immediately made sense to me. Today, I love the subject in and of itself rather than instrumentally, but this came only after months of heartache. Eventually, as I consumed resource after resource, the pieces of the puzzle started to come together. After this, I started tutoring interested friends in the subject and saw them trip over the same obstacles that I had to learn to climb. I think that coming from this background gives me a unique perspective on the plight of the statistics student and it allows me to reach them in a way that others may not be able to. By the way, don't let the fact that statistics used to baffle me scare you; I have it on fairly good authority that I know what I'm talking about today.

Secondly, this book was born of the frustration that most statistics texts tend to be written in the driest manner possible. In contrast, I adopt a light-hearted, buoyant approach, but without becoming agonizingly flippant.

Third, this book includes a lot of material that I wished were covered in more of the resources I used when I was learning data analysis in R. For example, the entire last unit specifically covers topics that present enormous challenges to R analysts when they first go out to apply their knowledge to imperfect real-world data.

Lastly, I thought long and hard about how to lay out this book and which order of topics was optimal. And when I say "long and hard," I mean I wrote a library and designed algorithms to do this. The order in which I present the topics in this book was very carefully considered to (a) build on top of each other, (b) follow a reasonable level of difficulty progression allowing for periodic chapters of relatively simpler material (psychologists call this intermittent reinforcement), (c) group highly related topics together, and (d) minimize the number of topics that require knowledge of yet unlearned topics (this is, unfortunately, common in statistics). If you're interested, I've detailed this procedure in a blog post that you can read at

The point is that the book you're holding is a very special one—one that I poured my soul into. Nevertheless, data analysis can be a notoriously difficult subject, and there may be times where nothing seems to make sense. During these times, remember that many others (including myself) have felt stuck too. Persevere... the reward is great. And remember, if a blockhead like me can do it, you can too. Go you!

Who this book is for

Whether you are learning data analysis for the first time or you want to deepen the understanding you already have, this book will prove an invaluable resource. If you are looking for a book to bring you all the way through the fundamentals to the application of advanced and effective analytics methodologies—and if you have some prior programming experience and a mathematical background—then this is for you.

What this book covers

Chapter 1, RefresheR, reviews the aspects of R that subsequent chapters will assume knowledge of. Here, we learn the basics of R syntax, learn of R's major data structures, write functions, load data, and install packages.

Chapter 2, The Shape of Data, discusses univariate data. We learn about different data types, how to describe univariate data, and how to visualize the shape of this data.

Chapter 3, Describing Relationships, covers multivariate data. In particular, we learn about the three main classes of bivariate relationships and learn how to describe them.

Chapter 4, Probability, kicks off a new unit by laying its foundations. We learn about basic probability theory, Bayes' theorem, and probability distributions.

Chapter 5, Using Data to Reason about the World, discusses sampling and estimation theory. Through examples, we learn of the central limit theorem, point estimation, and confidence intervals.

Chapter 6, Testing Hypotheses, introduces the subject of Null Hypothesis Significance Testing (NHST). We learn of many popular hypothesis tests and their non-parametric alternatives. Perhaps most importantly, we gain a thorough understanding of the misconceptions and gotchas of NHST.

Chapter 7, Bayesian Methods, presents an alternative to NHST based on a more intuitive view of probability. We learn the advantages and drawbacks of this approach too.

Chapter 8, The Bootstrap, details another approach to NHST by using a technique called resampling. We learn of its advantages and shortcomings. In addition, this chapter serves as a great reinforcement of the material in chapters 5 and 6.

Chapter 9, Predicting Continuous Variables, kicks off our new unit on predictive analytics and thoroughly discusses linear regression. Before the chapter's conclusion, we learn all about the technique, when to use it, and what traps to look out for.

Chapter 10, Predicting Categorical Variables, introduces four of the most popular classification techniques. By using all four on the same examples, we gain an appreciation for what makes each technique shine.

Chapter 11, Predicting Changes with Time, closes our unit of predictive analytics by introducing the topics of time series analysis and forecasting. This ends with a firm foundation on one of the premier methods of time series forecasting.

Chapter 12, Sources of Data, begins the final unit detailing data analysis in the real world. This chapter is all about how to use different data sources in R. In particular, we learn how to interface with databases, and request and load JSON and XML via an engaging example.

Chapter 13, Dealing with Missing Data, details what missing data is, how to identify types of missing data, some not-so-great methods for dealing with them, and two principled methods for handling them.

Chapter 14, Dealing with Messy Data, introduces some of the snags of working with less-than-perfect data in practice. This includes checking for unexpected input, wielding regex, and verifying data veracity with assertr.

Chapter 15, Dealing with Large Data, discusses some of the techniques that can be used to cope with data sets larger than what can be handled swiftly without a little planning. The key components of this chapter are on parallelization and Rcpp.

Chapter 16, Working with Popular R Packages, acknowledges that we’ve already wielded a lot of popular packages in this unit, but this chapter fills in some of the gaps and introduces some of the most modern packages that make speed and ease of use a priority.

Chapter 17, Reproducibility and Best Practices, closes with the extremely important (but often ignored) topic of how to use R like a professional. This includes learning about tooling, organization, and reproducibility.

To get the most out of this book

All code in this book was written against R 3.4.3, the latest version at the time of writing. As a matter of good practice, you should keep your R version up to date, but most, if not all, of the code should work with any reasonably recent version of R. Some of the R packages we will be installing require more recent versions, though. For the other software that this book uses, instructions will be furnished pro re nata (as needed). If you want to get a head start, however, install RStudio, JAGS, and a C++ compiler (or Rtools if you use Windows).
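
If you'd like to check your setup before starting, something like the following rough sketch may help. It compares the running R version against 3.4.3 and reports which packages from a list are not yet installed. The helper name missing_packages is hypothetical (it is not from the book), and the package names are simply ones mentioned in the chapter descriptions above, not an official requirements list.

```r
# Version the book's code was written against (taken from the text above).
required <- "3.4.3"

# getRversion() returns the running R version in a form that can be compared.
if (getRversion() < required) {
  message("R ", getRversion(), " is older than ", required, "; consider upgrading.")
} else {
  message("R ", getRversion(), " should be recent enough.")
}

# Hypothetical helper (an assumption for illustration, not from the book):
# returns the subset of `pkgs` that cannot be loaded on this machine.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages named in this preface; adjust the list to taste.
missing_packages(c("ggplot2", "dplyr", "Rcpp", "assertr"))
```

Anything this returns can then be passed to install.packages(). Note that it only covers R packages; RStudio, JAGS, and the C++ compiler are separate installs.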

Download the example code files

You can download the example code files for this book from your account at If you purchased this book elsewhere, you can visit and register to have the files emailed directly to you.

You can download the code files by following these steps:

  1. Log in or register at
  2. Select the SUPPORT tab.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR/7-Zip for Windows
  • Zipeg/iZip/UnRarX for Mac
  • 7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at We also have other code bundles from our rich catalog of books and videos available at Check them out!

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

A block of code is set as follows:

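# don't worry about memorizing this
# estimate the density of the temperature readings
temp.density <- density(airquality$Temp)
# turn the estimate into a function; rule=2 reuses the edge values outside the observed range
pdf <- approxfun(temp.density$x, temp.density$y, rule=2)
# approximate the probability of a temperature between 80 and 90
integrate(pdf, 80, 90)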

When we wish to draw your attention to a particular part of a code block or output, the relevant lines or items are set in bold:

table(mtcars$carb) / length(mtcars$carb) 
      1       2       3       4       6       8  
0.21875 0.31250 0.09375 0.31250 0.03125 0.03125

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."


Warnings or important notes appear like this.


Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit


Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit