Data Analysis with R

Overview of this book

Frequently the tool of choice for academics, R has spread deep into the private sector and can be found in the production pipelines at some of the most advanced and successful enterprises. The power and domain-specificity of R allow the user to express complex analytics easily, quickly, and succinctly. With over 7,000 user-contributed packages, it’s easy to find support for the latest and greatest algorithms and techniques. Starting with the basics of R and statistical reasoning, Data Analysis with R dives into advanced predictive analytics, showing how to apply those techniques to real-world data through real-world examples. Packed with engaging problems and exercises, this book begins with a review of R and its syntax. From there, get to grips with the fundamentals of applied statistics and build on this knowledge to perform sophisticated and powerful analytics. Solve the difficulties of performing data analysis in practice and find solutions for working with “messy data” and large data, communicating results, and facilitating reproducibility. This book is engineered to be an invaluable resource through many stages of anyone’s career as a data analyst.

Relationships between two categorical variables

Describing the relationships between two categorical variables is done somewhat less often than the other two broad types of bivariate analyses, but it is just as fun (and useful)!

To explore this technique, we will be using the dataset UCBAdmissions, which contains data on graduate school applicants to the University of California, Berkeley in 1973.

Before we get started, we have to wrap the dataset in a call to data.frame to coerce it into a data frame; I'll explain why soon.

  > ucba <- data.frame(UCBAdmissions)
  > head(ucba)
       Admit Gender Dept Freq
  1 Admitted   Male    A  512
  2 Rejected   Male    A  313
  3 Admitted Female    A   89
  4 Rejected Female    A   19
  5 Admitted   Male    B  353
  6 Rejected   Male    B  207
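
As a quick check on what the coercion changes, the sketch below compares the dataset before and after the call to data.frame. It assumes only base R, where UCBAdmissions ships with the datasets package; the inspection calls (class, dim, names) are additions for illustration and are not part of the original walkthrough.

```r
# UCBAdmissions is included with base R (datasets package).
# Before coercion it is a three-dimensional contingency table.
class(UCBAdmissions)   # the original object's class
dim(UCBAdmissions)     # one dimension each for Admit, Gender, and Dept

# After coercion there is one row per combination of the three
# factors, with the cell count stored in the Freq column.
ucba <- data.frame(UCBAdmissions)
dim(ucba)
names(ucba)
```

Each row of ucba is therefore an aggregated count rather than a single applicant, which matters for how the frequencies are tallied later.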

Now, what we want is a count of the number of students in each of the following four categories:

Accepted female
Rejected female
Accepted male
Rejected male
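
One possible way to get these four counts is sketched below. It assumes base R's xtabs function, which sums the Freq column over every combination of Gender and Admit; this is an illustrative choice and may not be the approach the text goes on to use. The variable name cross is hypothetical.

```r
# Rebuild the data frame so the example is self-contained.
ucba <- data.frame(UCBAdmissions)

# Sum the pre-aggregated counts (Freq) for each Gender/Admit pair,
# collapsing over the six departments.
cross <- xtabs(Freq ~ Gender + Admit, data = ucba)
print(cross)

# Sanity check: the four cells should add up to the total number
# of applicants recorded in the dataset.
sum(cross) == sum(ucba$Freq)
```

Because the data frame already holds counts rather than one row per applicant, the formula needs Freq on its left-hand side; simply counting rows would give 6 for every cell.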

Do you remember the frequency...
