Mastering Scientific Computing with R

Cleaning datasets in R


The first step in any data analysis is preparing the data. The rest of this chapter mostly deals with this topic, but here we review some basic considerations and R techniques. The most important part of any data analysis is knowing the dataset and having some idea of how each of its variables was created.

For a basic overview, we will use the pumpkin dataset, which is short and artificial. The following code loads it and prints its full contents:

> pumpkins <- read.csv('messy_pumpkins.txt', stringsAsFactors = FALSE)
> pumpkins
      weight      location
1        2.3        europe
2      2.4kg       Europee
3     3.1 kg           USA
4 2700 grams United States
5         24          U.S.
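
Before fixing anything, it helps to confirm what kinds of inconsistency the printout above contains. The following is a minimal sketch, not the book's own code: it rebuilds the five printed rows inline as a stand-in for messy_pumpkins.txt (an assumption, since the file itself is not shown), and the name non_numeric is introduced here purely for illustration.

```r
# Assumption: this inline data frame stands in for messy_pumpkins.txt,
# reconstructed from the printed output above.
pumpkins <- data.frame(
  weight   = c("2.3", "2.4kg", "3.1 kg", "2700 grams", "24"),
  location = c("europe", "Europee", "USA", "United States", "U.S."),
  stringsAsFactors = FALSE
)

# Both columns are character vectors, because weight mixes numbers and units.
str(pumpkins)

# List the distinct spellings used in the location column.
unique(pumpkins$location)

# Flag weights that are not bare numbers (hypothetical helper vector).
non_numeric <- !grepl("^[0-9.]+$", pumpkins$weight)
pumpkins$weight[non_numeric]
```

The flagged weights are the ones carrying a unit suffix. The value 24 passes this check even though it looks implausible next to the others, which is a reminder that a pattern check does not replace knowing how the variable was recorded.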

Tip

When loading data frames, versions of R before 4.0.0 treat strings as categorical factors rather than as literal strings by default; since R 4.0.0, the default is stringsAsFactors = FALSE. Treating strings as factors is usually the desired behavior for a dataset with consistently denoted factors but a problem if the same factors have been...
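
The difference the tip describes can be seen with a small sketch. It assumes only the location values printed earlier; the names locations, as_factor, and as_string are hypothetical and chosen for this illustration.

```r
# Location values copied from the printed pumpkin dataset.
locations <- c("europe", "Europee", "USA", "United States", "U.S.")

# Build the same column twice: once as a factor, once as plain strings.
as_factor <- data.frame(location = locations, stringsAsFactors = TRUE)
as_string <- data.frame(location = locations, stringsAsFactors = FALSE)

class(as_factor$location)    # factor
nlevels(as_factor$location)  # every distinct spelling becomes its own level
class(as_string$location)    # character
```

With inconsistent spellings, the factor version records each spelling as a separate category, whereas the character version leaves the values as plain strings that can be edited before any conversion to a factor.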