Mastering Scientific Computing with R

The first step in any data analysis is preparing the data. Most of the rest of this chapter deals with this topic, but here we will review some basic considerations and R techniques. The most important part of any data analysis is knowing the dataset and having some idea of how each of its variables was created.

For a basic overview, we will use the pumpkin dataset, which is short and artificial. Take a look at the data it contains:

> pumpkins <- read.csv('messy_pumpkins.txt', stringsAsFactors = FALSE)
> pumpkins
      weight      location
1        2.3        europe
2      2.4kg       Europee
3     3.1 kg           USA
4 2700 grams United States
5         24          U.S.
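Before cleaning anything, it helps to see how the mess shows up in R's own terms. The following is a minimal sketch rather than the book's own code: it rebuilds the five rows shown above inline (so it does not depend on messy_pumpkins.txt being available), and the name naive_weight is introduced here purely for illustration.

```r
# Rebuild the data frame shown above inline so the sketch does not depend on
# the file 'messy_pumpkins.txt' (assumption: values copied from the listing).
pumpkins <- data.frame(
  weight   = c("2.3", "2.4kg", "3.1 kg", "2700 grams", "24"),
  location = c("europe", "Europee", "USA", "United States", "U.S."),
  stringsAsFactors = FALSE
)

# Both columns are character vectors; the unit suffixes keep 'weight'
# from being a numeric column.
str(pumpkins)

# A naive numeric conversion turns every entry with a unit suffix into NA.
# Warnings are suppressed only to keep the output short.
naive_weight <- suppressWarnings(as.numeric(pumpkins$weight))
naive_weight

# Count how many distinct spellings the location column contains.
unique(pumpkins$location)
```

Only the first and last weights survive the naive conversion, and the last one (24) is suspicious next to the others, which is the kind of issue that knowing how each variable was recorded helps to resolve.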

Tip

When loading data frames, versions of R before 4.0.0 treat strings as categorical factors by default rather than as literal strings (since R 4.0.0, the default is stringsAsFactors = FALSE). Treating strings as factors is usually the desired behavior for a dataset with consistently denoted factors but a problem if the same factors have been...
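As a rough illustration of the difference between the two settings, the sketch below builds the same location values both ways. The names locations, as_strings, and as_factors are hypothetical and used only for this example; stringsAsFactors is set explicitly in both calls so the result does not depend on the R version.

```r
# Location values copied from the pumpkin listing (illustrative only).
locations <- c("europe", "Europee", "USA", "United States", "U.S.")

# Same data, once kept as literal strings and once converted to a factor.
as_strings <- data.frame(location = locations, stringsAsFactors = FALSE)
as_factors <- data.frame(location = locations, stringsAsFactors = TRUE)

class(as_strings$location)  # "character"
class(as_factors$location)  # "factor"

# Every distinct spelling becomes its own factor level, so inconsistent
# labels are treated as separate categories.
levels(as_factors$location)
nlevels(as_factors$location)
```

Keeping the column as character strings makes it easier to correct the spellings first and convert to a factor afterwards with factor() or as.factor().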
