This recipe discusses how to use built-in R functions to manipulate data. Because data manipulation is often the most time-consuming part of an analysis, it is worth learning how to apply these functions to your data.
Ensure you have completed the previous recipes by installing R on your operating system.
Perform the following steps to manipulate the data with R.
Subset the data using the bracket notation:
- Load the dataset iris into the R session:
> data(iris)
- To select values, you may use a bracket notation that designates the indices of the dataset. The first index is for the rows and the second for the columns:
> iris[1, "Sepal.Length"]
Output:
[1] 5.1
- You can also select multiple columns using c():
> Sepal.iris = iris[, c("Sepal.Length", "Sepal.Width")]
- You can then use str() to summarize and display the internal structure of Sepal.iris:
> str(Sepal.iris)
Output:
'data.frame': 150 obs. of 2 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
- To select specific rows, specify their indices in the first position of the bracket notation. This example selects the first five records of the Sepal.Length and Sepal.Width columns:
> Five.Sepal.iris = iris[1:5, c("Sepal.Length", "Sepal.Width")]
> str(Five.Sepal.iris)
Output:
'data.frame': 5 obs. of 2 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6
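The index forms described so far can be sketched as follows. This is only an illustration: the variable names are made up for the example, and the drop = FALSE argument is an addition that the recipe itself does not cover.

```r
data(iris)

# Columns addressed by position and by name select the same data.
by.position <- iris[1:5, 1:2]
by.name <- iris[1:5, c("Sepal.Length", "Sepal.Width")]

# A single column is simplified to a numeric vector by default...
sepal.vector <- iris[1:5, "Sepal.Length"]

# ...while drop = FALSE keeps the result as a one-column data frame.
sepal.frame <- iris[1:5, "Sepal.Length", drop = FALSE]

str(sepal.vector)
str(sepal.frame)
```

Keeping the data frame structure with drop = FALSE can be convenient when later steps expect a DataFrame rather than a vector.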
- It is also possible to set conditions to filter the data. The following example returns the setosa records with all five variables. The first index specifies the filtering criterion for the rows, and the second index specifies the range of column indices to return:
> setosa.data = iris[iris$Species=="setosa",1:5]
> str(setosa.data)
Output:
'data.frame': 50 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
- Alternatively, the which function returns the indices of the elements that satisfy a condition. The following example returns the indices of the iris rows whose species equals setosa:
> which(iris$Species=="setosa")
Output:
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50
- The indices returned by which can then be used as the row index to select the setosa records from iris. The following example returns the setosa records with all five variables:
> setosa.data = iris[which(iris$Species=="setosa"),1:5]
> str(setosa.data)
Output:
'data.frame': 50 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
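To compare the two filtering styles side by side, here is a minimal sketch. The vector x is a made-up example, not part of iris, added only to show how the two forms may differ when the data contains missing values.

```r
data(iris)

is.setosa <- iris$Species == "setosa"   # logical vector, one value per row
by.logical <- iris[is.setosa, ]         # rows where the condition is TRUE
by.which <- iris[which(is.setosa), ]    # rows at the positions of the TRUE values

# Both forms select the same 50 rows here, because iris has no missing values.
nrow(by.logical)
nrow(by.which)

# Hypothetical vector with a missing value (an assumption for illustration).
x <- c(1, NA, 3)
x[x > 1]          # the NA comparison leaves an NA element in the result
x[which(x > 1)]   # which() skips the NA, so only 3 is returned
```

For a complete dataset such as iris the choice is mostly a matter of style; with missing values, which may be the safer option if NA rows should be left out.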
Subset data using the subset function:
- Besides the bracket notation, R provides a subset function that enables users to subset a DataFrame by observations with a logical statement.
- First, select the sepal length and sepal width columns from the iris data by specifying the columns to keep in the select argument:
> Sepal.data = subset(iris, select=c("Sepal.Length", "Sepal.Width"))
> str(Sepal.data)
Output:
'data.frame': 150 obs. of 2 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
This reveals that Sepal.data contains 150 observations of the Sepal.Length and Sepal.Width variables.
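The select argument also accepts a few shorthand forms. The sketch below shows three of them; the variable names are illustrative, and these variants go slightly beyond what the recipe demonstrates.

```r
data(iris)

# Unquoted column names are accepted in select.
sepal.unquoted <- subset(iris, select = c(Sepal.Length, Sepal.Width))

# A range of adjacent columns can be written with ":".
sepal.range <- subset(iris, select = Sepal.Length:Sepal.Width)

# A leading minus sign drops a column instead of keeping it.
no.species <- subset(iris, select = -Species)

names(sepal.range)
names(no.species)
```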
- You can also use the subset argument to keep only the setosa records. The second argument of the subset function specifies the subset criteria:
> setosa.data = subset(iris, Species =="setosa")
> str(setosa.data)
Output:
'data.frame': 50 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
- Often, you will want to combine several conditions when subsetting data, taking either their union or their intersection. The OR (|) and AND (&) operators serve this purpose. For example, to retrieve the species of the records with Petal.Width >= 0.2 and Petal.Length <= 1.4:
> example.data = subset(iris, Petal.Length <= 1.4 & Petal.Width >= 0.2, select=Species)
> str(example.data)
Output:
'data.frame': 21 obs. of 1 variable:
 $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
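The difference between the intersection and the union of two conditions can be sketched as follows. The variable names are illustrative, and the bracket-notation line is included only to show an equivalent form of the AND filter.

```r
data(iris)

# AND (&): both conditions must hold.
both <- subset(iris, Petal.Length <= 1.4 & Petal.Width >= 0.2)

# OR (|): at least one of the conditions must hold.
either <- subset(iris, Petal.Length <= 1.4 | Petal.Width >= 0.2)

# Bracket-notation equivalent of the AND filter.
both.bracket <- iris[iris$Petal.Length <= 1.4 & iris$Petal.Width >= 0.2, ]

nrow(both)     # 21, the same count as in the recipe output
nrow(either)   # never smaller than nrow(both)
```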
- Merging data: Merging joins two DataFrames into a single DataFrame by a common column or row name. The following example merges the flower.type DataFrame with the first three rows of iris by their common Species column:
> flower.type = data.frame(Species = "setosa", Flower = "iris")
> merge(flower.type, iris[1:3,], by ="Species")
Output:
  Species Flower Sepal.Length Sepal.Width Petal.Length Petal.Width
1  setosa   iris          5.1         3.5          1.4         0.2
2  setosa   iris          4.9         3.0          1.4         0.2
3  setosa   iris          4.7         3.2          1.3         0.2
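By default, merge keeps only the rows whose key appears in both DataFrames. The sketch below reuses flower.type from the recipe against the full iris data to make that visible; the all.y argument is an addition that the recipe does not discuss, and the result names are illustrative.

```r
data(iris)

flower.type <- data.frame(Species = "setosa", Flower = "iris")

# Default merge: only rows whose Species appears in both data frames are kept.
inner <- merge(flower.type, iris, by = "Species")

# all.y = TRUE keeps every row of iris; unmatched rows get NA in Flower.
keep.all <- merge(flower.type, iris, by = "Species", all.y = TRUE)

nrow(inner)                    # only the setosa rows
nrow(keep.all)                 # all rows of iris
sum(is.na(keep.all$Flower))    # rows with no match in flower.type
```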
- Ordering data: The order function returns the row indices that sort a DataFrame by a specified column. The following example shows the first six records of the iris data ordered by sepal length, from largest to smallest:
> head(iris[order(iris$Sepal.Length, decreasing = TRUE),])
Output:
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
132          7.9         3.8          6.4         2.0 virginica
118          7.7         3.8          6.7         2.2 virginica
119          7.7         2.6          6.9         2.3 virginica
123          7.7         2.8          6.7         2.0 virginica
136          7.7         3.0          6.1         2.3 virginica
106          7.6         3.0          6.6         2.1 virginica
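Because order returns positions rather than sorted values, it helps to look at it on a small vector first. The following sketch uses a made-up vector and illustrative variable names; the tie-breaking by a second column is an addition to the recipe.

```r
data(iris)

# order() returns the positions that would sort the input, not the values.
order(c(30, 10, 20))   # 2 3 1

# Ascending sort by sepal length, with ties broken by sepal width.
ascending <- iris[order(iris$Sepal.Length, iris$Sepal.Width), ]

# Descending sort by sepal length, as in the recipe.
descending <- iris[order(iris$Sepal.Length, decreasing = TRUE), ]

head(ascending, 3)
head(descending, 3)
```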
Before conducting data analysis, it is important to organize collected data into a structured format. The R DataFrame lets us subset, merge, and order a dataset. This recipe first introduces two methods to subset data: one uses the bracket notation, while the other uses the subset function. You can use either method to generate a subset by selecting columns and filtering rows with given criteria. The recipe then introduces the merge function to merge DataFrames. Finally, the recipe shows how to use order to sort the data.
The sub and gsub functions use regular expressions to substitute text in a string. sub replaces only the first match in each string, whereas gsub replaces all matches:
> sub("e", "q", names(iris))
Output:
[1] "Sqpal.Length" "Sqpal.Width" "Pqtal.Length" "Pqtal.Width" "Spqcies"
> gsub("e", "q", names(iris))
Output:
[1] "Sqpal.Lqngth" "Sqpal.Width" "Pqtal.Lqngth" "Pqtal.Width" "Spqciqs"
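Since the pattern is a regular expression, some characters carry special meaning. The sketch below shows a few cases that may be useful when cleaning column names; the variable names and the replacement strings are illustrative choices, not part of the recipe.

```r
data(iris)

# An unescaped "." is a regex wildcard, so escape it to match a literal dot.
underscored <- gsub("\\.", "_", names(iris))

# fixed = TRUE treats the pattern as plain text, so no escaping is needed.
underscored.fixed <- gsub(".", "_", names(iris), fixed = TRUE)

# "^" anchors the match to the start of the string.
shortened <- sub("^Sepal", "S", names(iris))

underscored
shortened
```

Here underscored and underscored.fixed hold the same result; the fixed form is simply an alternative when no regular expression features are needed.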