Data Manipulation with R - Second Edition

By: Jaynal Abedin, Kishor Kumar Das

Overview of this book

This book starts with the installation of R and how to use R and its libraries. We then discuss the modes and classes of R objects and highlight the different R data types along with their basic operations.

The primary focus is group-wise data manipulation with the split-apply-combine strategy, which is explained with specific examples. The book also covers specific libraries such as lubridate, reshape2, plyr, dplyr, stringr, and sqldf. You will not only learn about group-wise data manipulation, but also learn how to efficiently handle date, string, and factor variables along with different layouts of datasets using the reshape2 package.

By the end of this book, you will have learned about text manipulation using stringr, how to extract data from Twitter using the twitteR library, how to clean raw data, and how to structure your raw data for data mining.

Factor and its types


A factor is another important data type in R, especially when we deal with categorical variables. An ordinary R vector can hold any number of distinct values, but a factor variable takes only a limited set of distinct values, known as its levels. This type of variable is usually referred to as a categorical variable in data analysis and statistical modeling. Statistical models treat numeric and categorical variables differently, so it is important to store the data correctly to ensure a valid analysis.

In R, a factor variable stores integer codes internally and uses a separate set of character labels (the levels) to display the contents of that variable. In other software, such as Stata, the internal numeric values are known as values, and the character set is known as value labels. Previously, we saw that the mode of a factor variable is numeric; this is due to the internal values of the factor variable.
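The relationship between internal values and displayed labels can be sketched as follows. The blood vector is hypothetical data made up for this illustration; as.integer() and levels() are standard base R functions.

```r
# Hypothetical example data, not taken from the book
blood <- factor(c("A", "B", "O", "A", "AB", "O"))

# Internal integer codes: positions in the vector of levels
codes <- as.integer(blood)

# Character labels used when the factor is displayed
labs <- levels(blood)

# Indexing the labels by the codes reproduces the displayed values
rebuilt <- labs[codes]

print(codes)
print(labs)
print(rebuilt)
```

Each distinct label is stored once in the levels, while the factor itself only holds the small integer codes that point into them.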

A factor variable can be created using the factor() function; the only required input is a vector of values, which will be returned as a vector of factor values. The input can be numeric or character, but the levels of a factor are always stored as character. The following example shows how to create factor variables:

#creating factor variable with only one argument with factor() 
> factor1 <- factor(c(1,2,3,4,5,6,7,8,9))
> factor1
[1] 1 2 3 4 5 6 7 8 9
Levels: 1 2 3 4 5 6 7 8 9
> levels(factor1)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9"
# labels() returns element names or positional indices, not the factor levels
> labels(factor1)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9"

#creating factor with user given levels to display
> factor2 <- factor(c(1,2,3,4,5,6,7,8,9),labels=letters[1:9])
> factor2
[1] a b c d e f g h i
Levels: a b c d e f g h i
> levels(factor2)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i"
> labels(factor2)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9"

In a factor variable, the values themselves are stored as an integer vector of codes, whereas the levels are stored as a character vector in which each unique label appears only once. A factor can be declared ordered by passing the ordered=TRUE argument to factor(); the order then follows the levels argument if it is given, or the sorted unique values otherwise.
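As a minimal sketch of the ordering behavior described above, the following compares a default factor with an ordered one. The sizes vector and its level order are assumptions chosen purely for illustration.

```r
# Hypothetical values, used only for illustration
sizes <- c("medium", "small", "large", "small")

# Without explicit levels, the levels are the sorted unique values
f.default <- factor(sizes)

# With explicit levels and ordered = TRUE, the given order is meaningful
f.ordered <- factor(sizes,
                    levels = c("small", "medium", "large"),
                    ordered = TRUE)

print(levels(f.default))
print(levels(f.ordered))

# Comparisons are only meaningful for ordered factors
is.below.large <- f.ordered < "large"
print(is.below.large)
```

For an unordered factor, the same comparison would return NA with a warning, which is one practical reason to declare the order explicitly when the categories have a natural ranking.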

A factor can be built from numeric values, so that its levels look numeric, but direct mathematical operations are not possible on such a factor. Special care should be taken if we want to use mathematical operations.

The following example shows a numeric factor and its mathematical operation:

# creating numeric factor and trying to find out mean
> num.factor <- factor(c(5,7,9,5,6,7,3,5,3,9,7))
> num.factor
[1] 5 7 9 5 6 7 3 5 3 9 7
Levels: 3 5 6 7 9
> mean(num.factor)
[1] NA
Warning message:
In mean.default(num.factor) :
argument is not numeric or logical: returning NA

From the preceding example, we see that we can create a numeric factor, but mathematical operations on it are not possible. When we tried to compute the mean, R returned a warning message and produced the result NA. To perform any mathematical operation, we need to convert the factor to its numeric counterpart. One might assume that we can simply convert the factor using the as.numeric() function, but as.numeric() converts only the internal codes of the factor, not the values we actually want.

So, the conversion must be done with levels of that factor variable; optionally, we can first convert the factor into a character using as.character() and then use as.numeric().

The following example describes this scenario:

> num.factor <- factor(c(5,7,9,5,6,7,3,5,3,9,7))
> num.factor
[1] 5 7 9 5 6 7 3 5 3 9 7
Levels: 3 5 6 7 9
#as.numeric() function only returns internal values of the factor
> as.numeric(num.factor)
[1] 2 4 5 2 3 4 1 2 1 5 4
# now see the levels of the factor
> levels(num.factor)
[1] "3" "5" "6" "7" "9"
> as.character(num.factor)
[1] "5" "7" "9" "5" "6" "7" "3" "5" "3" "9" "7"

# now to convert "num.factor" to numeric there are two methods
# method-1:
> mean(as.numeric(as.character(num.factor)))
[1] 6

# method-2:
> mean(as.numeric(levels(num.factor)[num.factor]))
[1] 6

Data frame

A data frame is a rectangular arrangement of rows and columns of vectors and/or factors, like a spreadsheet in MS Excel. The columns represent variables in the data, and the rows represent observations or records. In other software, such as a database package, each column represents a field, and each row represents a record. Dealing with data rarely means dealing with a single vector or factor; it usually means dealing with a collection of variables. Each column holds only one type of data, such as numeric, character, logical, or factor, but different columns of the same data frame can hold different types. Each row holds the information for one case across all columns. One important thing to remember about R data frames is that all columns must be of the same length. To create a data frame, we can use the data.frame() command.

The following example shows us how to create a data frame using different vectors and factors:

#creating vector of different variables and then creating data frame
> var1 <- c(101,102,103,104,105)
> var2 <- c(25,22,29,34,33)
> var3 <- c("Non-Diabetic", "Diabetic", "Non-Diabetic", "Non-Diabetic", "Diabetic")
> var4 <- factor(c("male","male","female","female","male"))
# now we will create data frame using two numeric vectors one 
# character vector and one factor
> diab.dat <- data.frame(var1,var2,var3,var4)
> diab.dat
  var1 var2         var3   var4
1  101   25 Non-Diabetic   male
2  102   22     Diabetic   male
3  103   29 Non-Diabetic female
4  104   34 Non-Diabetic female
5  105   33     Diabetic   male

Now, if we look at the class of individual columns of the newly created data frame, we will see that the first two columns are numeric and the last two are factor, even though var3 was originally a character vector. This happens because, in R versions before 4.0.0, data.frame() converts character columns to factors by default; from R 4.0.0 onward, the default is stringsAsFactors=FALSE and var3 stays character. In either case, the stringsAsFactors argument lets us control this conversion explicitly, and setting stringsAsFactors=FALSE prevents it.

In the following example, we will see this:

#class of each column before creating data frame 
> class(var1)
[1] "numeric"
> class(var2)
[1] "numeric"
> class(var3)
[1] "character"
> class(var4)
[1] "factor"
# class of each column after creating data frame
> class(diab.dat$var1)
[1] "numeric"
> class(diab.dat$var2)
[1] "numeric"
> class(diab.dat$var3)
[1] "factor"
> class(diab.dat$var4)
[1] "factor"
# now create the data frame specifying stringsAsFactors=FALSE
> diab.dat.2 <- data.frame(var1,var2,var3,var4,stringsAsFactors=FALSE)
> diab.dat.2
  var1 var2         var3   var4
1  101   25 Non-Diabetic   male
2  102   22     Diabetic   male
3  103   29 Non-Diabetic female
4  104   34 Non-Diabetic female
5  105   33     Diabetic   male

> class(diab.dat.2$var3)
[1] "character"

To access individual columns (variables) from a data frame, we can use a dollar ($) sign, along with the data frame name–for example, diab.dat$var1.

There are some other ways to access variables from a data frame, such as the following:

  • The data frame name followed by double square brackets with variable names within quotation marks–for example, diab.dat[["var1"]]

  • The data frame name followed by single square brackets with the column index–for example, diab.dat[,1]
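The access methods listed above can be compared side by side in a short sketch. The small dat data frame is a cut-down, hypothetical stand-in for diab.dat; the last line, which uses single brackets with a column name, is an addition not covered in the text.

```r
# Small stand-in data frame (hypothetical, modeled on diab.dat)
dat <- data.frame(var1 = c(101, 102, 103), var2 = c(25, 22, 29))

by.dollar <- dat$var1        # dollar sign with the column name
by.name   <- dat[["var1"]]   # double brackets with a quoted name
by.index  <- dat[, 1]        # single brackets with the column index

# All three return the same plain numeric vector
print(identical(by.dollar, by.name))
print(identical(by.name, by.index))

# Single brackets with only a name keep the data frame structure
sub.df <- dat["var1"]
print(class(sub.df))
```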

Besides these, the attach() function places a data frame on R's search path so that each of its columns can be accessed by name as if it were a separate R object. After using attach(), we should call detach() to remove the data frame from the search path again.

Let's have a look at the following code:

# To run the following code snippet, var1 to var4 and the
# diab.dat.2 data frame from the earlier examples must
# already exist in the workspace.

# The following line will remove the var1 to var4
# objects from the workspace
> rm(var1, var2, var3, var4)
# The following command will allow
# us to access individual variables
> attach(diab.dat.2)
# Printing values of var1
> var1
# Checking class of var3
> class(var3)
# Now detach the data frame from the search path
> detach(diab.dat.2)
# Now if we try to print an individual variable, it will give an error
> var1
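The effect of attach() and detach() can also be checked programmatically, as in the sketch below. The demo.dat data frame and its column names are hypothetical and chosen to avoid clashes with existing objects. The final line uses with(), a base R function not discussed in the text, as one possible way to reach columns by name without attaching anything.

```r
# Hypothetical data frame; column names chosen to avoid name clashes
demo.dat <- data.frame(id.demo = c(1, 2, 3), age.demo = c(25, 22, 29))

attach(demo.dat)
found.attached <- exists("age.demo")   # column is visible by name
detach(demo.dat)
found.detached <- exists("age.demo")   # no longer visible after detach()

print(found.attached)
print(found.detached)

# with() evaluates an expression using the columns, without attaching
mean.age <- with(demo.dat, mean(age.demo))
print(mean.age)
```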

Matrices

A matrix is also a two-dimensional arrangement of data, but all of its elements must be of a single type, whereas a data frame can hold numeric, character, and factor columns side by side. To perform mathematical operations, especially matrix operations such as matrix multiplication, we need a numeric matrix; a data frame cannot be used directly for them. To create a matrix, we can use the matrix() command or convert a numeric data frame to a matrix using as.matrix().

We can convert the diab.dat data frame that we created earlier to a matrix using as.matrix(). However, because it contains non-numeric columns, every element is converted to character, so the result is not suitable for mathematical operations, as shown in the following example:

# data frame to matrix conversion
> mat.diab <- as.matrix(diab.dat)
> mat.diab
     var1  var2 var3           var4    
[1,] "101" "25" "Non-Diabetic" "male"  
[2,] "102" "22" "Diabetic"     "male"  
[3,] "103" "29" "Non-Diabetic" "female"
[4,] "104" "34" "Non-Diabetic" "female"
[5,] "105" "33" "Diabetic"     "male"

# in R 4.0.0 and later, class() of a matrix prints "matrix" "array"
> class(mat.diab)
[1] "matrix"
> mode(mat.diab)
[1] "character"

# matrix multiplication is not possible with this newly created matrix

> t(mat.diab) %*% mat.diab
Error in t(mat.diab) %*% mat.diab : 
requires numeric/complex matrix/vector arguments

# creating a matrix with numeric elements only
# To produce the same matrix over time we set a seed value
> set.seed(12345) 
> num.mat <- matrix(rnorm(9),nrow=3,ncol=3)
> num.mat
           [,1]       [,2]       [,3]
[1,]  0.5855288 -0.4534972  0.6300986
[2,]  0.7094660  0.6058875 -0.2761841
[3,] -0.1093033 -1.8179560 -0.2841597

# in R 4.0.0 and later, class() of a matrix prints "matrix" "array"
> class(num.mat)
[1] "matrix"
> mode(num.mat)
[1] "numeric"

# matrix multiplication
> t(num.mat) %*% num.mat
          [,1]       [,2]       [,3]
[1,] 0.8581332 0.36302951 0.20405722
[2,] 0.3630295 3.87772320 0.06350551
[3,] 0.2040572 0.06350551 0.55404860
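When a data frame mixes numeric and non-numeric columns, one possible workaround is to keep only the numeric columns before calling as.matrix(). The following is a minimal sketch under that assumption; the dat data frame reuses the var1, var2, and var4 values from the earlier example, and the selection via sapply() is one choice among several.

```r
# Recreate part of diab.dat (values taken from the earlier example)
var1 <- c(101, 102, 103, 104, 105)
var2 <- c(25, 22, 29, 34, 33)
var4 <- factor(c("male", "male", "female", "female", "male"))
dat <- data.frame(var1, var2, var4)

# Flag the numeric columns, then convert only those to a matrix
is.num <- sapply(dat, is.numeric)
num.part <- as.matrix(dat[, is.num])

print(mode(num.part))

# Matrix multiplication now works because every element is numeric
xtx <- t(num.part) %*% num.part
print(xtx)
```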

Arrays

An array is a multiply subscripted collection of data entries of a single type. Data frames and matrices have exactly two dimensions, but an array can have any number of dimensions. Sometimes we need to store several matrices of the same size in a single object; in this case, we can stack them in an array.

Here is a simple example to store three matrices of order 2 x 2 in a single array object:

> mat.array <- array(dim=c(2,2,3))

# To produce the same results over time we set a seed value
> set.seed(12345)

> mat.array[,,1]<-rnorm(4)
> mat.array[,,2]<-rnorm(4)
> mat.array[,,3]<-rnorm(4)

> mat.array
, , 1

          [,1]       [,2]
[1,] 0.5855288 -0.1093033
[2,] 0.7094660 -0.4534972

, , 2

           [,1]       [,2]
[1,]  0.6058875  0.6300986
[2,] -1.8179560 -0.2761841

, , 3

           [,1]       [,2]
[1,] -0.2841597 -0.1162478
[2,] -0.9193220  1.8173120
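Individual matrices and elements can be pulled back out of an array with the same bracket notation. The sketch below uses a hypothetical demo.array filled with the integers 1 to 12 instead of random numbers, so that the positions are easy to follow.

```r
# A 2 x 2 x 3 array filled column by column with 1 to 12 (illustrative values)
demo.array <- array(1:12, dim = c(2, 2, 3))

print(dim(demo.array))

# The second 2 x 2 matrix stored in the array
second.slice <- demo.array[, , 2]
print(second.slice)

# Row 1, column 2 of the third matrix
one.value <- demo.array[1, 2, 3]
print(one.value)

# Apply a function over the third dimension: the sum of each matrix
slice.sums <- apply(demo.array, 3, sum)
print(slice.sums)
```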

List

A list object is a generic R object that can store other objects of any type. In a list object, we can store single constants, vectors of numeric values, factors, data frames, matrices, and even arrays.

Recalling the var1, var2, var3, and var4 vectors, the data frame created using these vectors, and also recalling the array created in the Arrays section, we will create a list object in the following example:

> var1 <- c(101,102,103,104,105)
> var2 <- c(25,22,29,34,33)
> var3 <- c("Non-Diabetic", "Diabetic", "Non-Diabetic", "Non-Diabetic", "Diabetic")
> var4 <- factor(c("male","male","female","female","male"))
> diab.dat <- data.frame(var1,var2,var3,var4)

> mat.array<-array(dim=c(2,2,3))

> set.seed(12345)

> mat.array[,,1]<-rnorm(4)
> mat.array[,,2]<-rnorm(4)
> mat.array[,,3]<-rnorm(4)

# creating list
> obj.list <- list(elem1=var1,elem2=var2,elem3=var3,elem4=var4,elem5=diab.dat,elem6=mat.array) 


> obj.list
$elem1
[1] 101 102 103 104 105

$elem2
[1] 25 22 29 34 33

$elem3
[1] "Non-Diabetic" "Diabetic"     "Non-Diabetic" "Non-Diabetic" "Diabetic"    

$elem4
[1] male   male   female female male  
Levels: female male

$elem5
  var1 var2         var3   var4
1  101   25 Non-Diabetic   male
2  102   22     Diabetic   male
3  103   29 Non-Diabetic female
4  104   34 Non-Diabetic female
5  105   33     Diabetic   male

$elem6
, , 1

          [,1]       [,2]
[1,] 0.5855288 -0.1093033
[2,] 0.7094660 -0.4534972

, , 2

           [,1]       [,2]
[1,]  0.6058875  0.6300986
[2,] -1.8179560 -0.2761841

, , 3

           [,1]       [,2]
[1,] -0.2841597 -0.1162478
[2,] -0.9193220  1.8173120

To access individual elements from a list object, we can use the name of that element or use double square brackets with the index of those elements. For example, obj.list[[1]] will give the first element of the newly created list object.
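The different ways of accessing list elements can be sketched as follows. The demo.list object is a small hypothetical list used for illustration; the single-bracket form in the last step is an addition not covered in the text.

```r
# Small hypothetical list with two named elements
demo.list <- list(elem1 = c(101, 102, 103), elem2 = c("a", "b"))

first.by.index  <- demo.list[[1]]        # double brackets with the index
first.by.name   <- demo.list$elem1       # dollar sign with the element name
first.by.string <- demo.list[["elem1"]]  # double brackets with a quoted name

print(first.by.index)
print(identical(first.by.index, first.by.name))
print(identical(first.by.name, first.by.string))

# Single brackets return a list of length one, not the element itself
sub.list <- demo.list[1]
print(class(sub.list))
```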