A factor is another important data type in R, especially when we deal with categorical variables. An ordinary R vector places no limit on the number of distinct elements, whereas a factor takes only a limited, predefined set of distinct values. This type of variable is usually referred to as a categorical variable in data analysis and statistical modeling. In statistical modeling, numeric and categorical variables behave differently, so it is important to store the data correctly to ensure a valid statistical analysis.
In R, a factor variable stores distinct integer codes internally and uses a separate set of character labels, called levels, to display the contents of that variable. In other software, such as Stata, the internal numeric values are known as values, and the character labels are known as value labels. Previously, we saw that the mode of a factor variable is numeric; this is due to the internal values of the factor variable.
A factor variable can be created using the factor() function; the only required input is a vector of values, which will be returned as a vector of factor values. The input can be numeric or character, but the levels of the factor will always be character. The following example shows how to create factor variables:
# creating a factor variable with only one argument to factor()
> factor1 <- factor(c(1,2,3,4,5,6,7,8,9))
> factor1
[1] 1 2 3 4 5 6 7 8 9
Levels: 1 2 3 4 5 6 7 8 9
> levels(factor1)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9"
# labels() returns the element positions as character strings, not the factor levels
> labels(factor1)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9"
# creating a factor with user-supplied labels to display
> factor2 <- factor(c(1,2,3,4,5,6,7,8,9),labels=letters[1:9])
> factor2
[1] a b c d e f g h i
Levels: a b c d e f g h i
> levels(factor2)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i"
> labels(factor2)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9"
In a factor variable, the values themselves are stored as a vector of integer codes, whereas the levels are stored as a character vector in which each unique label appears only once. A factor is treated as ordered if the ordered=TRUE argument is specified; otherwise, it is unordered, and its levels simply follow the order in which they were specified (or sorted order if no levels were given).
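The difference between an unordered and an ordered factor can be sketched as follows. This is a minimal illustration, not taken from the original text; the severity vector and the object names are hypothetical.

```r
# Hypothetical example data: severity ratings recorded as text
severity <- c("low", "high", "medium", "low", "high")

# Unordered factor: with no levels given, the levels are the sorted unique values
sev.factor <- factor(severity)
levels(sev.factor)       # "high" "low" "medium"
as.integer(sev.factor)   # internal codes: 2 1 3 2 1

# Ordered factor: the level order is stated explicitly and ordered = TRUE is set
sev.ordered <- factor(severity,
                      levels = c("low", "medium", "high"),
                      ordered = TRUE)
is.ordered(sev.ordered)  # TRUE

# Comparisons are meaningful only for ordered factors
sev.ordered < "high"     # TRUE FALSE TRUE TRUE FALSE
```

Stating the levels explicitly is what fixes the ordering; without the levels argument, the ordered factor would use the sorted order instead, which is rarely the intended ranking.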
A factor can be created from numeric values, so that its levels look like numbers, but direct mathematical operations are not possible on such a numeric factor. Special care should be taken if we want to use mathematical operations.
The following example shows a numeric factor and its mathematical operation:
# creating a numeric factor and trying to find its mean
> num.factor <- factor(c(5,7,9,5,6,7,3,5,3,9,7))
> num.factor
[1] 5 7 9 5 6 7 3 5 3 9 7
Levels: 3 5 6 7 9
> mean(num.factor)
[1] NA
Warning message:
In mean.default(num.factor) :
  argument is not numeric or logical: returning NA
From the preceding example, we see that we can create a numeric factor, but mathematical operations on it are not possible. When we tried to compute the mean, R returned a warning message and produced the result NA. To perform any mathematical operation, we need to convert the factor to its numeric counterpart. It is tempting to assume that we can convert the factor using the as.numeric() function directly, but as.numeric() converts only the internal codes of the factor, not the values we actually want.
So, the conversion must be done through the levels of the factor variable; alternatively, we can first convert the factor into a character vector using as.character() and then use as.numeric().
The following example describes this scenario:
> num.factor <- factor(c(5,7,9,5,6,7,3,5,3,9,7))
> num.factor
[1] 5 7 9 5 6 7 3 5 3 9 7
Levels: 3 5 6 7 9
# as.numeric() only returns the internal values of the factor
> as.numeric(num.factor)
[1] 2 4 5 2 3 4 1 2 1 5 4
# now see the levels of the factor
> levels(num.factor)
[1] "3" "5" "6" "7" "9"
> as.character(num.factor)
[1] "5" "7" "9" "5" "6" "7" "3" "5" "3" "9" "7"
# to convert "num.factor" to numeric, there are two methods
# method 1:
> mean(as.numeric(as.character(num.factor)))
[1] 6
# method 2:
> mean(as.numeric(levels(num.factor)[num.factor]))
[1] 6
A data frame is a rectangular arrangement of rows and columns of vectors and/or factors, much like a spreadsheet in MS Excel. The columns represent variables in the data, and the rows represent observations or records; in database software, each column corresponds to a field, and each row corresponds to a record. Dealing with data rarely means dealing with only one vector or factor variable; it usually means working with a collection of variables. Each column holds only one type of data, such as numeric, character, logical, or factor, although different columns can hold different types. Each row holds the information for one case across all columns. One important thing to remember about R data frames is that all columns must be of the same length. To create a data frame, we can use the data.frame() function.
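The equal-length requirement can be checked with a short sketch. The vector names here are hypothetical and used only for illustration; note that data.frame() does recycle a shorter atomic vector when its length divides the longer one a whole number of times, so the failing case below deliberately uses lengths 3 and 2.

```r
# Hypothetical vectors of equal length
ids  <- c(101, 102, 103)
ages <- c(25, 22, 29)

# Equal lengths: the data frame is created with 3 rows and 2 columns
ok.dat <- data.frame(ids, ages)
dim(ok.dat)   # 3 2

# Lengths 3 and 2 cannot be reconciled, so data.frame() stops with an error;
# tryCatch() captures the error so that the script keeps running
res <- tryCatch(
  data.frame(ids, ages = c(25, 22)),
  error = function(e) "error"
)
res           # "error"
```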
The following example shows us how to create a data frame using different vectors and factors:
# creating vectors of different variables and then creating a data frame
> var1 <- c(101,102,103,104,105)
> var2 <- c(25,22,29,34,33)
> var3 <- c("Non-Diabetic", "Diabetic", "Non-Diabetic", "Non-Diabetic", "Diabetic")
> var4 <- factor(c("male","male","female","female","male"))
# now we will create a data frame using two numeric vectors,
# one character vector, and one factor
> diab.dat <- data.frame(var1,var2,var3,var4)
> diab.dat
  var1 var2         var3   var4
1  101   25 Non-Diabetic   male
2  102   22     Diabetic   male
3  103   29 Non-Diabetic female
4  104   34 Non-Diabetic female
5  105   33     Diabetic   male
Now, if we look at the class of the individual columns of the newly created data frame, we will see that the first two columns are numeric and the last two columns are factor, even though the class of var3 was initially character. When we create a data frame and any column is of class character, it is automatically converted to factor; this was the default behavior in R versions before 4.0.0 (from R 4.0.0 onward, character columns are kept as character by default). The stringsAsFactors=FALSE argument allows us to prevent the automatic conversion of character to factor during data frame creation.
In the following example, we will see this:
# class of each column before creating the data frame
> class(var1)
[1] "numeric"
> class(var2)
[1] "numeric"
> class(var3)
[1] "character"
> class(var4)
[1] "factor"
# class of each column after creating the data frame
> class(diab.dat$var1)
[1] "numeric"
> class(diab.dat$var2)
[1] "numeric"
> class(diab.dat$var3)
[1] "factor"
> class(diab.dat$var4)
[1] "factor"
# now create the data frame specifying stringsAsFactors=FALSE
> diab.dat.2 <- data.frame(var1,var2,var3,var4,stringsAsFactors=FALSE)
> diab.dat.2
  var1 var2         var3   var4
1  101   25 Non-Diabetic   male
2  102   22     Diabetic   male
3  103   29 Non-Diabetic female
4  104   34 Non-Diabetic female
5  105   33     Diabetic   male
> class(diab.dat.2$var3)
[1] "character"
To access individual columns (variables) from a data frame, we can use a dollar ($) sign along with the data frame name, for example, diab.dat$var1.
There are some other ways to access variables from a data frame, such as the following:
The data frame name followed by double square brackets with the variable name within quotation marks, for example, diab.dat[["var1"]]
The data frame name followed by single square brackets with the column index, for example, diab.dat[,1]
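The three access styles can be compared side by side. The sketch below uses a small stand-in data frame named demo.dat (a hypothetical name, with values borrowed from the first rows of diab.dat) so that it runs on its own.

```r
# Small stand-in data frame; demo.dat is a hypothetical name for illustration
demo.dat <- data.frame(var1 = c(101, 102, 103),
                       var2 = c(25, 22, 29))

by.dollar <- demo.dat$var1        # dollar sign with the column name
by.name   <- demo.dat[["var1"]]   # double square brackets with a quoted name
by.index  <- demo.dat[, 1]        # single square brackets with the column index

# All three return the same numeric vector
identical(by.dollar, by.name)     # TRUE
identical(by.name, by.index)      # TRUE

# Single brackets with only a name (no comma) keep the data frame structure
class(demo.dat["var1"])           # "data.frame"
```

The last line is worth noting: single brackets without a comma return a one-column data frame rather than a vector, which matters when the result is passed to functions that expect a plain vector.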
Besides these, there is one other way that allows us to access each of the individual variables as if it were a separate object. The R attach() function adds the data frame to the search path so that its columns can be referred to directly by name. After using attach(), we need to call detach() to remove the data frame, and with it the directly accessible variables, from the search path.
Let's have a look at the following code:
# To run the following code snippet, the earlier code that creates
# var1, var2, var3, and var4 needs to be run first.
# After that, the "diab.dat.2" data frame should be created from them.
# The following line will remove the var1 to var4
# objects from the workspace
> rm(var1);rm(var2);rm(var3);rm(var4)
# The following command will allow
# us to access individual variables
> attach(diab.dat.2)
# Printing the values of var1
> var1
# Checking the class of var3
> class(var3)
# Now detach the data frame from the search path
> detach(diab.dat.2)
# Now, if we try to print an individual variable, it will give an error
> var1
A matrix is also a two-dimensional arrangement of data, but all of its elements must be of a single type. To perform mathematical operations, all columns of a matrix must be numeric. A data frame, by contrast, can store numeric, character, and factor columns, but it does not support certain mathematical operations, such as matrix multiplication; for these, we use matrix objects. To create a matrix, we can use the matrix() function or convert a numeric data frame to a matrix using as.matrix().
We can convert the data frame that we created earlier as diab.dat to a matrix using as.matrix(). However, because diab.dat contains non-numeric columns, the result is a character matrix, which is not suitable for performing mathematical operations, as shown in the following example:
# data frame to matrix conversion
> mat.diab <- as.matrix(diab.dat)
> mat.diab
     var1  var2 var3           var4
[1,] "101" "25" "Non-Diabetic" "male"
[2,] "102" "22" "Diabetic"     "male"
[3,] "103" "29" "Non-Diabetic" "female"
[4,] "104" "34" "Non-Diabetic" "female"
[5,] "105" "33" "Diabetic"     "male"
> class(mat.diab)
[1] "matrix"
> mode(mat.diab)
[1] "character"
# matrix multiplication is not possible with this newly created matrix
> t(mat.diab) %*% mat.diab
Error in t(mat.diab) %*% mat.diab :
  requires numeric/complex matrix/vector arguments
# creating a matrix with numeric elements only
# To produce the same matrix over time we set a seed value
> set.seed(12345)
> num.mat <- matrix(rnorm(9),nrow=3,ncol=3)
> num.mat
           [,1]       [,2]       [,3]
[1,]  0.5855288 -0.4534972  0.6300986
[2,]  0.7094660  0.6058875 -0.2761841
[3,] -0.1093033 -1.8179560 -0.2841597
> class(num.mat)
[1] "matrix"
> mode(num.mat)
[1] "numeric"
# matrix multiplication
> t(num.mat) %*% num.mat
          [,1]       [,2]       [,3]
[1,] 0.8581332 0.36302951 0.20405722
[2,] 0.3630295 3.87772320 0.06350551
[3,] 0.2040572 0.06350551 0.55404860
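When a data frame mixes numeric and non-numeric columns, one possible approach is to select only the numeric columns before calling as.matrix(). The following is a minimal sketch under that assumption; demo.dat and its values are hypothetical stand-ins rather than objects from the text.

```r
# Hypothetical data frame with two numeric columns and one character column
demo.dat <- data.frame(var1 = c(101, 102, 103),
                       var2 = c(25, 22, 29),
                       var3 = c("a", "b", "c"))

# Keep only the numeric columns, then convert to a matrix
num.only <- as.matrix(demo.dat[, c("var1", "var2")])
mode(num.only)   # "numeric"

# Matrix multiplication now works: a 2 x 3 matrix times a 3 x 2 matrix
cross <- t(num.only) %*% num.only
dim(cross)       # 2 2
```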
An array is a multiply subscripted collection of data entries of a single type. Matrices are restricted to two dimensions, but an array can have any number of dimensions. Sometimes, we need to store multiple matrices of the same size in a single object; in this case, we can use an array to store this data.
Here is a simple example to store three matrices of order 2 x 2 in a single array object:
> mat.array=array(dim=c(2,2,3))
# To produce the same results over time we set a seed value
> set.seed(12345)
> mat.array[,,1]<-rnorm(4)
> mat.array[,,2]<-rnorm(4)
> mat.array[,,3]<-rnorm(4)
> mat.array
, , 1

          [,1]       [,2]
[1,] 0.5855288 -0.1093033
[2,] 0.7094660 -0.4534972

, , 2

           [,1]       [,2]
[1,]  0.6058875  0.6300986
[2,] -1.8179560 -0.2761841

, , 3

           [,1]       [,2]
[1,] -0.2841597 -0.1162478
[2,] -0.9193220  1.8173120
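Individual matrices and elements can be pulled back out of an array by subscripting. The sketch below uses a hypothetical array, demo.array, filled with the integers 1 to 12 instead of random numbers so that the results are easy to follow.

```r
# Hypothetical 2 x 2 x 3 array filled column by column with 1..12
demo.array <- array(1:12, dim = c(2, 2, 3))
dim(demo.array)        # 2 2 3

# Leaving the first two subscripts empty extracts a whole 2 x 2 matrix
second.mat <- demo.array[, , 2]
second.mat             # holds 5, 6, 7, 8 (filled by column)

# Giving all three subscripts extracts a single element:
# row 1, column 2 of the third matrix
demo.array[1, 2, 3]    # 11
```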
A list object is a generic R object that can store other objects of any type. In a list object, we can store single constants, vectors of numeric values, factors, data frames, matrices, and even arrays.
Recalling the var1, var2, var3, and var4 vectors, the data frame created using these vectors, and the array created in the Arrays section, we will create a list object in the following example:
> var1 <- c(101,102,103,104,105)
> var2 <- c(25,22,29,34,33)
> var3 <- c("Non-Diabetic", "Diabetic", "Non-Diabetic", "Non-Diabetic", "Diabetic")
> var4 <- factor(c("male","male","female","female","male"))
> diab.dat <- data.frame(var1,var2,var3,var4)
> mat.array<-array(dim=c(2,2,3))
> set.seed(12345)
> mat.array[,,1]<-rnorm(4)
> mat.array[,,2]<-rnorm(4)
> mat.array[,,3]<-rnorm(4)
# creating list
> obj.list <- list(elem1=var1,elem2=var2,elem3=var3,elem4=var4,elem5=diab.dat,elem6=mat.array)
> obj.list
$elem1
[1] 101 102 103 104 105

$elem2
[1] 25 22 29 34 33

$elem3
[1] "Non-Diabetic" "Diabetic" "Non-Diabetic" "Non-Diabetic" "Diabetic"

$elem4
[1] male male female female male
Levels: female male

$elem5
  var1 var2         var3   var4
1  101   25 Non-Diabetic   male
2  102   22     Diabetic   male
3  103   29 Non-Diabetic female
4  104   34 Non-Diabetic female
5  105   33     Diabetic   male

$elem6
, , 1

          [,1]       [,2]
[1,] 0.5855288 -0.1093033
[2,] 0.7094660 -0.4534972

, , 2

           [,1]       [,2]
[1,]  0.6058875  0.6300986
[2,] -1.8179560 -0.2761841

, , 3

           [,1]       [,2]
[1,] -0.2841597 -0.1162478
[2,] -0.9193220  1.8173120
To access individual elements from a list object, we can use the name of that element (for example, obj.list$elem1) or use double square brackets with the index of the element. For example, obj.list[[1]] will give the first element of the newly created list object.
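The access styles for lists can be sketched with a smaller, self-contained list. The name demo.list and its contents are hypothetical and serve only to keep the example short.

```r
# Hypothetical two-element list for illustration
demo.list <- list(elem1 = c(101, 102, 103),
                  elem2 = c("a", "b"))

by.dollar <- demo.list$elem1        # access by name with the dollar sign
by.name   <- demo.list[["elem1"]]   # access by name with double brackets
by.index  <- demo.list[[1]]         # access by position with double brackets

identical(by.dollar, by.index)      # TRUE

# Single brackets return a sub-list, not the element itself
class(demo.list[1])                 # "list"

# Double brackets can be followed by further subscripts
demo.list[[1]][2]                   # 102
```

Double brackets extract the stored object itself, whereas single brackets return a shorter list that still wraps the object.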