R Programming By Example

By: Omar Trejo Navarro

Overview of this book

R is a high-level statistical language that is widely used among statisticians and data miners to develop analytical applications. Often, data analysts with great analytical skills lack solid programming knowledge and are unfamiliar with the correct ways to use R. Based on R version 3.4, this book will help you develop strong fundamentals when working with R by taking you through a series of full, representative examples, giving you a holistic view of R. We begin with the basic installation and configuration of the R environment. As you progress through the exercises, you'll become thoroughly acquainted with R's features and its packages. With this book, you will learn about the basic concepts of R programming, work efficiently with graphs, create publication-ready and interactive 3D graphs, and gain a better understanding of the data at hand. The detailed step-by-step instructions will enable you to get a clean set of data, produce good visualizations, and create reports for the results. It also teaches you various methods to perform code profiling and performance enhancement with good programming practices, delegation, and parallelization. By the end of this book, you will know how to efficiently work with data, create quality visualizations and reports, and develop code that is modular, expressive, and maintainable.
Table of Contents (12 chapters)

Working with data types and data structures

This section summarizes the most important data types and data structures in R. In this brief overview, we won't discuss them in depth. We will only show a couple of examples that will allow you to understand the code shown throughout this book. If you want to dig deeper into them, you may look into their documentation or some of the references pointed out in this chapter's introduction.

The basic data types in R are numbers, text, and Boolean values (TRUE or FALSE), which R calls numerics, characters, and logicals, respectively. Strictly speaking, there are also types for integers, complex numbers, and raw data (bytes), but we won't use them explicitly in this book. The six basic data structures in R are vectors, factors, matrices, arrays, data frames, and lists. We will summarize all of them except arrays, which are only mentioned briefly, in the following sections.

Numerics

Numbers in R behave pretty much as you would mathematically expect them to. For example, the operation 2 / 3 performs real division, which results in 0.6666667 in R. This natural numeric behavior is very convenient for data analysis, as you don't need to pay too much attention when using numbers of different types, which may require special handling in other languages. The usual mathematical precedence for operators also applies, as does the use of parentheses.

The following example shows how variables can be used within operations, and how operator priorities are handled. As you can see, you may mix the use of variables with values when performing operations:

x <- 2
y <- 3
z <- 4
(x * y + z) / 5
#> [1] 2

The modulo operation can be performed with the %% symbol, while integer division can be performed with the %/% symbol:

7 %% 3
#> [1] 1
7 %/% 3
#> [1] 2
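
The two operators are related: the integer quotient and the remainder together reconstruct the original number. The following is a small sketch of that relationship, with variable names (a and b) chosen only for illustration. It also shows that, with a positive divisor, R returns a non-negative remainder even when the dividend is negative:

```r
a <- 7
b <- 3

# Quotient times divisor plus remainder gives back the original number
(a %/% b) * b + (a %% b)
#> [1] 7

# With a positive divisor, the remainder is never negative in R
(-7) %% 3
#> [1] 2
(-7) %/% 3
#> [1] -3
```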

Special values

There are a few special values in R. NA, which stands for not available, is used to represent missing values. If a computation results in a number that is too big, R will return Inf for a positive number and -Inf for a negative number, meaning positive and negative infinity, respectively. These are also returned when a nonzero number is divided by 0. Sometimes a computation will produce a result that makes little sense. In these cases, we will get NaN, which stands for not a number. And, finally, there is a null object, represented by NULL. The symbol NULL always points to the same object (which is a data type on its own) and is often used as a default argument in functions to mean that no value was passed through. You should know that NA, Inf, -Inf, NaN, and NULL are not substitutes for each other.

There are specific NA values for numerics, characters, and logicals, but we will stick to the simple NA, which is internally treated as a logical.

In the following example, you can see how these special values behave when used among themselves in R. Note that 1 / 0 results in Inf and -1 / 0 results in -Inf. The operations 0 / 0, Inf - Inf, and Inf / Inf are undefined, which R represents with NaN, while Inf + Inf, 0 / Inf, and Inf / 0 result in Inf, 0, and Inf, respectively. It's no coincidence that these results resemble mathematical definitions. Also note that arithmetic operations involving NaN or NA will generally result in NaN and NA, respectively:

1 / 0
#> [1] Inf
-1 / 0
#> [1] -Inf
0 / 0
#> [1] NaN
Inf + Inf
#> [1] Inf
Inf - Inf
#> [1] NaN
Inf / Inf
#> [1] NaN
Inf / 0
#> [1] Inf
0 / Inf
#> [1] 0
Inf / NaN
#> [1] NaN
Inf + NA
#> [1] NA
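
Since these special values are not substitutes for each other, it can help to know that base R provides a separate test function for each of them. The following sketch uses a vector named special, a name chosen here only for illustration:

```r
special <- c(1, NA, NaN, Inf, -Inf)

# is.na() is TRUE for both NA and NaN
is.na(special)
#> [1] FALSE  TRUE  TRUE FALSE FALSE

# is.nan() is TRUE only for NaN
is.nan(special)
#> [1] FALSE FALSE  TRUE FALSE FALSE

# is.infinite() is TRUE for Inf and -Inf
is.infinite(special)
#> [1] FALSE FALSE FALSE  TRUE  TRUE

# NULL is an empty object rather than a value inside a vector
is.null(NULL)
#> [1] TRUE
length(NULL)
#> [1] 0
```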

Characters

Text can be used just as easily; you just need to remember to use quotation marks (" ") around it. The following example shows how to save the text Hi, there! and 10 in two variables. Note that since "10" is surrounded by quotation marks, it is text and not a numeric value. To find out what type of object you're actually dealing with, you can use the class(), typeof(), and str() (short for structure) functions to get the metadata for the object in question.

In this case, since the y variable contains text, we can't multiply it by 2, as is seen in the error we get. Also, if you want to know the number of characters in a string, you can use the nchar() function, as follows:

x <- "Hi, there!"
y <- "10"
class(y)
#> [1] "character"
typeof(y)
#> [1] "character"
str(y)
#> chr "10"
y * 2
#> Error in y * 2: non-numeric argument to binary operator
nchar(x)
#> [1] 10
nchar(y)
#> [1] 2

Sometimes, you may have text information, as well as numeric information that you want to combine into a single string. In this case, you should use the paste() function. This function receives an arbitrary number of unnamed arguments, which is something we will define more precisely in a later section in this chapter. It then transforms each of these arguments into characters, and returns a single string with all of them combined. The following code shows such an example. Note how the numeric value of 10 in y was automatically transformed into a character type so that it could be pasted inside the rest of the string:

x <- "the x variable"
y <- 10
paste("The result for", x, "is", y)
#> [1] "The result for the x variable is 10"
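
By default, paste() separates its arguments with a single space. The sketch below shows the sep and collapse arguments and the paste0() shortcut, all of which are part of base R. The file name pattern is only an illustrative assumption:

```r
x <- "the x variable"
y <- 10

# sep controls the separator placed between arguments (a space by default)
paste("The result for", x, "is", y, sep = " ")
#> [1] "The result for the x variable is 10"

# paste0() is a shortcut for sep = ""
paste0("file_", y, ".csv")
#> [1] "file_10.csv"

# collapse joins the elements of a vector into a single string
paste(c("a", "b", "c"), collapse = "-")
#> [1] "a-b-c"
```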

Other times, you will want to replace some characters within a piece of text. In that case, you should use the gsub() function, which stands for global substitution. This function receives the string to be replaced as its first argument, the string to replace with as its second argument, and it will return the text in the third argument with the corresponding replacements:

x <- "The ball is blue"
gsub("blue", "red", x)
#> [1] "The ball is red"

Yet other times, you will want to know whether a string contains a substring, in which case you should use the grepl() function. The name for this function comes from the terminal command known as grep, which is an acronym for global regular expression print (yes, you can also use regular expressions to look for matches). The l at the end of grepl() comes from the fact that the result is a logical:

x <- "The sky is blue"
grepl("blue", x)
#> [1] TRUE
grepl("red", x)
#> [1] FALSE
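
Since the pattern is treated as a regular expression by default, a few variations are worth trying. The following sketch uses the ignore.case and fixed arguments of grepl(). The example strings are ours:

```r
x <- "The sky is blue"

# "^The" only matches if the text starts with "The"
grepl("^The", x)
#> [1] TRUE

# ignore.case = TRUE makes the match case-insensitive
grepl("BLUE", x, ignore.case = TRUE)
#> [1] TRUE

# fixed = TRUE treats the pattern as literal text, so "." only matches a dot
grepl(".", x, fixed = TRUE)
#> [1] FALSE

# grepl() returns one logical per element of the text argument
grepl("blue", c("blue sky", "red ball"))
#> [1]  TRUE FALSE
```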

Logicals

Logical vectors contain Boolean values, which can only be TRUE or FALSE. When you want to create logical variables with such values, you must avoid using quotation marks around them and remember that they are all capital letters, as shown here. When programming in R, logical values are commonly used to test a condition, which is in turn used to decide which branch from a complex program we should take. We will look at examples for this type of behavior in a later section in this chapter:

x <- TRUE

In R, you can easily convert values among different types with the as.*() functions, where * is used as a wildcard which can be replaced with character, numeric, or logical to convert among these types. The functions work by receiving an object of a different type from what the function name specifies and returning the object parsed into the specified type if possible, or an NA if it's not possible. The following example shows how to convert the TRUE string into a logical value, which in this case unsurprisingly turns out to be the logical TRUE:

as.logical("TRUE")
#> [1] TRUE

Converting from characters and numerics into logicals is one of those things that is not very intuitive in R. The rules below describe some of this behavior. Note that even though the true string (all lowercase letters) is not a valid logical value when removing quotation marks, it is converted into a TRUE value when applying as.logical() to it, for compatibility reasons. Also note that since T is a valid logical value, which is a shortcut for TRUE, its corresponding text is also accepted as meaning such a value. The same logic applies to false and F. Any other string will return an NA value, meaning that the string could not be parsed as a logical value. Also note that 0 will be parsed as FALSE, but any other numeric value, including Inf, will be converted to a TRUE value. Finally, note that both NA and NaN will be parsed, returning NA in both cases.
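
The conversions just described can be checked directly in the console. This is a minimal sketch. The "yes" string is simply an arbitrary example of text that as.logical() does not recognize:

```r
# Strings: a few spellings are recognized, anything else becomes NA
as.logical(c("TRUE", "true", "T", "FALSE", "false", "F", "yes"))
#> [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE    NA

# Numerics: 0 is FALSE, every other number (including Inf) is TRUE
as.logical(c(0, 1, -5, Inf))
#> [1] FALSE  TRUE  TRUE  TRUE

# Missing and undefined values both become NA
as.logical(c(NA, NaN))
#> [1] NA NA
```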

The as.character() and as.numeric() functions have less counter-intuitive behavior, and I will leave you to explore them on your own. When you do, try to test as many edge cases as you can. Doing so will help you foresee possible issues as you develop your own programs.

Before we move on, you should know that these data structures can be organized by their dimensionality and whether they're homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types). Vectors, matrices, and arrays are homogeneous data structures, while lists and data frames are heterogeneous. Vectors and lists have a single dimension, matrices and data frames have two dimensions, and arrays can have as many dimensions as we want.

When it comes to dimensions, arrays in R are different from arrays in many other languages, where you would have to create an array of arrays to produce a two-dimensional structure, which is not necessary in R.
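
Arrays are not covered further in this section, so here is a brief sketch of how one can be created with the base array() function. The names a and m are illustrative only:

```r
# A three-dimensional array: 2 rows, 3 columns, and 4 layers
a <- array(1:24, dim = c(2, 3, 4))

dim(a)
#> [1] 2 3 4

# One index per dimension selects a single element
a[2, 3, 4]
#> [1] 24

# A matrix is simply the two-dimensional case
m <- array(1:6, dim = c(2, 3))
is.matrix(m)
#> [1] TRUE
```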

Vectors

The fundamental data structure in R is the vector, which is an ordered collection of values. The first thing you should know is that, unlike in other languages, single values for numbers, strings, and logicals are special cases of vectors (vectors of length one), which means that there's no concept of scalars in R. A vector is a one-dimensional data structure and all of its elements are of the same data type.

The simplest way to create a vector is with the c() function, which stands for combine, and coerces all of its arguments into a single type. The coercion will happen from simpler types into more complex types. That is, if we create a vector which contains logicals, numerics, and characters, as the following example shows, our resulting vector will only contain characters, which are the more complex of the three types. If we create a vector that contains logicals and numerics, our resulting vector will be numeric, again because it's the more complex type.

Vectors can be named or unnamed. Unnamed vector elements can only be accessed through positional references, while named vectors can be accessed through positional references as well as name references. In the example below, the y vector is a named vector, where each element is named with a letter from A to I. This means that in the case of x, we can only access elements using their position (the first position is considered as 1 instead of the 0 used in other languages), but in the case of y, we may also use the names we assigned.

Also note that the special values we mentioned before, that is, NA, NaN, and Inf, will be coerced into characters if that's the more complex type, except NA, which stays the same. NULL is dropped altogether because it has no elements, which is why x has 10 elements and y has 8 even though 11 and 9 values were passed to c(). In case coercion is happening toward numerics, NA, NaN, and Inf all stay the same since they are valid numeric values. Finally, if we want to know the length of a vector, we simply call the length() function on it:

x <- c(TRUE, FALSE, -1, 0, 1, "A", "B", NA, NULL, NaN, Inf)
x
#> [1] "TRUE" "FALSE" "-1" "0" "1" "A" "B" NA
#> [9] "NaN" "Inf"
x[1]
#> [1] "TRUE"
x[5]
#> [1] "1"
y <- c(A=TRUE, B=FALSE, C=-1, D=0, E=1, F=NA, G=NULL, H=NaN, I=Inf)
y
#>   A   B   C   D   E   F   H   I
#>   1   0  -1   0   1  NA NaN Inf
y[1]
#> A
#> 1
y["A"]
#> A
#> 1
y[5]
#> E
#> 1
y["E"]
#> E
#> 1
length(x)
#> [1] 10
length(y)
#> [1] 8
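
The coercion order used by c() (logical, then numeric, then character) can be verified with class(). The following is a small sketch with made-up values:

```r
# c() coerces everything to the most complex type present
class(c(TRUE, FALSE))
#> [1] "logical"
class(c(TRUE, 2))
#> [1] "numeric"
class(c(TRUE, 2, "a"))
#> [1] "character"

# TRUE and FALSE become 1 and 0 when coerced to numeric
c(TRUE, FALSE, 5)
#> [1] 1 0 5

# NULL contributes no element at all
length(c(1, NULL, 3))
#> [1] 2
```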

Furthermore, we can select sets or ranges of elements using vectors with index numbers for the values we want to retrieve. For example, using the selector c(1, 2) would retrieve the first two elements of the vector, while using c(1, 3, 5) would return the first, third, and fifth elements. The : function (yes, it's a function even though we don't normally use the function-like syntax we have seen so far in other examples to call it) is often used as a shortcut to create range selectors. For example, the 1:5 syntax means that we want a vector with elements 1 through 5, which is equivalent to explicitly using c(1, 2, 3, 4, 5). Furthermore, we can send a vector of logicals, which should have the same length as the vector we want to retrieve values from. Each logical value is associated with the corresponding position in that vector: if the logical is TRUE, the value is retrieved, and if it's FALSE, it isn't. Note that the logical selector in the last call has 11 elements while x only has 10 (NULL was dropped when x was created), so its final TRUE points past the end of x and produces an NA. All of these selection methods are shown in the following example:

x[c(1, 2, 3, 4, 5)]
#> [1] "TRUE" "FALSE" "-1" "0" "1"
x[1:5]
#> [1] "TRUE" "FALSE" "-1" "0" "1"
x[c(1, 3, 5)]
#> [1] "TRUE" "-1" "1"
x[c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE,
    FALSE, TRUE, FALSE, TRUE)]
#> [1] "TRUE" "-1" "1" "B" "NaN" NA
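
In practice, logical selectors are rarely typed by hand. They are usually produced by a comparison on the vector itself. The sketch below uses a made-up numeric vector v to show this, along with negative indexes, which drop positions instead of selecting them:

```r
v <- c(10, 25, 3, 42, 8)

# A comparison returns a logical vector with one value per element
v > 9
#> [1]  TRUE  TRUE FALSE  TRUE FALSE

# Used as a selector, it keeps only the elements where the test is TRUE
v[v > 9]
#> [1] 10 25 42

# Negative indexes drop positions instead of selecting them
v[-1]
#> [1] 25  3 42  8
```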

Next, we will talk about operations among vectors. In the case of numeric vectors, we can apply operations element-to-element by simply using operators as we normally would. In this case, R will match the elements of the two vectors pairwise and return a vector. The following example shows how two vectors are added, subtracted, multiplied, and divided in an element-to-element way. Furthermore, since we are working with vectors of the same length, we may want to get their dot-product (if you don't know what a dot-product is, you may take a look at https://en.wikipedia.org/wiki/Dot_product), which we can do using the %*% operator, which performs matrix-like multiplications, in this case vector-to-vector:

x <- c(1, 2, 3, 4)
y <- c(5, 6, 7, 8)
x + y
#> [1] 6 8 10 12
x - y
#> [1] -4 -4 -4 -4
x * y
#> [1] 5 12 21 32
x / y
#> [1] 0.2000 0.3333 0.4286 0.5000
x %*% y
#>      [,1]
#> [1,]   70

If you want to combine multiple vectors into a single one, you can simply call c() on them, and it will flatten them for you automatically. Let's say we want to combine x and y into z such that the y elements appear first. Furthermore, suppose that afterwards we want to sort them, so we apply the sort() function to z:

z <- c(y, x)
z
#> [1] 5 6 7 8 1 2 3 4
sort(z)
#> [1] 1 2 3 4 5 6 7 8

A common source of confusion is how R deals with vectors of different lengths. If we apply an element-to-element operation, like the ones we covered earlier, but using vectors of different lengths, we may expect R to throw an error, as is the case in other languages. However, it does not. Instead, it repeats vector elements in order until they all have the same length. The following example shows three vectors, each of a different length, and the result of adding them together.

The way R is configured by default, you will actually get a warning message to let you know that the vectors you operated on were not of the same length, but since R can be configured to avoid showing warnings, you should not rely on them:

c(1, 2) + c(3, 4, 5) + c(6, 7, 8, 9)
#> Warning in c(1, 2) + c(3, 4, 5): longer object length is not a multiple of
#> shorter object length
#> Warning in c(1, 2) + c(3, 4, 5) + c(6, 7, 8, 9): longer object length is
#> not a multiple of shorter object length
#> [1] 10 13 14 13

The first thing that may come to mind is that the first vector is expanded into c(1, 2, 1, 2), the second vector is expanded into c(3, 4, 5, 3), and the third one is kept as is, since it's the longest one. If we then add these vectors together, the result would be c(10, 13, 14, 14). However, as you can see in the example, the result actually is c(10, 13, 14, 13). So, what are we missing? The source of confusion is that R does this step by step. It first performs the addition c(1, 2) + c(3, 4, 5), which after being expanded is c(1, 2, 1) + c(3, 4, 5) and results in c(4, 6, 6). Given this result, the next step that R performs is c(4, 6, 6) + c(6, 7, 8, 9), which after being expanded is c(4, 6, 6, 4) + c(6, 7, 8, 9), and that's where the result we get comes from. It can be confusing at first, but just remember to imagine the operations step by step.
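
The two steps can be reproduced explicitly with the base rep_len() function, which repeats a vector's elements up to a given length. This is only a sketch of what the recycling amounts to, not of how R implements it internally. The names step1 and step2 are ours:

```r
# Step 1: c(1, 2) is recycled to the length of c(3, 4, 5)
step1 <- rep_len(c(1, 2), 3) + c(3, 4, 5)
step1
#> [1] 4 6 6

# Step 2: the intermediate result is recycled to length 4
step2 <- rep_len(step1, 4) + c(6, 7, 8, 9)
step2
#> [1] 10 13 14 13
```

Because the lengths are matched explicitly, this version produces no warnings.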

Finally, we will briefly mention a very powerful feature in R, known as vectorization. Vectorization means that you apply an operation to a vector at once, instead of independently doing so to each of its elements. This is a feature you should get to know quite well. Programming without it is considered to be bad R code, and not just for syntactic reasons, but because vectorized code takes advantage of many internal optimizations in R, which results in much faster code. We will show different ways of vectorizing code in Chapter 9, Implementing An Efficient Simple Moving Average, and in this chapter, we will see an example, followed by a couple more in following sections.

Even though the phrase vectorized code may seem scary or magical at first, in reality, R makes it quite simple to implement in some cases. For example, we can square each of the elements in the x vector by using the x symbol as if it were a single number. R is intelligent enough to understand that we want to apply the operation to each of the elements in the vector. Many functions in R can be applied using this technique:

x^2
#> [1] 1 4 9 16
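
To make the contrast concrete, the following sketch computes the same squares with an explicit loop and with the vectorized expression. The names squares_loop and squares_vec are chosen for illustration:

```r
x <- c(1, 2, 3, 4)

# Element-by-element version with an explicit loop
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
    squares_loop[i] <- x[i]^2
}

# Vectorized version: one expression operates on the whole vector
squares_vec <- x^2

identical(squares_loop, squares_vec)
#> [1] TRUE

# Many built-in functions are vectorized in the same way
sqrt(squares_vec)
#> [1] 1 2 3 4
```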

We will see more examples that really showcase how vectorization can shine in the following section about functions, where we will see how to apply vectorized operations even when the operations depend on other parameters.

Factors

When analyzing data, it's quite common to encounter categorical values. R provides a good way to represent categorical values using factors, which are created using the factor() function and are integer vectors with associated labels for each integer. The different values that the factor can take are called levels. The levels() function shows all the levels from a factor, and the levels parameter of the factor() function can be used to explicitly define their order, which is alphabetical in case it's not explicitly defined.

Note that defining an explicit order can be important in linear modeling because the first level is used as the baseline level for functions like lm() (linear models), which we will use in Chapter 3, Predicting Votes with Linear Models.

Furthermore, printing a factor shows slightly different information than printing a character vector. In particular, note that the quotes are not shown and that the levels are explicitly printed in order afterwards:

x <- c("Blue", "Red", "Black", "Blue")
y <- factor(c("Blue", "Red", "Black", "Blue"))
z <- factor(c("Blue", "Red", "Black", "Blue"),
            levels=c("Red", "Black", "Blue"))
x
#> [1] "Blue" "Red" "Black" "Blue"
y
#> [1] Blue Red Black Blue
#> Levels: Black Blue Red
z
#> [1] Blue Red Black Blue
#> Levels: Red Black Blue
levels(y)
#> [1] "Black" "Blue" "Red"
levels(z)
#> [1] "Red" "Black" "Blue"

Factors can sometimes be tricky to work with because their types are interpreted differently depending on what function is used to operate on them. Remember the class() and typeof() functions we used before? When used on factors, they may produce unexpected results. As you can see below, the class() function will identify x and y as being character and factor, respectively. However, the typeof() function will let us know that they are character and integer, respectively. Confusing isn't it? This happens because, as we mentioned, factors are stored internally as integers, and use a mechanism similar to look-up tables to retrieve the actual string associated for each one.

Technically, the way factors store the strings associated with their integer values is through attributes, which is a topic we will touch on in Chapter 8, Object-Oriented System to Track Cryptocurrencies.

class(x)
#> [1] "character"
class(y)
#> [1] "factor"
typeof(x)
#> [1] "character"
typeof(y)
#> [1] "integer"

While factors look and often behave like character vectors, as we mentioned, they are actually integer vectors, so be careful when treating them like strings. Some string methods, like gsub() and grepl(), will coerce factors to characters, while others, like nchar(), will throw an error, and still others, like c(), will use the underlying integer values. For this reason, it's usually best to explicitly convert factors to the data type you need:

gsub("Black", "White", x)
#> [1] "Blue" "Red" "White" "Blue"
gsub("Black", "White", y)
#> [1] "Blue" "Red" "White" "Blue"
nchar(x)
#> [1] 4 3 5 4
nchar(y)
#> Error in nchar(y): 'nchar()' requires a character vector
c(x)
#> [1] "Blue" "Red" "Black" "Blue"
c(y)
#> [1] 2 3 1 2
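
A case where explicit conversion matters is a factor whose labels look like numbers. The sketch below uses a made-up factor f. Converting it directly to a number returns the internal integer codes, so the labels have to be recovered as characters first:

```r
f <- factor(c("10", "20", "10", "30"))

# as.character() recovers the labels
as.character(f)
#> [1] "10" "20" "10" "30"

# as.integer() returns the internal codes, not the numbers in the labels
as.integer(f)
#> [1] 1 2 1 3

# To get the numbers, convert to character first
as.numeric(as.character(f))
#> [1] 10 20 10 30
```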

If you did not notice, nchar() applied itself to each of the elements in the x vector. The "Blue", "Red", and "Black" strings have 4, 3, and 5 characters, respectively. This is another example of the vectorized operations we mentioned in the vectors section earlier.

Matrices

Matrices are commonly used in mathematics and statistics, and much of R's power comes from the various operations you can perform with them. In R, a matrix is a vector with two additional attributes, the number of rows and the number of columns. And, since matrices are vectors, they are constrained to a single data type.

You can use the matrix() function to create matrices. You may pass it a vector of values, as well as the number of rows and columns the matrix should have. If you specify the vector of values and one of the dimensions, the other one will be calculated for you automatically to be the lowest number that makes sense for the vector you passed. However, you may specify both of them simultaneously if you prefer, which may produce different behavior depending on the vector you passed, as can be seen in the next example.

By default, matrices are constructed column-wise, meaning that the entries can be thought of as starting in the upper-left corner and running down the columns. However, if you prefer to construct it row-wise, you can send the byrow = TRUE parameter. Also, note that you may create an empty or non-initialized matrix, by specifying the number of rows and columns without passing any actual data for its construction, and if you don't specify anything at all, an uninitialized 1-by-1 matrix will be returned. Finally, note that the same element-repetition mechanism we saw for vectors is applied when creating matrices, so do be careful when creating them this way:

matrix()
#>      [,1]
#> [1,]   NA

matrix(nrow = 2, ncol = 3)
#>      [,1] [,2] [,3]
#> [1,]   NA   NA   NA
#> [2,]   NA   NA   NA

matrix(c(1, 2, 3), nrow = 2)
#> Warning in matrix(c(1, 2, 3), nrow = 2): data length [3] is not a
#> sub-multiple or multiple of the number of rows [2]
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    1

matrix(c(1, 2, 3), nrow = 2, ncol = 3)
#>      [,1] [,2] [,3]
#> [1,]    1    3    2
#> [2,]    2    1    3

matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE)
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    4    5    6
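
The statement that a matrix is a vector with dimension information attached can be checked by assigning dimensions to a plain vector with the base dim() function. This is a minimal sketch. The vector v is ours:

```r
v <- 1:6

# Assigning a dim attribute turns the vector into a 2-by-3 matrix
dim(v) <- c(2, 3)
is.matrix(v)
#> [1] TRUE

dim(v)
#> [1] 2 3
nrow(v)
#> [1] 2
ncol(v)
#> [1] 3

# The underlying values are still stored column by column
as.vector(v)
#> [1] 1 2 3 4 5 6
```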

Matrix subsets can be specified in various ways. Using matrix-like notation, you can specify the row and column selection using the same mechanisms we showed before for vectors: vectors of indexes or vectors of logicals. If you decide to use vectors of logicals, the vector used to subset should be of the same length as the matrix dimension you are using it for. Since in this case we have two dimensions to work with, we must separate the selection for rows and columns with a comma (,) between them (row selection goes first), and R will return their intersection.

For example, x[1, 2] tells R to get the element in the first row and the second column, and x[1:2, 2] tells R to get the first through second elements of the second column. Similarly, x[c(1, 2), 3] gets the same two rows from the third column, since c(1, 2) is equivalent to 1:2 as a selector. You may also use logical vectors for the selection. For example, x[c(TRUE, FALSE), c(TRUE, FALSE, TRUE)] tells R to get the first row while avoiding the second one, and from that row, to get the first and third columns. An equivalent selection is x[1, c(1, 3)]. Note that when you want to specify a single row or column, you can use an integer by itself, but if you want to specify two or more, then you must use vector notation. Finally, if you leave out one of the dimension specifications, R will interpret it as getting all possibilities for that dimension:

x <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)
x[1, 2]
#> [1] 2
x[1:2, 2]
#> [1] 2 5
x[c(1, 2), 3]
#> [1] 3 6
x[c(TRUE, FALSE), c(TRUE, FALSE, TRUE)]
#> [1] 1 3
x[1, c(1, 3)]
#> [1] 1 3
x[, 1]
#> [1] 1 4
x[1, ]
#> [1] 1 2 3
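
Notice that the selections above return plain vectors rather than matrices. If you need the result to remain a matrix, one option is the drop argument of the [ operator, sketched here with the same x matrix:

```r
x <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)

# Selecting a single row drops the matrix structure by default
dim(x[1, ])
#> NULL

# drop = FALSE keeps the result as a 1-by-3 matrix
dim(x[1, , drop = FALSE])
#> [1] 1 3
```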

As mentioned earlier, matrices are basic mathematical tools, and R gives you a lot of flexibility when working with them. The most common matrix operations are transposition, which is performed using the t() function, and matrix-vector, vector-matrix, and matrix-matrix multiplication, which are performed with the %*% operator we used previously to calculate the dot-product of two vectors.

Note that the same dimensionality restrictions apply as with mathematical notation, meaning that in case you try to perform one of these operations and the dimensions don't make mathematical sense, R will throw an error, as can be seen in the last part of the example:

A <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE)
x <- c(7, 8)
y <- c(9, 10, 11)
A
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    4    5    6
x
#> [1] 7 8
y
#> [1] 9 10 11
t(A)
#>      [,1] [,2]
#> [1,]    1    4
#> [2,]    2    5
#> [3,]    3    6
t(x)
#>      [,1] [,2]
#> [1,]    7    8
t(y)
#>      [,1] [,2] [,3]
#> [1,]    9   10   11
x %*% A
#>      [,1] [,2] [,3]
#> [1,]   39   54   69
A %*% t(x)
#> Error in A %*% t(x): non-conformable arguments
A %*% y
#>      [,1]
#> [1,]   62
#> [2,]  152
t(y) %*% A
#> Error in t(y) %*% A: non-conformable arguments
A %*% t(A)
#>      [,1] [,2]
#> [1,]   14   32
#> [2,]   32   77
t(A) %*% A
#>      [,1] [,2] [,3]
#> [1,]   17   22   27
#> [2,]   22   29   36
#> [3,]   27   36   45
A %*% x
#> Error in A %*% x: non-conformable arguments

Lists

A list is an ordered collection of objects, like vectors, but lists can actually combine objects of different types. List elements can contain any type of object that exists in R, including data frames and functions (explained in the following sections). Lists play a central role in R due to their flexibility and they are the basis for data frames, object-oriented programming, and other constructs. Learning to use them properly is a fundamental skill for R programmers, and here, we will barely touch the surface, but you should definitely research them further.

For those familiar with Python, R lists are similar to Python dictionaries.

Lists can be explicitly created using the list() function, which takes an arbitrary number of arguments, and we can refer to each of those elements by both position, and, in case they are specified, also by names. If you want to reference list elements by names, you can use the $ notation.

The following example shows how flexible lists can be. It shows a list that contains numerics, characters, logicals, matrices, and even other lists (these are known as nested lists), and as you can see, we can extract each of those elements to work with them independently.

This is the first time we show a multi-line expression. As you can see, you can do it to preserve readability and avoid having very long lines in your code. Arranging code this way is considered to be a good practice. If you're typing this directly in the console, plus symbols (+) will appear in each new line, as long as you have an unfinished expression, to guide you along.

x <- list(
A = 1,
B = "A",
C = TRUE,
D = matrix(c(1, 2, 3, 4), nrow = 2),
E = list(F = 2, G = "B", H = FALSE)
)

x
#> $A
#> [1] 1
#>
#> $B
#> [1] "A"
#>
#> $C
#> [1] TRUE
#>
#> $D
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
#>
#> $E
#> $E$F
#> [1] 2
#>
#> $E$G
#> [1] "B"
#>
#> $E$H
#> [1] FALSE

x[1]
#> $A
#> [1] 1

x$A
#> [1] 1

x[2]
#> $B
#> [1] "A"

x$B
#> [1] "A"
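
Note the difference in the output above: x[1] prints the $A label because single brackets return a list, while x$A returns the element itself. Double brackets behave like $ in this respect. The sketch below illustrates this with a shortened version of the list:

```r
x <- list(A = 1, B = "A", C = TRUE)

# Single brackets return a list containing the selected elements
class(x[1])
#> [1] "list"

# Double brackets (like $) return the element itself
class(x[[1]])
#> [1] "numeric"
x[["B"]]
#> [1] "A"

# This matters when you want to compute with the value
x[[1]] + 1
#> [1] 2
```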

When working with lists, we can use the lapply() function to apply a function to each of the elements in a list. In this case, we want to know the class and type of each of the elements in the list we just created:

lapply(x, class)
#> $A
#> [1] "numeric"
#>
#> $B
#> [1] "character"
#>
#> $C
#> [1] "logical"
#>
#> $D
#> [1] "matrix"
#>
#> $E
#> [1] "list"

lapply(x, typeof)
#> $A
#> [1] "double"
#>
#> $B
#> [1] "character"
#>
#> $C
#> [1] "logical"
#>
#> $D
#> [1] "double"
#>
#> $E
#> [1] "list"
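
When every result has the same simple shape, a list can be more structure than you need. Base R also provides sapply(), which tries to simplify the result into a vector, and vapply(), which additionally checks each result against a template. The following sketch uses a shortened version of the list:

```r
x <- list(A = 1, B = "A", C = TRUE)

# sapply() simplifies the per-element results into a named vector
types <- sapply(x, typeof)
types
#>           A           B           C
#>    "double" "character"   "logical"

# vapply() does the same but requires each result to be a single integer here
lengths_checked <- vapply(x, length, integer(1))
lengths_checked
#> A B C
#> 1 1 1
```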

Data frames

Now we turn to data frames, which are a lot like spreadsheets or database tables. In scientific contexts, experiments consist of individual observations (rows), each of which involves several different variables (columns). Often, these variables contain different data types, which would not be possible to store in matrices since they must contain a single data type. A data frame is a natural way to represent such heterogeneous tabular data. Every element within a column must be of the same type, but different elements within a row may be of different types, which is why we say that a data frame is a heterogeneous data structure.

Technically, a data frame is a list whose elements are equal-length vectors, and that's why it permits heterogeneity.
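
This list nature can be observed directly. The following is a small sketch with a made-up two-column data frame named df:

```r
df <- data.frame(A = c(1, 2, 3), C = c(TRUE, FALSE, NA))

is.list(df)
#> [1] TRUE

# As with any list, length() counts the elements, which here are the columns
length(df)
#> [1] 2

# List-style access works on columns
df[["A"]]
#> [1] 1 2 3

# Each column keeps its own type
sapply(df, class)
#>         A         C
#> "numeric" "logical"
```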

Data frames are usually created by reading in data using the read.table(), read.csv(), or other similar data-loading functions. However, they can also be created explicitly with the data.frame() function, or they can be coerced from other types of objects such as lists. To create a data frame using the data.frame() function, note that we send a vector (which, as we know, must contain elements of a single type) to each of the column names we want our data frame to have, which are A, B, and C in this case. The data frame we create below has four rows (observations) and three variables, built from numeric, character, and logical vectors, respectively. Note that in the R version used in this book, data.frame() converts character vectors into factors by default, which is why the output for column B shows Levels. Finally, you can extract subsets of data using the matrix techniques we saw earlier, but you can also reference columns using the $ operator and then extract elements from them:

x <- data.frame(
    A = c(1, 2, 3, 4),
    B = c("D", "E", "F", "G"),
    C = c(TRUE, FALSE, NA, FALSE)
)
x[1, ]
#>   A B    C
#> 1 1 D TRUE
x[, 1]
#> [1] 1 2 3 4
x[1:2, 1:2]
#>   A B
#> 1 1 D
#> 2 2 E
x$B
#> [1] D E F G
#> Levels: D E F G
x$B[2]
#> [1] E
#> Levels: D E F G
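
The Levels line in this output shows that column B was stored as a factor rather than as a character vector, which is the default behavior of data.frame() in the R 3.x series. If you would rather keep text columns as plain characters, one option is the stringsAsFactors argument. The sketch below uses a shortened, made-up data frame named df_chr. In R 4.0.0 and later, FALSE is already the default:

```r
df_chr <- data.frame(
    A = c(1, 2),
    B = c("D", "E"),
    stringsAsFactors = FALSE  # keep B as a character vector
)

class(df_chr$B)
#> [1] "character"
df_chr$B[2]
#> [1] "E"
```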

Depending on how the data is organized, the data frame is said to be in either wide or narrow formats (https://en.wikipedia.org/wiki/Wide_and_narrow_data). Finally, if you want to keep only observations for which you have complete cases, meaning only rows that don't contain any NA values for any of the variables, then you should use the complete.cases() function, which returns a logical vector of length equal to the number of rows, and which contains a TRUE value for those rows that don't have any NA values and FALSE for those that have at least one such value.

Note that when we created the x data frame, the C column contains an NA in its third value. If we use the complete.cases() function on x, then we will get a FALSE value for that row and a TRUE value for all others. We can then use this logical vector to subset the data frame just as we have done before with matrices. This can be very useful when analyzing data that may not be clean, and for which you only want to keep those observations for which you have full information:

x
#>   A B     C
#> 1 1 D  TRUE
#> 2 2 E FALSE
#> 3 3 F    NA
#> 4 4 G FALSE

complete.cases(x)
#> [1] TRUE TRUE FALSE TRUE
x[complete.cases(x), ]
#>   A B     C
#> 1 1 D  TRUE
#> 2 2 E FALSE
#> 4 4 G FALSE
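
Before dropping rows, it is often useful to see where the missing values are. The following sketch counts them per column with is.na() and colSums(), and uses the base na.omit() function as an alternative way of keeping only complete rows. The data frame is a shortened version of the one above:

```r
x <- data.frame(
    A = c(1, 2, 3, 4),
    C = c(TRUE, FALSE, NA, FALSE)
)

# Count the missing values in each column
na_counts <- colSums(is.na(x))
na_counts
#> A C
#> 0 1

# na.omit() keeps the same rows as x[complete.cases(x), ]
nrow(na.omit(x))
#> [1] 3
```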