R Programming By Example

By: Omar Trejo Navarro

Overview of this book

R is a high-level statistical language that is widely used among statisticians and data miners to develop analytical applications. Often, data analysts with great analytical skills lack solid programming knowledge and are unfamiliar with the correct ways to use R. Based on version 3.4, this book will help you develop strong fundamentals when working with R by taking you through a series of full, representative examples, giving you a holistic view of R. We begin with the basic installation and configuration of the R environment. As you progress through the exercises, you'll become thoroughly acquainted with R's features and its packages. With this book, you will learn about the basic concepts of R programming, work efficiently with graphs, create publication-ready and interactive 3D graphs, and gain a better understanding of the data at hand. The detailed step-by-step instructions will enable you to get a clean set of data, produce good visualizations, and create reports for the results. It also teaches you various methods to perform code profiling and performance enhancement with good programming practices, delegation, and parallelization. By the end of this book, you will know how to efficiently work with data, create quality visualizations and reports, and develop code that is modular, expressive, and maintainable.
Table of Contents (12 chapters)

Divide and conquer with functions

Functions are a fundamental building block of R. To master many of the more advanced techniques in this book, you need a solid foundation in how they work. We've already used a few functions above since you can't really do anything interesting in R without them. They are just what you remember from your mathematics classes: a way to transform inputs into outputs. Specifically in R, a function is an object that takes other objects as inputs, called arguments, and returns an output object. Most function calls take the form f(argument_1, argument_2, ...), where f is the name of the function, and argument_1, argument_2, and so on are the arguments to the function.

Before we continue, we should briefly mention the role of curly braces ({}) in R. Often they are used to group a set of operations in the body of a function, but they can also be used in other contexts (as we will see in the case of the web application we will build in Chapter 10, Adding Interactivity with Dashboards). Curly braces are used to evaluate a series of expressions, which are separated by newlines or semicolons, and return only the value of the last expression as a result. For example, the following line prints only the result of the x + y operation to the screen, hiding the output of the x * y operation, which would have been printed had we typed the expressions one by one. In this sense, curly braces are used to encapsulate a set of behaviors and only provide the result from the last expression:

{ x <- 1; y <- 2; x * y; x + y }
#> [1] 3
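
Because a braced block evaluates to the value of its last expression, that value can also be assigned to a symbol. The following is a minimal sketch of that idea; the name result is only used for illustration:

```r
# The whole braced block evaluates to its last expression (x + y)
result <- {
    x <- 1
    y <- 2
    x * y   # evaluated, but its value is discarded
    x + y
}
result
#> [1] 3
```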

We can create our own function by using the function() constructor and assign it to a symbol. The function() constructor takes an arbitrary number of named arguments, which can be used within the body of the function. Unnamed arguments can also be passed using the "..." argument notation, but that's an advanced technique we won't look at in this book. Feel free to read the documentation for functions to learn more about them.

When calling the function, arguments can be passed by position or by name. The positional order must correspond to the order provided in the function's signature (that is, the function() specification with the corresponding arguments), but when using named arguments, we can send them in whatever order we prefer.

In the following example, we create a function based on the Euclidean distance (https://en.wikipedia.org/wiki/Euclidean_distance) between two numeric vectors, and we show how the order of the arguments can be changed if we use named arguments. Note that the function as written returns the sum of squared differences, that is, the squared Euclidean distance, because it does not take the final square root. To see the effect of argument order, we use the print() function to make sure we can see in the console what R is receiving as the x and y vectors. When developing your own programs, using the print() function in similar ways is very useful to understand what's happening.

Instead of using a function name like euclidean_distance, we will use l2_norm because it's the generalized name for such an operation when working with spaces with an arbitrary number of dimensions and because it will make a follow-up example easier to understand. Note that even though outside the function call our vectors are called a and b, since they are passed into the x and y arguments, those are the names we need to use within our function. Had we used the x and y names in both places, it would be easy for beginners to mistake the objects outside the function for the ones inside it:

l2_norm <- function(x, y) {
    print("x")
    print(x)
    print("y")
    print(y)
    element_to_element_difference <- x - y
    result <- sum(element_to_element_difference^2)
    return(result)
}

a <- c(1, 2, 3)
b <- c(4, 5, 6)

l2_norm(a, b)
#> [1] "x"
#> [1] 1 2 3
#> [1] "y"
#> [1] 4 5 6
#> [1] 27

l2_norm(b, a)
#> [1] "x"
#> [1] 4 5 6
#> [1] "y"
#> [1] 1 2 3
#> [1] 27

l2_norm(x = a, y = b)
#> [1] "x"
#> [1] 1 2 3
#> [1] "y"
#> [1] 4 5 6
#> [1] 27

l2_norm(y = b, x = a)
#> [1] "x"
#> [1] 1 2 3
#> [1] "y"
#> [1] 4 5 6
#> [1] 27
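
Note that l2_norm() as written returns the sum of squared differences, which is the squared Euclidean distance. As a minimal sketch, the hypothetical euclidean_distance() function below (a name introduced here only for illustration) takes the square root to obtain the distance itself:

```r
# Hypothetical helper: square root of the sum of squared differences
euclidean_distance <- function(x, y) sqrt(sum((x - y)^2))

a <- c(1, 2, 3)
b <- c(4, 5, 6)

euclidean_distance(a, b)   # sqrt(27)
#> [1] 5.196152
```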

Functions may use the return() function to specify the value returned by the function. However, R will simply return the last evaluated expression as the result of a function, so you may see code that does not make use of the return() function explicitly.
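
To make the equivalence concrete, here is a small sketch with two hypothetical functions, with_return() and without_return(), that differ only in whether return() is called explicitly:

```r
# Explicit return()
with_return <- function(x) {
    return(x^2)
}

# Implicit return: the last evaluated expression is the result
without_return <- function(x) {
    x^2
}

with_return(4)
#> [1] 16
without_return(4)
#> [1] 16
identical(with_return(4), without_return(4))
#> [1] TRUE
```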

Our previous l2_norm() function implementation seems somewhat cluttered. Since we know that it's working fine, we can remove the print() function calls and avoid creating intermediate objects without hesitation. That leaves the function with a single expression, which means we can also drop the curly braces. Furthermore, we avoid explicitly calling the return() function to simplify our code even more. If we do so, our function looks much closer to its mathematical definition and is easier to understand, isn't it?

l2_norm <- function(x, y) sum((x - y)^2)

Furthermore, in case you did not notice, since we use vectorized operations, we can send vectors of any length (dimension), provided that both vectors share the same length, and the function will work just as we expect it to, regardless of the dimensionality of the space we're working with. As I mentioned earlier, vectorization can be quite powerful. In the following example, we show such behavior with vectors of dimension 1 (mathematically known as scalars), as well as vectors of dimension 5, created with the ":" shortcut syntax:

l2_norm(1, 2)
#> [1] 1
l2_norm(1:5, 6:10)
#> [1] 125
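
The condition that both vectors share the same length matters because R recycles the shorter vector instead of failing. The following sketch shows that behavior and one possible guard; l2_norm_checked() is a hypothetical variant, and stopifnot() and tryCatch() are base R functions used here only for illustration:

```r
# Silent recycling: 1:2 is reused as c(1, 2, 1, 2)
sum((1:4 - 1:2)^2)
#> [1] 8

# Hypothetical variant that refuses vectors of different lengths
l2_norm_checked <- function(x, y) {
    stopifnot(length(x) == length(y))
    sum((x - y)^2)
}

l2_norm_checked(1:5, 6:10)
#> [1] 125

# The mismatch is now reported as an error, which we catch to show the outcome
outcome <- tryCatch(
    l2_norm_checked(1:4, 1:2),
    error = function(e) "lengths differ"
)
outcome
#> [1] "lengths differ"
```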

Before we move on, I just want to mention that you should always make an effort to follow the Single Responsibility principle, which states that each object (functions in this case) should focus on doing a single thing, and do it very well. Whenever you describe a function you created as doing "something" and "something else," you're probably doing it wrong, since the "and" should let you know that the function is doing more than one thing, and you should split it into two or more functions that possibly call each other. To read more about good software engineering principles, take a look at Martin's great book titled Agile Software Development, Principles, Patterns, and Practices, Pearson, 2002.
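
As a rough illustration of the principle, consider a function that computes a distance and formats it for display. All names below are hypothetical and only serve the example:

```r
# Does two things: computes a distance AND formats it for display
distance_report <- function(x, y) {
    paste("Distance:", sum((x - y)^2))
}

# Split into two single-purpose functions that can be combined
squared_distance <- function(x, y) sum((x - y)^2)
format_distance <- function(value) paste("Distance:", value)

format_distance(squared_distance(c(1, 2, 3), c(4, 5, 6)))
#> [1] "Distance: 27"
```

With the split version, the calculation can be reused wherever a number is needed, and the formatting can change without touching the calculation.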

Optional arguments

When creating functions, you may specify a default value for an argument, and if you do, then the argument is considered optional. If you do not specify a default value for an argument, and you do not specify a value when calling a function, you will get an error if the function attempts to use the argument.
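
The condition "if the function attempts to use the argument" is worth a quick sketch: R only evaluates an argument when the body actually needs it, so a missing argument that is never used causes no error. The function and argument names below are hypothetical:

```r
# `unused` has no default value, but the body never touches it
double_first <- function(value, unused) value * 2

double_first(3)
#> [1] 6
```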

In the following example, we show that if a single numeric vector is passed to our l2_norm() function as it stands, it will throw an error, but if we redefine it to make the second vector optional, then it will simply return the first vector's norm, not the distance between two different vectors. To accomplish this, we will provide a zero-vector of length one as the default value. Because R repeats vector elements until all the vectors involved in an operation are of the same length, as we saw before in this chapter, it will automatically expand our zero-vector into the appropriate dimension:

l2_norm(a)     # Should throw an error because `y` is missing
#> Error in l2_norm(a): argument "y" is missing, with no default

l2_norm <- function(x, y = 0) sum((x - y)^2)

l2_norm(a)     # Should work fine, since `y` is optional now
#> [1] 14

l2_norm(a, b)  # Should work just as before
#> [1] 27

As you can see, now our function can optionally receive the y vector, but will also work as expected without it. Also, note that we introduced some comments into our code. Anything that comes after the # symbol in a line, R will ignore, which allows us to explain our code where need be. I prefer to avoid using comments because I tend to think that code should be expressive and communicate its intention without the need for comments, but they are actually useful every now and then.
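
Returning to the zero-vector default for a moment, the following short sketch shows the recycling that makes it work: subtracting a length-one zero gives the same result as subtracting a zero-vector of matching length.

```r
a <- c(1, 2, 3)

a - 0            # the single 0 is recycled to match the length of `a`
#> [1] 1 2 3
a - c(0, 0, 0)   # explicit zero-vector of the same length
#> [1] 1 2 3
identical(a - 0, a - c(0, 0, 0))
#> [1] TRUE
```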

Functions as arguments

Sometimes when you want to generalize functions, you may want to plug a certain functionality into a function. You can do that in various ways. For example, you may use conditionals, as we will see in the following section in this chapter, to provide them with different functionality based on context. However, conditionals should be avoided when possible because they can introduce unnecessary complexity into our code. A better solution would be to pass a function as an argument, which will be called when appropriate; if we want to change how a function behaves, we can change the function we're passing in for a specific task.

That may sound complicated, but in reality, it's very simple. Let's start by creating an l1_norm() function that calculates the distance between two vectors but uses the sum of absolute differences among corresponding coordinates instead of the sum of squared differences as our l2_norm() function does. For more information, take a look at the Taxicab geometry article on Wikipedia (https://en.wikipedia.org/wiki/Taxicab_geometry).

Note that we use the same signature for our two functions, meaning that both receive the same required as well as optional arguments, which are x and y in this case. This is important because if we want to change the behavior by switching functions, we must make sure they are able to work with the same inputs, otherwise, we may get unexpected results or even errors:

l1_norm <- function(x, y = 0) sum(abs(x - y))

l1_norm(a)
#> [1] 6
l1_norm(a, b)
#> [1] 9

Now that our l2_norm() and l1_norm() are built so that they can be switched among themselves to provide different behavior, we will create a third distance() function, which will take the two vectors as arguments, but will also receive a norm argument, which will contain the function we want to use to calculate the distance.

Note that we are specifying that we want to use l2_norm() by default in case there's no explicit selection when calling the function, and to do so we simply specify the symbol that contains the function object, without parentheses. Finally, note that if we want to avoid sending the y vector, but we want to specify what norm should be used, then we must pass it through as a named argument; otherwise R would interpret the second argument as the y vector, not the norm function:

distance <- function(x, y = 0, norm = l2_norm) norm(x, y)

distance(a)
#> [1] 14
distance(a, b)
#> [1] 27
distance(a, b, l2_norm)
#> [1] 27
distance(a, b, l1_norm)
#> [1] 9
distance(a, norm = l1_norm)
#> [1] 6
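
Because distance() only requires a function with the same signature, any compatible function can be plugged in, including an anonymous one. The sketch below repeats the definitions from above so that it runs on its own; max_norm() is a hypothetical addition that returns the largest absolute difference between coordinates:

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

l2_norm <- function(x, y = 0) sum((x - y)^2)
distance <- function(x, y = 0, norm = l2_norm) norm(x, y)

# Hypothetical third norm with the same (x, y = 0) signature
max_norm <- function(x, y = 0) max(abs(x - y))

distance(a, b, max_norm)
#> [1] 3
distance(a, norm = max_norm)
#> [1] 3

# The same behavior, passed as an anonymous function
distance(a, b, function(x, y = 0) max(abs(x - y)))
#> [1] 3
```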

Operators are functions

Now that you have a working understanding of how functions work, you should know that not all function calls look like the ones we have shown so far, where you use the name of the function followed by parentheses that contain the function's arguments. Actually, all operations in R, including assigning variables and arithmetic operations, are function calls in the background, even if we mostly write them with a different syntax.

Remember that previously in this chapter we mentioned that R objects could be referred to by almost any string, but you should avoid doing so. Well, here we show how using cryptic names can be useful in certain contexts. The following example shows how the assignment, selection, and addition operators are usually used with syntactic sugar (a term used to describe syntax that exists for ease of use), but that in the background they use the functions named [<-, [, and +, respectively.

The [<-() function receives three arguments: the vector we want to modify, the position we want to modify in the vector, and the value we want it to have at that position. The [() function receives two arguments, the vector from which we want to retrieve a value and the position of the value we want to retrieve. Finally, the +() function receives the two values we want to add. The following example shows the syntax sugar, followed by the background function calls R performs for us:

x <- c(1, 2, 3, 4, 5)
x
#> [1] 1 2 3 4 5
x[1] <- 10
x
#> [1] 10  2  3  4  5
`[<-`(x, 1, 20)
#> [1] 20  2  3  4  5
x
#> [1] 10  2  3  4  5
x[1]
#> [1] 10
`[`(x, 1)
#> [1] 10
x[1] + x[2]
#> [1] 12
`+`(x[1], x[2])
#> [1] 12
`+`(`[`(x, 1), `[`(x, 1))
#> [1] 20

In practice, you would probably never write these statements as explicit function calls. The syntax sugar is much more intuitive and much easier to read. However, to use some of the advanced techniques shown in this book, it is helpful to know that every operation in R is a function.
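
One place where this knowledge pays off is when an operator is passed as an argument to another function. The following is a minimal sketch using Reduce() and sapply(), two base R functions that have not been covered at this point; the vectors object is only used for illustration:

```r
# Add up a vector by repeatedly applying the addition function
Reduce(`+`, 1:5)
#> [1] 15

# Extract the second element of every vector in a list
vectors <- list(c(10, 20, 30), c(40, 50, 60))
sapply(vectors, `[`, 2)
#> [1] 20 50
```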

Coercion

Finally, we will briefly mention what coercion is in R since it's a topic of confusion for newcomers. When you call a function with an argument of a different type than what was expected, R will try to coerce values so that the function will work, and this can introduce bugs if not handled correctly. R will follow a mechanism similar to what was used when creating vectors.

Strongly typed languages (like Java) will raise exceptions when the object passed to a function is of the wrong type, and will not try to convert the object to a compatible type. However, as we mentioned earlier, R was designed to work out of the box with a lot of unforeseen contexts, so coercion was introduced.

In the following example, we show that if we call our distance() function and pass logical vectors instead of numeric ones, R will coerce the logical vectors into numeric vectors, using TRUE as 1 and FALSE as 0, and proceed with the calculations. To avoid this issue in your own programs, you should coerce data types explicitly with the as.*() functions we mentioned before:

x <- c(1, 2, 3)
y <- c(TRUE, FALSE, TRUE)
distance(x, y)
#> [1] 8
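
As a minimal sketch of handling types explicitly, the example below converts the logical vector with as.numeric() before the calculation. The strict_l2_norm() function is a hypothetical variant that uses stopifnot() and is.numeric() to reject non-numeric input instead of coercing it silently:

```r
y <- c(TRUE, FALSE, TRUE)

is.numeric(y)
#> [1] FALSE
as.numeric(y)   # explicit coercion: TRUE becomes 1, FALSE becomes 0
#> [1] 1 0 1

# Hypothetical stricter variant that refuses non-numeric input
strict_l2_norm <- function(x, y = 0) {
    stopifnot(is.numeric(x), is.numeric(y))
    sum((x - y)^2)
}

strict_l2_norm(c(1, 2, 3), as.numeric(y))
#> [1] 8
```

Calling strict_l2_norm(c(1, 2, 3), y) with the logical vector itself would stop with an error, which makes the type mismatch visible at the call site.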