R Programming By Example

By: Omar Trejo Navarro

Overview of this book

R is a high-level statistical language and is widely used among statisticians and data miners to develop analytical applications. Often, data analysis people with great analytical skills lack solid programming knowledge and are unfamiliar with the correct ways to use R. Based on the version 3.4, this book will help you develop strong fundamentals when working with R by taking you through a series of full representative examples, giving you a holistic view of R. We begin with the basic installation and configuration of the R environment. As you progress through the exercises, you'll become thoroughly acquainted with R's features and its packages. With this book, you will learn about the basic concepts of R programming, work efficiently with graphs, create publication-ready and interactive 3D graphs, and gain a better understanding of the data at hand. The detailed step-by-step instructions will enable you to get a clean set of data, produce good visualizations, and create reports for the results. It also teaches you various methods to perform code profiling and performance enhancement with good programming practices, delegation, and parallelization. By the end of this book, you will know how to efficiently work with data, create quality visualizations and reports, and develop code that is modular, expressive, and maintainable.

Complex logic with control structures

The final topic we should cover is how to introduce complex logic by using control structures. By complex logic I don't mean logic that is hard to write. I mean code that has multiple possible paths of execution, which in practice is quite simple to implement.

Nearly every operation in R can be written as a function, and these functions can be passed through to other functions to create very complex behavior. However, it isn't always convenient to implement logic that way and using simple control structures may be a better option sometimes.

The control structures we will look at are if... else conditionals, for loops, and while loops. There are also switch conditionals, which are very much like if... else conditionals, but we won't look at them since we won't use them in the examples contained in this book.

If… else conditionals

As their name suggests, if…else conditionals check a condition. If it evaluates to TRUE, one path of execution is taken; if it evaluates to FALSE, a different path is taken. The two paths are mutually exclusive.

To show how if... else conditions work, we will program the same distance() function we used before, but instead of passing it the third argument in the form of a function, we will pass it a string that will be checked to decide which function should be used. This way you can compare different ways of implementing the same functionality. If we pass the l2 string to the norm argument, then the l2_norm() function will be used, but if any other string is passed through, the l1_norm() will be used. Note that we use the double equals operator (==) to check for equality. Don't confuse this with a single equals, which means assignment:

distance <- function(x, y = 0, norm = "l2") {
    if (norm == "l2") {
        return(l2_norm(x, y))
    } else {
        return(l1_norm(x, y))
    }
}

a <- c(1, 2, 3)
b <- c(4, 5, 6)

distance(a, b)
#> [1] 27
distance(a, b, "l2")
#> [1] 27
distance(a, b, "l1")
#> [1] 9
distance(a, b, "l1 will also be used in this case")
#> [1] 9

As can be seen in the last line of the previous example, using conditionals in a non-rigorous manner can introduce bugs: here the l1_norm() function was used even though the norm argument in the last function call did not make any sense at all. To avoid such situations, we may introduce more conditionals to exhaust all valid possibilities and, with the stop() function, throw an error if the else branch is executed, which would mean that no valid option was provided:

distance <- function(x, y = 0, norm = "l2") {
    if (norm == "l2") {
        return(l2_norm(x, y))
    } else if (norm == "l1") {
        return(l1_norm(x, y))
    } else {
        stop("Invalid norm option")
    }
}

distance(a, b, "l1")
#> [1] 9
distance(a, b, "this will produce an error")
#> Error in distance(a, b, "this will produce an error") :
#>   Invalid norm option
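
As an alternative to writing out every valid branch by hand, base R's match.arg() function can validate a string argument against a fixed set of choices. The following is only a minimal sketch and not the approach used in this book: the l1_norm() and l2_norm() definitions are assumptions chosen to reproduce the outputs shown above, and distance_checked() is a hypothetical name:

```r
# Assumed definitions, chosen to match the outputs shown above
l1_norm <- function(x, y = 0) sum(abs(x - y))
l2_norm <- function(x, y = 0) sum((x - y)^2)

# Hypothetical variant of distance() that validates `norm` with match.arg()
distance_checked <- function(x, y = 0, norm = c("l2", "l1")) {
    norm <- match.arg(norm)  # stops with an error if `norm` is not a valid choice
    if (norm == "l2") {
        return(l2_norm(x, y))
    } else {
        return(l1_norm(x, y))
    }
}

a <- c(1, 2, 3)
b <- c(4, 5, 6)

distance_checked(a, b)
#> [1] 27
distance_checked(a, b, "l1")
#> [1] 9

# tryCatch() captures the error so that the rest of the script keeps running
result <- tryCatch(
    distance_checked(a, b, "not a norm"),
    error = function(e) "invalid norm"
)
result
#> [1] "invalid norm"
```

With this design the list of valid options lives in the function signature, so the else branch can only be reached with a valid value.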

Sometimes, there's no need for the else part of the if... else condition. In that case, you can simply avoid putting it in, and R will execute the if branch if the condition is met and will ignore it if it's not.
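
A small sketch of an if without an else branch could look as follows. The clip_at_zero() function is a hypothetical helper introduced only for illustration:

```r
# Hypothetical helper that replaces negative values with zero
clip_at_zero <- function(value) {
    if (value < 0) {
        value <- 0  # only executed when the condition is TRUE
    }
    value
}

clip_at_zero(-3)
#> [1] 0
clip_at_zero(5)
#> [1] 5
```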

There are many different ways to generate the logical values that can be used within the if() check. For example, you could specify an optional argument with a NULL default value and check whether it was omitted from the function call by testing, with the is.null() function, whether the corresponding variable still contains the NULL object. The condition would look something like if(is.null(optional_argument)). Other times you may have a logical vector and want to execute a piece of code if at least one of its values is TRUE, in which case you can use something like if(any(logical_vector)) as the condition. If you require all of the values in the logical vector to be TRUE, you can use something like if(all(logical_vector)) instead. The same approach works with the self-descriptive is.na() and is.nan() functions, which return logical vectors that can be wrapped in any() or all().
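
The checks described in the previous paragraph can be sketched in a single function. The describe() function and its label argument are hypothetical and exist only for this illustration:

```r
# Hypothetical function: `label` is optional and defaults to NULL
describe <- function(values, label = NULL) {
    if (is.null(label)) {
        label <- "values"  # no label was sent in the function call
    }
    if (any(is.na(values))) {
        return(paste(label, "has missing values"))
    }
    if (all(values > 0)) {
        return(paste(label, "are all positive"))
    }
    paste(label, "are not all positive")
}

describe(c(1, 2, 3))
#> [1] "values are all positive"
describe(c(1, NA, 3), label = "x")
#> [1] "x has missing values"
describe(c(-1, 2, 3), label = "x")
#> [1] "x are not all positive"
```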

Another way to generate these logical values is using the comparison operators. These include less than (<), less than or equal to (<=), greater than (>), greater than or equal to (>=), exactly equal (==, which we have already seen), and not equal to (!=). All of these can be used to test numerics as well as characters, in which case alphanumerical order is used. Furthermore, logical values can be combined among themselves to provide more complex conditions. For example, the ! operator negates a logical, meaning that !TRUE is equal to FALSE, and !FALSE is equal to TRUE. Other examples of these types of operators are the OR operator, where the whole expression evaluates to TRUE if any of the logical values is TRUE, and the AND operator, where all logical values must be TRUE for the expression to evaluate to TRUE. You will see these techniques used in the examples we will develop in the rest of the book.
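
As a brief illustration of these operators, consider the following sketch. In R the OR operator is written | (or || for single values) and the AND operator is written & (or && for single values). The values used here are arbitrary:

```r
3 < 5
#> [1] TRUE
"apple" < "banana"  # characters are compared in alphabetical order
#> [1] TRUE
!TRUE
#> [1] FALSE
TRUE | FALSE        # OR: TRUE if any of the values is TRUE
#> [1] TRUE
TRUE & FALSE        # AND: TRUE only if all of the values are TRUE
#> [1] FALSE

# && and || operate on single values, which is what if() expects
x <- 7
message <- "x is something else"
if (x > 0 && x %% 2 != 0) {
    message <- "x is a positive odd number"
}
message
#> [1] "x is a positive odd number"
```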

Finally, note that a vectorized form of the if... else conditional is available through the ifelse() function. In the following code, the first argument is the condition, which uses the modulo operator (%%) to identify which values are even. The second argument is the value returned where the condition is TRUE, indicating that the integer is even, and the third argument is the value returned where it is FALSE, indicating that the integer is odd:

ifelse(c(1, 2, 3, 4, 5, 6) %% 2 == 0, "even", "odd")
#> [1] "odd"  "even" "odd"  "even" "odd"  "even"

For loops

There are two important properties of for loops. First, results are not printed inside a loop unless you explicitly call the print() function. Second, the indexing variable used within a for loop will be changed, in order, after each iteration. Furthermore, to stop iterating you can use the keyword break, and to skip to the next iteration you can use the next command.

For this first example, we create a vector of characters called words, and iterate through each of its elements in order using the for (word in words) syntax. Doing so will take the first element in words, assign it to word, and pass it through the expression in the block delimited by the curly braces, which in this case prints the word to the console, as well as the number of characters in the word. When the iteration is finished, word will be updated with the next word, and the loop will be repeated this way until all words have been used:

words <- c("Hello", "there", "dear", "reader")
for (word in words) {
    print(word)
    print(nchar(word))
}
#> [1] "Hello"
#> [1] 5
#> [1] "there"
#> [1] 5
#> [1] "dear"
#> [1] 4
#> [1] "reader"
#> [1] 6
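
The break and next keywords mentioned at the start of this section can be sketched with the same words vector. The visited variable is a hypothetical addition used only to record which words were actually printed:

```r
words <- c("Hello", "there", "dear", "reader")
visited <- c()
for (word in words) {
    if (word == "there") {
        next   # skip the rest of this iteration
    }
    if (word == "reader") {
        break  # leave the loop entirely
    }
    print(word)
    visited <- c(visited, word)
}
#> [1] "Hello"
#> [1] "dear"
```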

Interesting behavior can be achieved by using nested for loops, which are for loops inside other for loops. The same logic applies: whenever we encounter a for loop, we execute it until completion. It's easier to see the result of such behavior than to explain it, so take a look at the following code:

for (i in 1:5) {
    print(i)
    for (j in 1:3) {
        print(paste("   ", j))
    }
}
#> [1] 1
#> [1] " 1"
#> [1] " 2"
#> [1] " 3"
#> [1] 2
#> [1] " 1"
#> [1] " 2"
#> [1] " 3"
#> [1] 3
#> [1] " 1"
#> [1] " 2"
#> [1] " 3"
#> [1] 4
#> [1] " 1"
#> [1] " 2"
#> [1] " 3"
#> [1] 5
#> [1] " 1"
#> [1] " 2"
#> [1] " 3"

Using such nested for loops is how people perform matrix-like operations when using languages that do not offer vectorized operations. Luckily, we can use the syntax shown in previous sections to perform those operations without having to write nested for loops ourselves, which can be tricky at times.
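
To make the comparison concrete, here is a minimal sketch that doubles every entry of a small matrix, first with nested for loops and then with a single vectorized operation. The matrix m and the other variable names are arbitrary choices for this illustration:

```r
m <- matrix(1:6, nrow = 2)

# Nested for loops: double each entry, one at a time
doubled_loop <- matrix(0, nrow = nrow(m), ncol = ncol(m))
for (i in seq_len(nrow(m))) {
    for (j in seq_len(ncol(m))) {
        doubled_loop[i, j] <- m[i, j] * 2
    }
}

# Vectorized equivalent: a single expression
doubled_vectorized <- m * 2

all(doubled_loop == doubled_vectorized)
#> [1] TRUE
```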

Now, we will see how to use the sapply() and lapply() functions to apply a function to each element of a vector. In this case, we will call the nchar() function on each of the elements in the words vector we created before. The difference between the sapply() and the lapply() functions is that the first one returns a vector when the results can be simplified to one, while the second always returns a list. Finally, note that explicitly using any of these functions is unnecessary here, since, as we have seen before in this chapter, the nchar() function is already vectorized for us:

sapply(words, nchar)
#>  Hello  there   dear reader
#>      5      5      4      6
lapply(words, nchar)
#> [[1]]
#> [1] 5
#>
#> [[2]]
#> [1] 5
#>
#> [[3]]
#> [1] 4
#>
#> [[4]]
#> [1] 6
nchar(words)
#> [1] 5 5 4 6

When you have a function that has not been vectorized, like our distance() function, you can still use it in a vectorized way by making use of the functions we just mentioned. In this case we will apply it to the x list, which contains three different numeric vectors. We will use the lapply() function by passing it the list, followed by the function we want to apply to each of its elements (distance() in this case). Each element of x is passed as the first argument to that function. If the function receives other arguments, you can pass them after the function name, like we do here with the c(1, 1, 1) and "l1" arguments, which will be received by the distance() function as the y and norm arguments, and will remain fixed for all the elements of the x list:

x <- list(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))
lapply(x, distance, c(1, 1, 1), "l1")
#> [[1]]
#> [1] 3
#>
#> [[2]]
#> [1] 12
#>
#> [[3]]
#> [1] 21
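
If a plain numeric vector is more convenient than a list, the same computation can be sketched with sapply() and an anonymous function, which also makes explicit which argument each element fills. The l1_norm() definition below is an assumption chosen to reproduce the outputs shown above:

```r
# Assumed definition, chosen to match the outputs shown above
l1_norm <- function(x, y = 0) sum(abs(x - y))

x <- list(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))

# Each element of x is passed to the anonymous function as `element`
distances <- sapply(x, function(element) l1_norm(element, c(1, 1, 1)))
distances
#> [1]  3 12 21
```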

While loops

Finally, we will take a look at the while loops which use a different way of looping than for loops. In the case of for loops, we know the number of elements in the object we use to iterate, so we know in advance the number of iterations that will be performed. However, there are times where we don't know this number before we start iterating, and instead, we will iterate based on some condition being true after each iteration. That's when while loops are useful.

The way while loops work is that we specify a condition, just as with if…else conditions, and if the condition is met, then we proceed to iterate. When the iteration is finished, we check the condition again, and if it continues to be true, then we iterate again, and so on. Note that in this case if we want to stop at some point, we must modify the elements used in the condition such that it evaluates to FALSE at some point. You can also use break and next inside the while loops.

The following example shows how to print all integers from 1 to 10. Note that if, instead of adding 1 after each iteration, we subtracted 1 or didn't change x at all, the condition would never become FALSE and we would never stop iterating. That's why you need to be very careful when using while loops, since the number of iterations can be infinite:

x <- 1
while (x <= 10) {
    print(x)
    x <- x + 1
}
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
#> [1] 5
#> [1] 6
#> [1] 7
#> [1] 8
#> [1] 9
#> [1] 10
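
The previous example has a number of iterations that is easy to predict. As a sketch of a case where the number of iterations depends on the condition itself, the following code halves a value until it is no longer greater than 1 and counts how many steps that takes. The starting value of 100 and the variable names are arbitrary:

```r
value <- 100
iterations <- 0
while (value > 1) {
    value <- value / 2            # modifies the variable used in the condition
    iterations <- iterations + 1
}
iterations
#> [1] 7
value
#> [1] 0.78125
```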

If you do want to execute an infinite loop, you may use the while loop with a TRUE value in place of the condition. If you do not include a break command, the code will loop indefinitely, repeating itself until stopped with the CTRL + C keyboard command or any other stopping mechanism in the IDE you're using. However, in such cases, it's cleaner to use the repeat construct, as is shown below. It may seem counterintuitive, but there are times when using infinite loops is useful. We will see one such case in Chapter 8, Object-Oriented System to Track Cryptocurrencies, where an external mechanism stops the program based on a condition external to R.

Executing the following example will keep your R session busy until you interrupt it:

# DO NOT EXECUTE THIS, IT'S AN INFINITE LOOP

x <- 1
repeat {
    print(x)
    x <- x + 1
}

#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
#> [1] 5
#> [1] 6
...
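
For comparison, a repeat loop that does terminate needs an explicit break. The following is a minimal sketch in which the stopping value of 3 is an arbitrary choice:

```r
x <- 1
repeat {
    print(x)
    x <- x + 1
    if (x > 3) {
        break  # without this check the loop would never end
    }
}
#> [1] 1
#> [1] 2
#> [1] 3
```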
