R Statistics Cookbook

By: Francisco Juretig

Overview of this book

R is a popular programming language for developing statistical software. This book will be a useful guide to solving common and not-so-common challenges in statistics. With this book, you'll be equipped to confidently perform essential statistical procedures across your organization with the help of cutting-edge statistical tools. You'll start by implementing data modeling, data analysis, and machine learning to solve real-world problems. You'll then understand how to work with nonparametric methods, mixed effects models, and hidden Markov models. This book contains recipes that will guide you in performing univariate and multivariate hypothesis tests, applying several regression techniques, and using robust techniques to minimize the impact of outliers in data. You'll also learn how to use the caret package for performing machine learning in R. Furthermore, this book will help you understand how to interpret charts and plots to get insights for better decision making. By the end of this book, you will be able to apply your skills to statistical computations using R 3.5. You will also become well-versed with a wide array of statistical techniques in R that are extensively used in the data science industry.
Table of Contents (12 chapters)

C++ in R via the Rcpp package

The Rcpp package has become one of the most important packages for R. Essentially, it allows us to integrate C++ code into our R scripts seamlessly. The main advantage is that we can achieve major efficiency gains, especially if our code relies on many loops. A second advantage is that C++ ships with the Standard Template Library (STL), whose containers (for example, vectors and linked lists) are fast and well suited to many programming tasks. Depending on the case, you could expect to see improvements anywhere from 2x to 50x. This is why many of the most widely used packages have been rewritten to leverage Rcpp. In Rcpp, we write code using the typical C++ syntax, but we also have specific containers designed to store R objects. For example, we can use Rcpp::NumericVector or Rcpp::DataFrame, which are Rcpp classes rather than native C++ types.
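
To make the point about STL containers concrete, here is a minimal sketch in plain C++ (no Rcpp involved). The helper names count_below, sample_vector, and sample_list are made up for this illustration; the sketch only shows that the same standard algorithm runs unchanged over a vector and a linked list.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical helper (not part of Rcpp): counts the elements of any STL
// container of doubles that are strictly below `threshold`.
template <typename Container>
std::size_t count_below(const Container& values, double threshold) {
    return static_cast<std::size_t>(
        std::count_if(values.begin(), values.end(),
                      [threshold](double v) { return v < threshold; }));
}

// A small contiguous vector of sample values (illustrative data only).
inline std::vector<double> sample_vector() {
    return {0.5, -1.2, 3.0, 0.1};
}

// The same values stored in a doubly linked list.
inline std::list<double> sample_list() {
    const std::vector<double> v = sample_vector();
    return std::list<double>(v.begin(), v.end());
}
```

Calling count_below(sample_vector(), 0.2) and count_below(sample_list(), 0.2) goes through the same template, so the container can be swapped without touching the counting logic.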

The traditional way of using Rcpp involves writing the code in C++ and sourcing it in R. Since Rcpp 0.8.3, we can also use the Rcpp sugar style, which is a different way of coding in C++. Rcpp sugar brings some elements of the high-level R syntax into C++, allowing us to achieve the same things with a much more concise syntax.

We can use Rcpp in two ways: by defining the functions inline in the R script, or by coding them in a separate file and sourcing it via sourceCpp(). The latter approach is slightly preferable, as it clearly separates the C++ and R code into different files.

Getting ready

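In order to run this code, Rcpp needs to be installed using install.packages("Rcpp"). A working C++ toolchain (for example, Rtools on Windows) is also required, since the C++ code is compiled on our machine.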

How to do it...

In this case, we will compare R against Rcpp (C++) on the following task. We will generate a vector and a matrix. Our function will loop through each element of the vector and, for each one, through every row and column of the matrix, counting the number of instances where the element of the matrix is smaller than the element of the vector.
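
Before moving to the recipe itself, the task can be checked by hand on a tiny input. The sketch below is plain C++ and is not the recipe's code: count_pairs is a hypothetical helper, and the matrix is passed as a flat list of entries because the count does not depend on its shape. It follows the comparison used in the listings of this recipe (v1 < v2, that is, matrix element smaller than vector element).

```cpp
#include <vector>

// Hypothetical helper: counts the (vector element, matrix element) pairs
// in which the matrix element is strictly smaller than the vector element.
long count_pairs(const std::vector<double>& vec,
                 const std::vector<double>& matrix_entries) {
    long counter = 0;
    for (double v : vec) {                 // each element of the vector
        for (double m : matrix_entries) {  // each entry of the matrix
            if (m < v) {
                ++counter;
            }
        }
    }
    return counter;
}
```

With the vector (0, 1) and the matrix entries -1, 0.5, 2, and 3, the element 0 is larger than one entry (-1) and the element 1 is larger than two entries (-1 and 0.5), so count_pairs returns 3.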

  1. Save the following C++ code in a file named rcpp_example.cpp; we will source it from an R script:
#include <Rcpp.h>
using namespace Rcpp;

// Count the (vector element, matrix element) pairs in which the
// matrix element is smaller than the vector element.
// [[Rcpp::export]]
int bring_element(NumericVector rand_vector, NumericMatrix rand_matrix) {
    Rcout << "Process starting" << std::endl;
    int mcounter = 0;
    for (int q = 0; q < rand_vector.size(); q++) {
        for (int x = 0; x < rand_matrix.rows(); x++) {
            for (int y = 0; y < rand_matrix.cols(); y++) {
                double v1 = rand_matrix.at(x, y);
                double v2 = rand_vector[q];
                if (v1 < v2) {
                    mcounter++;
                }
            }
        }
    }
    Rcout << "Process ended" << std::endl;
    return mcounter;
}
  2. In the corresponding R script, we need the following code:
library(Rcpp)
# Compile the C++ file and make bring_element() available in R
sourceCpp("./rcpp_example.cpp")

# Pure R version of the same triple loop
Rfunc <- function(rand__vector, rand_matrix) {
  mcounter <- 0
  for (q in seq_along(rand__vector)) {
    for (x in seq_len(nrow(rand_matrix))) {
      for (y in seq_len(ncol(rand_matrix))) {
        v1 <- rand_matrix[x, y]
        v2 <- rand__vector[q]
        if (v1 < v2) {
          mcounter <- mcounter + 1
        }
      }
    }
  }
  return(mcounter)
}
  3. Generate a vector and a matrix of random Gaussian numbers:
# replicate() returns a 20 x 500 matrix (one column per replication)
some__matrix <- replicate(500, rnorm(20))
some__vector <- rnorm(100)
  4. Record the start and end times for the Rcpp function, and subtract the start time from the end time:
start_time <- Sys.time()
bring_element(some__vector, some__matrix)
end_time <- Sys.time()
print(end_time - start_time)
  5. Do the same as we did in the previous step, but for the R function:
start_time <- Sys.time()
Rfunc(some__vector, some__matrix)
end_time <- Sys.time()
print(end_time - start_time)

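In the run reported here, the C++ function takes 0.10 seconds to complete, whereas the R one takes 0.21 seconds; exact timings will vary from machine to machine. This is to be expected: both versions perform 100 × 20 × 500 = 1,000,000 comparisons, and explicit loops are generally slow in R but extremely fast in C++.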
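
The timings above are taken from the R side, so they also include the cost of calling into C++ and of printing the two messages. If we wanted to time only the loop, one option would be to measure inside the C++ code. The following is a minimal sketch in plain C++, not part of the recipe; elapsed_ms is a hypothetical helper name, and it relies only on the standard <chrono> header.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock
// time in milliseconds, measured with a monotonic clock.
template <typename Work>
double elapsed_ms(Work work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Any callable can be passed in, for example a lambda that wraps the triple loop. A single run is noisy, so repeating the measurement several times gives a more reliable picture.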

How it works...

The sourceCpp() function loads the C++ code from an external file, compiles it, and makes the exported functions available in the R session. In this C++ file, we first include the Rcpp header, which gives our C++ code access to R data types and functionality. We use using namespace Rcpp to tell the compiler that we want to work with that namespace (so we avoid the need to type Rcpp:: when using Rcpp functionality). The // [[Rcpp::export]] comment is an Rcpp attribute rather than a compiler directive: sourceCpp() reads it and generates the wrapper that makes the function callable from R.

Our bring_element function returns an integer (which is why its declaration starts with int). The arguments are a NumericVector and a NumericMatrix, which are not native C++ types but Rcpp classes. They wrap R numeric vectors and matrices, whose elements are double-precision values, so we can pass R objects in directly without converting them by hand. Rcout is an output stream that lets us print from C++ to the R console. We then loop through the elements of the vector and through the rows and columns of the matrix, using standard C++ syntax. What is not standard here is the way we get the number of rows and columns (using .rows() and .cols()), since these are methods provided by Rcpp. The R code is quite straightforward, and we time both versions using Sys.time().
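
To give some intuition for what a numeric matrix class exposes, here is a hedged sketch of a stand-in written in plain C++. ColMajorMatrix is a hypothetical class invented for this illustration and is not the Rcpp implementation; the one fact it borrows from R is that matrices are stored column by column, so element (i, j) sits at offset i + nrow * j in a flat buffer of doubles.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a numeric matrix; NOT the Rcpp class.
// Entries are stored column by column, as R stores its matrices.
class ColMajorMatrix {
public:
    ColMajorMatrix(std::size_t nrow, std::size_t ncol)
        : nrow_(nrow), ncol_(ncol), data_(nrow * ncol, 0.0) {}

    std::size_t rows() const { return nrow_; }
    std::size_t cols() const { return ncol_; }

    // Element (i, j) lives at offset i + nrow * j in the flat buffer.
    double& at(std::size_t i, std::size_t j) { return data_[i + nrow_ * j]; }
    double at(std::size_t i, std::size_t j) const {
        return data_[i + nrow_ * j];
    }

    // Read-only access to the underlying column-major storage.
    const std::vector<double>& data() const { return data_; }

private:
    std::size_t nrow_;
    std::size_t ncol_;
    std::vector<double> data_;
};
```

Because the storage is column-major, walking down a column touches neighbouring memory, which is one reason loop order can matter for performance on large matrices.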

See also

The homepage for Rcpp is hosted at http://www.rcpp.org.

Rcpp integrates closely with the Armadillo library (used for very high-performance linear algebra) through the RcppArmadillo package. This allows us to use this excellent library in our Rcpp projects.

The RInside package (http://dirk.eddelbuettel.com/code/rinside.html), built by the creators of Rcpp, allows us to embed R code easily inside our C++ projects. These projects are compiled as ordinary C++ executables that embed an R interpreter, so R still has to be available on the machine where they are built and run. The implications of the RInside package are significant, since we can code highly performant statistical applications in C++ and ship them as executable programs.