R High Performance Programming

Preface

In a world where data is becoming increasingly important, business people and scientists need tools to analyze and process large volumes of data efficiently. R is one of the tools that have become increasingly popular in recent years for data processing, statistical analysis, and data science. While R has its roots in academia, it is now used by organizations across a wide range of industries and geographical areas.

But the design of R imposes some inherent limits on the size of the data and the complexity of computations that it can manage efficiently. This can be a huge obstacle for R users who need to process the ever-growing volume of data in their organizations.

This book, R High Performance Programming, will help you understand the situations that often pose performance difficulties in R, such as memory and computational limits. It will also show you a range of techniques to overcome these performance limits. You can choose to use these techniques alone, or in various combinations that best fit your needs and your computing environment.

This book is designed to be a practical guide on how to improve the performance of R programs, with just enough explanation of why, so that you understand the reasoning behind each solution. As such, we will provide code examples for every technique that we cover in this book, along with performance profiling results that we generated on our machines to demonstrate the performance improvements. We encourage you to follow along by entering and running the code in your own environment to see the performance improvements for yourself.

If you would like to understand how R is designed and why it has performance limitations, the R Internals documentation (http://cran.r-project.org/doc/manuals/r-release/R-ints.html) will provide helpful clues.

This book is based on open source R because it is the most widely used version of R and is freely available to anybody. If you are using a commercial version of R, check with your software vendor to see what performance improvements they might have made available to you.

The R community has created many new packages to improve the performance of R, which are available on the Comprehensive R Archive Network (CRAN) (http://cran.r-project.org/). We cannot analyze every package on CRAN—there are thousands of them—to see if they provide performance enhancements for specific operations. Instead, this book focuses on the most common tasks for R programmers and introduces techniques that you can use on any R project.

What this book covers

Chapter 1, Understanding R's Performance – Why Are R Programs Sometimes Slow?, kicks off our journey by taking a peek under R's hood to explore the various ways in which R programs can hit performance limits. We will look at how R's design sometimes creates performance bottlenecks in R programs in terms of computation (CPU), memory (RAM), and disk input/output (I/O).

Chapter 2, Profiling – Measuring Code's Performance, introduces a few techniques that we will use throughout the book to measure the performance of R code, so that we can understand the nature of our performance problems.

Chapter 3, Simple Tweaks to Make R Run Faster, describes how to improve the computational speed of R code. These are basic techniques that you can use in any R program.

Chapter 4, Using Compiled Code for Greater Speed, explores the use of compiled code in another programming language such as C to maximize the performance of our computations. We will see how compiled code can perform faster than R, and look at how to integrate compiled code into our R programs.

Chapter 5, Using GPUs to Run R Even Faster, brings us to the realm of modern accelerators by leveraging Graphics Processing Units (GPUs) to run complex computations at high speed.

Chapter 6, Simple Tweaks to Use Less RAM, describes the basic techniques to manage and optimize RAM utilization of your R programs to allow you to process larger datasets.

Chapter 7, Processing Large Datasets with Limited RAM, explains how to process datasets that are larger than the available RAM using memory-efficient data structures and disk resident data formats.

Chapter 8, Multiplying Performance with Parallel Computing, introduces parallelism in R. We will explore how to run code in parallel in R on a single machine and on multiple machines. We will also look at the factors that need to be considered in the design of our parallel code.

Chapter 9, Offloading Data Processing to Database Systems, describes how certain computations can be offloaded to an external database system. This is useful for minimizing the movement of large datasets in and out of the database, especially when you already have access to a powerful database system whose computational power and speed you can leverage.

Chapter 10, R and Big Data, concludes the book by exploring the use of Big Data technologies to take R's performance to the limit.

If you are in a hurry, we recommend that you read the following chapters first, then supplement your reading with other chapters that are relevant for your situation:

  • Chapter 1, Understanding R's Performance – Why Are R Programs Sometimes Slow?

  • Chapter 2, Profiling – Measuring Code's Performance

  • Chapter 3, Simple Tweaks to Make R Run Faster

  • Chapter 6, Simple Tweaks to Use Less RAM

What you need for this book

All the code in this book was developed in R 3.1.1 64-bit on Mac OS X 10.9. Wherever possible, it has also been tested on Ubuntu desktop 14.04 LTS and Windows 8.1. All code examples can be downloaded from https://github.com/r-high-performance-programming/rhpp-2015.

To follow along with the code examples, we recommend that you install R 3.1.1 64-bit or a later version in your environment.

We also recommend that you run R in a Unix environment (this includes Linux and Mac OS X). While R runs on Windows, some packages that we will use, such as bigmemory, run only in a Unix environment. Whenever there are differences between Unix and Windows in our code examples, we will indicate them.

You will need the 64-bit version of R, as certain operations (for example, creating a vector with 2^31 or more elements) are not possible in the 32-bit version. Also, the 64-bit version of R can make use of as much memory as is available on your system, whereas the 32-bit version is limited to no more than 4 GB of memory (on some operating systems, the limit can be as low as 2 GB).
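As a quick check before you start, the following minimal sketch reports whether the current R session is a 64-bit build and whether it is at least version 3.1.1. It uses only base R; the variable names are ours, chosen for illustration.

```r
# A 64-bit build of R uses 8-byte pointers; a 32-bit build uses 4-byte pointers.
is_64bit <- .Machine$sizeof.pointer == 8

# getRversion() returns a version object that can be compared with a string.
is_recent <- getRversion() >= "3.1.1"

# The largest value an R integer can hold is 2^31 - 1. Vectors with more
# elements than this are "long vectors", which need a 64-bit build of R.
max_int <- .Machine$integer.max

cat("64-bit R:", is_64bit, "\n")
cat("R >= 3.1.1:", is_recent, "\n")
cat("Largest integer:", max_int, "\n")
```

If the first line reports FALSE, you are running a 32-bit build and some examples in later chapters may not run.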

You will also need to install packages in your R environment, as the examples in several chapters will depend on additional packages.
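One way to manage this is to install only the packages that are missing. The sketch below is an illustration, not part of the book's code: missing_packages() and install_if_missing() are hypothetical helper names, and the package named in the usage comment is just the example mentioned earlier in this section.

```r
# Return the names in `pkgs` that are not installed in the current library paths.
missing_packages <- function(pkgs) {
    installed <- rownames(installed.packages())
    setdiff(pkgs, installed)
}

# Install whichever of `pkgs` are missing from CRAN (requires network access).
# Returns the names of the packages it attempted to install, invisibly.
install_if_missing <- function(pkgs) {
    to_install <- missing_packages(pkgs)
    if (length(to_install) > 0) {
        install.packages(to_install)
    }
    invisible(to_install)
}

# Example usage (uncomment to run):
# install_if_missing(c("bigmemory"))
```

Keeping the check separate from the installation makes it easy to see what would be installed before anything is downloaded.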

The examples in some chapters require other software or packages to run. These will be listed in the respective chapters along with installation instructions.

If you do not have access to some of the software and tools required for the examples, you can run them on Amazon Web Services (AWS). In particular, the examples in Chapter 5, Using GPUs to Run R Even Faster, require a computer with an NVIDIA GPU with CUDA capabilities; those in Chapter 9, Offloading Data Processing to Database Systems, require various database systems; and those in Chapter 10, R and Big Data, require Hadoop.

To use AWS, log in to http://aws.amazon.com/ with your Amazon account. Create an account if you do not have one. Creating an account is free, but there are charges for using servers, storage, and other resources. Consult the AWS website for the latest prices in your preferred region.

AWS services are provided in different regions around the world. At the time of writing this book, there are eight regions—three in the United States, one in Europe, three in the Asia Pacific, and one in South America. Pick any region you like, such as the one closest to where you are or the one with the lowest prices. To select a region, go to AWS Console (http://console.aws.amazon.com) and select the region in the upper-right corner. Once you have selected a region, use the same region for all the AWS resources you need for the examples in this book.

Before setting up any compute resource, such as a server or Hadoop cluster, you need a key pair to log in to the server. If you do not already have an AWS Elastic Compute Cloud (EC2) key pair, follow these steps to generate one:

  1. Go to AWS Console and click on EC2.

  2. Click on Key Pairs in the menu on the left.

  3. Click on Create Key Pair.

  4. Enter a name for the new key pair (for example, mykey).

  5. Once you click on Create, the private key (for example, mykey.pem) will be downloaded to your computer.

On Linux and Mac OS X, change the permissions of the private key file so that only the owner has read access; this can be done by running chmod 400 mykey.pem in a Terminal window.
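If you prefer to stay inside R, base R's Sys.chmod() can make the same change. The following is a minimal sketch for Unix-like systems; it uses a temporary file as a stand-in for the key so that it runs anywhere, and you would substitute the actual path to your private key.

```r
# A temporary file stands in for the downloaded private key (mykey.pem in the
# text above). Replace key_file with the real path on your machine.
key_file <- tempfile(fileext = ".pem")
file.create(key_file)

# Mode "0400" gives the owner read access and gives nobody else any access.
# use_umask = FALSE applies the mode exactly as written.
changed <- Sys.chmod(key_file, mode = "0400", use_umask = FALSE)

# Read the mode back as an octal string to confirm the change.
key_mode <- format(file.info(key_file)$mode)
print(key_mode)

# Remove the stand-in file.
unlink(key_file)
```

On Windows, file permissions work differently, so this call does not have the same effect there.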

Who this book is for

If you are already an R programmer and you want to find ways to improve the efficiency of your code, then this book is for you. While you need to be familiar with and comfortable using R, you do not need deep expertise in the language. The skills that you need to benefit from this book are:

  • Installing, upgrading, and running R on your computer

  • Installing and upgrading CRAN packages within your R environment

  • Creating and manipulating basic data structures like vectors, matrices, lists, and data frames

  • Using and converting between different R data types

  • Performing arithmetic, logical, and other basic R operations

  • Using R control statements such as if, for, while, and repeat

  • Writing R functions

  • Plotting charts using R Graphics

If you are new to R and want to learn how to write R programs, there are many books, online courses, tutorials, and other resources available. Just search for them using your favorite search engine.

Conventions

In this book, you will find a number of styles of text that distinguish among different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "To compile the function, we will use the cmpfun() function in the compiler package."

A block of code is set as follows:

fibonacci_rec <- function(n) {
    if (n <= 1) {
        return(n)
    }
    return(fibonacci_rec(n - 1) + fibonacci_rec(n - 2))
}

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Be sure to select the Package authoring installation and Edit the system PATH options in the installation wizard."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to , and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at if you are having a problem with any aspect of the book, and we will do our best to address it.