Book Image

Mastering Parallel Programming with R

By : Simon R. Chapple, Terence Sloan, Thorsten Forster, Eilidh Troup
Book Image

Mastering Parallel Programming with R

By: Simon R. Chapple, Terence Sloan, Thorsten Forster, Eilidh Troup

Overview of this book

R is one of the most popular programming languages used in data science. Applying R to big data and complex analytic tasks requires the harnessing of scalable compute resources. Mastering Parallel Programming with R presents a comprehensive and practical treatise on how to build highly scalable and efficient algorithms in R. It will teach you a variety of parallelization techniques, from simple use of R’s built-in parallel package versions of lapply(), to high-level AWS cloud-based Hadoop and Apache Spark frameworks. It will also teach you low level scalable parallel programming using RMPI and pbdMPI for message passing, applicable to clusters and supercomputers, and how to exploit thousand-fold simple processor GPUs through ROpenCL. By the end of the book, you will understand the factors that influence parallel efficiency, including assessing code performance and implementing load balancing; pitfalls to avoid, including deadlock and numerical instability issues; how to structure your code and data for the most appropriate type of parallelism for your problem domain; and how to extract the maximum performance from your R code running on a variety of computer systems.
Table of Contents (13 chapters)

Preface

We are in the midst of an information explosion. Everything in our lives is becoming instrumented and connected in real-time with the Internet of Things, from our own biology to the world's environment. By some measures, it is projected that by 2020, world data will have grown by more than a factor of 10 from today to a staggering 44 Zettabytes—just one Zettabyte is the equivalent of 250 billion DVDs. In order to process this volume and velocity of big data, we need to harness a vast amount of compute, memory, and disk resources, and to do this, we need parallelism.

Despite its age, R—the open source statistical programming language, continues to grow in popularity as one of the key cornerstone technologies to analyze data, and is used by an ever-expanding community of, dare I say the currently in-vogue designation of, "data scientists".

There are of course many other tools that a data scientist may deploy in taming the beast of big data. You may also be a Python, SAS, SPSS, or MATLAB guru. However, R, with its long open source heritage since 1997, remains pervasive, and with the extraordinarily wide variety of additional CRAN-hosted plug-in library packages that were developed over the intervening 20 years, it is highly capable of almost all forms of data analysis, from small numeric matrices to very large symbolic datasets, such as bio-molecular DNA. Indeed, I am tempted to go as far as to suggest that R is becoming the de facto data science scripting language, which is capable of orchestrating highly complex analytics pipelines that involve many different types of data.

R, in itself, has always been a single-threaded implementation, and it is not designed to exploit parallelism within its own language primitives. Instead, it relies on specifically implemented external package libraries to achieve this for certain accelerated functions and to enable the use of parallel processing frameworks. We will focus on a select number of these that represent the best implementations that are available today to develop parallel algorithms across a range of technologies.

In this book, we will cover many different aspects of parallelism, from Single Program Multiple Data (SPMD) to Single Instruction Multiple Data (SIMD) vector processing, including utilizing R's built-in multicore capabilities with its parallel package, message passing using the Message Passing Interface (MPI) standard, and General Purpose GPU (GPGPU)-based parallelism with OpenCL. We will also explore different framework approaches to parallelism, from load balancing through task farming to spatial processing with grids. We will touch on more general purpose batch-data processing in the cloud with Hadoop and (as a bonus) the hot new tech in cluster computing, Apache Spark, which is much better suited to real-time data processing at scale.

We will even explore how to use a real bona fide multi-million pound supercomputer. Yes, I know that you may not own one of these, but in this book, we'll show you what its like to use one and how much performance parallelism can achieve. Who knows, with your new found knowledge, maybe you can rock up at your local Supercomputer Center and convince them to let you spin up some massively parallel computing!

All of the coding examples that are presented in this book are original work and have been chosen partly so as not to duplicate the kind of example you might otherwise encounter in other books of this nature. They are also chosen to hopefully engage you, dear reader, with something a little bit different to the run-of-the-mill. We, the authors, very much hope you enjoy the journey that you are about to undertake through Mastering Parallel Programming in R.

What this book covers

Chapter 1, Simple Parallelism with R, starts our journey by quickly showing you how to exploit the multicore processing capability of your own laptop using core R's parallelized versions of lapply(). We also briefly reach out and touch the immense computing capacity of the cloud through Amazon Web Services.

Chapter 2, Introduction to Message Passing, covers the standard Message Passing Interface (MPI), which is a key technology that implements advanced parallel algorithms. In this chapter, you will learn how to use two different R MPI packages, Rmpi and pbdMPI, together with the OpenMPI implementation of the underlying communications subsystem.

Chapter 3, Advanced Message Passing, will complete our tour of MPI by developing a detailed Rmpi worked example, illustrating the use of nonblocking communications and localized patterns of interprocess message exchange, which is required to implement spatial Grid parallelism.

Chapter 4, Developing SPRINT, an MPI-based R Package for Supercomputers, introduces you to the experience of running parallel code on a real supercomputer. This chapter also provides a detailed exposition of developing SPRINT, an R package written in C for parallel computation that can run on laptops, as well as supercomputers. We'll also show you how you can extend this package with your own natively-coded high performance parallel algorithms and make them accessible to R.

Chapter 5, The Supercomputer in Your Laptop, will show how to unlock the massive parallel and vector processing capability of the Graphics Processing Unit (GPU) inside your very own laptop direct from R using the ROpenCL package, an R wrapper for the Open Computing Language (OpenCL).

Chapter 6, The Art of Parallel Programming, concludes this book by providing the basic science behind parallel programming and its performance, the art of best practice by highlighting a number of potential pitfalls you'll want to avoid, and taking a glimpse into the future of parallel computing systems.

Online Chapter, Apache Spa-R-k, is an introduction to Apache Spark, which now succeeds Hadoop as the most popular distributed memory big data parallel computing environment. You will learn how to setup and install a Spark cluster and how to utilize Spark's own DataFrame abstraction direct from R. This chapter can be downloaded from Packt's website at https://www.packtpub.com/sites/default/files/downloads/B03974_BonusChapter.pdf

You don't need to read this book in order from beginning to end, although you will find this easiest with respect to the introduction of concepts, and the increasing technical depth of programming knowledge applied. For the most part, each chapter has been written to be understandable when read on it's own.

What you need for this book

To run the code in this book, you will require a multicore modern specification laptop or desktop computer. You will also require a decent bandwidth Internet connection to download R and the various R code libraries from CRAN, the main online repository for R packages.

The examples in this book have largely been developed using RStudio version 0.98.1062, with the 64-bit R version 3.1.0 (CRAN distribution), running on a mid-2014 generation Apple MacBook Pro OS X 10.9.4, with a 2.6 GHz Intel Core i5 processor and 16 GB of memory. However, all of these examples should also work with the latest version of R.

Some of the examples in this book will not be able to run with Microsoft Windows, but they should run without problem on variants of Linux. Each chapter will detail any required additional external libraries or runtime system requirements, and provide you with information on how to access and install them. This book's errata section will highlight any issues discovered post publication.

Who this book is for

This book is for the intermediate to advanced-level R developer who wants to understand how to harness the power of parallel computing to perform long running computations and analyze large quantities of data. You will require a reasonable knowledge and understanding of R programming. You should be a sufficiently capable programmer so that you can read and understand lower-level languages, such as C/C++, and be familiar with the process of code compilation. You may consider yourself to be the new breed of data scientist—a skilled programmer as well as a mathematician.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "You'll note the use of mpi.cart.create(), which constructs a Cartesian rank/grid mapping from a group of existing MPI processes."

A block of code is set as follows:

Worker_makeSquareGrid <- function(comm,dim) {
  grid <- 1000 + dim    # assign comm handle for this size grid
  dims <- c(dim,dim)    # dimensions are 2D, size: dim X dim
  periods <- c(FALSE,FALSE)  # no wraparound at outermost edges 
  if (mpi.cart.create(commold=comm,dims,periods,commcart=grid))  
  {
    return(grid)
  }
  return(-1) # An MPI error occurred
}

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

# Namespace file for sprint 

useDynLib(sprint)

export(phello)
export(ptest)
export(pcor)

Any command-line input or output is written as follows:

$ mpicc -o mpihello.o mpihello.c 
$ mpiexec -n 4 ./mpihello.o

New terms and important words are shown in bold.

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail , and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

  1. Log in or register to our website using your e-mail address and password.

  2. Hover the mouse pointer on the SUPPORT tab at the top.

  3. Click on Code Downloads & Errata.

  4. Enter the name of the book in the Search box.

  5. Select the book for which you're looking to download the code files.

  6. Choose from the drop-down menu where you purchased this book from.

  7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows

  • Zipeg / iZip / UnRarX for Mac

  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/repository-name. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/MasteringParallelProgrammingwithR_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at , and we will do our best to address the problem.