Variants on lapply()


And finally, to end our tour of MPI, we come almost full circle. Just as R's core parallel package (Chapter 1, Simple Parallelism with R) provides versions of lapply() that make it very simple to run a function in parallel, Rmpi and pbdMPI also provide their own lapply() variants.
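pbdMPI's variant, for example, is pbdLapply(), which runs in the SPMD style as a batch script launched through mpiexec rather than from an interactive session. Here is a minimal sketch, assuming a master/worker distribution of the input; the vector of 100 items and the script name are purely illustrative:

library(pbdMPI)
init()                                  # initialize the MPI communicator

# In "mw" (master/worker) mode, rank 0 distributes the input list
# across the ranks and gathers the individual results back
result <- pbdLapply(1:100, function(x) x * x, pbd.mode = "mw")
comm.print(length(result))              # print from rank 0 only

finalize()                              # shut down MPI cleanly

Saved as, say, pbd_lapply.R, this would be launched with mpiexec -np 4 Rscript pbd_lapply.R.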

parLapply() with Rmpi

Here we revisit the basic operation of parLapply() (Chapter 1, Simple Parallelism with R), this time in conjunction with MPI. We hinted back then that an MPI cluster can be used with parLapply(), and this can indeed be done by introducing an additional package called snow, short for Simple Network of Workstations. All we need to do is install the snow package from CRAN, load the libraries in the correct order, and create the cluster using Rmpi, as follows (note that pbdMPI is not compatible with parLapply()):

> library("snow")
> library("Rmpi")
> library("parallel")
Attaching package: 'parallel'

The following objects are masked from 'package:snow': ...
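Because parallel is attached last, its makeCluster() is the one we call; with type = "MPI" it delegates cluster creation to snow, which in turn spawns the worker processes through Rmpi. A minimal sketch of creating the cluster and applying a function over it (the worker count of 4 and the toy workload are illustrative):

> cl <- makeCluster(4, type = "MPI")     # spawn 4 MPI worker processes via Rmpi
> parLapply(cl, 1:8, function(x) x * x)  # evaluate the function across the workers
> stopCluster(cl)                        # shut down the workers
> mpi.quit()                             # terminate the MPI runtime and exit R

Note that Rmpi recommends ending the session with mpi.quit() rather than the usual q(), so that the MPI execution environment is terminated cleanly.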