R High Performance Programming

Use of hash tables for frequent lookups on large data

A common task in data analysis is data lookup, which is often implemented in R with a named list. For example, to look up customers' ages, we can define a list, say cust_age, whose values are the customer ages and whose names are the corresponding customer names (or IDs), that is, names(cust_age) <- cust_name. To look up John Doe's age, we then call cust_age[["John_Doe"]]. However, lists in R are not optimized for lookup by name: a lookup on a list of N elements takes O(N) time, because the names are searched in order. Elements stored later in the list therefore take longer to find, and the effect grows with N. When a program performs frequent lookups, the cumulative cost can be significant.

A hash table is an alternative to a list that is better suited to lookups. In R, one is available from the CRAN package hash. A hash table lookup takes O(1) time on average, so its cost does not depend on the number of elements stored or on the position of the key.
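To make the contrast concrete, here is a minimal sketch that is not taken from the book. It assumes made-up customer data (the cust_ names, the random ages, and John Doe's age of 35 are all hypothetical) and uses a base R environment as a stand-in hash table instead of the hash package, so that it runs without installing anything. Timings vary by machine and R version, so none are shown.

```r
# Hypothetical sample data: 10,000 made-up customers, with John Doe stored last.
set.seed(42)
n <- 10000
cust_name <- c(paste0("cust_", seq_len(n)), "John_Doe")
ages <- c(sample(18:90, n, replace = TRUE), 35L)

# Named list, as described in the text: lookup by name searches the names in order.
cust_age <- as.list(ages)
names(cust_age) <- cust_name

# Base R environment used as a hash table (a stand-in for the hash package).
cust_age_env <- new.env(hash = TRUE, size = length(cust_name))
for (i in seq_along(cust_name)) {
  assign(cust_name[i], ages[i], envir = cust_age_env)
}

# Both structures return the same value for the same key.
age_from_list <- cust_age[["John_Doe"]]
age_from_env <- cust_age_env[["John_Doe"]]

# Time 10,000 repeated lookups of the last key in each structure.
list_time <- system.time(for (i in 1:10000) cust_age[["John_Doe"]])
env_time <- system.time(for (i in 1:10000) cust_age_env[["John_Doe"]])
print(list_time)
print(env_time)
```

John Doe is deliberately placed last so that the list lookup hits its worst case. With the hash package installed, a comparable table can be built with hash(keys = cust_name, values = ages) and queried with the same [[ ]] syntax.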

The next code...