
Simulation for Data Science with R

By: Matthias Templ

Overview of this book

Simulation for Data Science with R aims to teach you how to begin performing data science tasks by taking advantage of R's powerful ecosystem of packages. R is the most widely used programming language for data science, and combined with simulation it is a powerful tool for tackling the complexities of varied real-world data sets. The book provides a computational and methodological framework for statistical simulation. You will get to grips with the software environment R, and after learning the background of popular methods in the area of computational statistics, you will see applications of these methods in R to deepen your understanding and to gain experience working with real-world data and real-world problems. The book helps uncover large-scale patterns in complex systems where interdependencies and variation are critical. An effective simulation is driven by data generating processes that accurately reflect real physical populations. You will also learn how to plan and structure a simulation project to support decision-making and the presentation of results.

Simulation and big data


Big unstructured data is nowadays often analyzed directly, or used as auxiliary information. Running simulations on big data is a challenge.

Big data can be too large to fit in the memory of a desktop computer. Whenever this happens, there are basically three options to choose from. The first option is simply to use a more powerful server with more memory for your computations. Second, the data can be stored efficiently in a database and we connect to the database for analysis. Typically, only a subset of the data is of interest, so we can grab the interesting parts from the database, import them into R, do the analysis, and export the results back to the database. Since R has excellent features and APIs for connecting to all well-known databases, this is the recommended approach. Third, aggregate and subset your data first; you most likely do not need fully detailed information for your analysis. To give an example, imagine you want to analyze road traffic data. The measurement units on a highway usually report the speed, the distance to the next car, the lane used, and the kind of vehicle, so the resulting data are really huge. However, for an analysis of measurement faults, it is enough to aggregate the data and analyze 1- or 5-minute interval data.
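The following minimal sketch illustrates the database approach with the DBI package; the SQLite file traffic.db, the table measurements, and its columns are hypothetical, and the timestamps are assumed to be integer Unix times:

```r
library(DBI)

# Connect to a (hypothetical) SQLite database holding the raw traffic data
con <- dbConnect(RSQLite::SQLite(), "traffic.db")

# Aggregate to 5-minute intervals inside the database and import only the result
speed_5min <- dbGetQuery(con, "
  SELECT station,
         (timestamp / 300) * 300 AS interval_start,
         AVG(speed)              AS mean_speed,
         COUNT(*)                AS n
  FROM measurements
  GROUP BY station, interval_start
")

# ... analysis in R on the much smaller aggregated data ...

# Write results back to the database and close the connection
dbWriteTable(con, "speed_5min", speed_5min, overwrite = TRUE)
dbDisconnect(con)
```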

As soon as the data sets become large, resampling methods such as the Bootstrap might cause long computation times. The usual approach is to repeatedly sample the data, estimate the statistic of interest, and save the results. Whenever simulations may have to be rerun under different settings, one can change this approach. For a method such as the Bootstrap, an additional vector can be stored for each Bootstrap sample containing information on how often each unit is included in that sample. Thus, instead of storing the Bootstrap sample itself, we store a vector of counts (0, 1, 2, ...) that expresses whether, and how often, a unit is included in a Bootstrap sample. With this approach, Bootstrap samples only have to be selected once. This is especially useful when the selection of Bootstrap samples is of a more complex nature because of special sampling designs, and the simulations have to be re-run later.
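A minimal sketch of this idea in base R, with a hypothetical sample size n, number of replicates R, and a simple weighted-mean estimator:

```r
n <- 1000   # hypothetical number of observations
R <- 200    # hypothetical number of Bootstrap replicates

# Each column holds, for one replicate, how often every unit is included
counts <- replicate(R, tabulate(sample(n, replace = TRUE), nbins = n))

# Any estimator can later be evaluated from these counts, e.g. a weighted mean
x <- rnorm(n)
boot_means <- apply(counts, 2, function(w) weighted.mean(x, w))
```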

If estimations are done repeatedly, as in the Monte Carlo approach, it is very important to have fast implementations of the estimators available in the software. It is recommended to run everything vectorized, meaning that a function call applied to a vector or any other data structure operates directly on all of its values/elements. It is then crucial that the underlying loop is implemented in a compiled language such as C or C++, as is the case for the more elementary functions of base R. R provides a powerful interface to foreign languages, and using compiled code is often much more efficient than using the interpreted, non-compiled R language directly. Appropriately implemented apply-like functions over vectors, matrices, or data frames can even be parallelized very simply. The trick is then just to replace, for example, an lapply call with mclapply from the R package parallel (R Core Team 2015).
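For example, a serial lapply call can be swapped for mclapply with essentially no other changes; in this sketch the estimator and the number of cores are hypothetical:

```r
library(parallel)

# Hypothetical, deliberately slow per-replicate estimator
estimate_once <- function(i) {
  x <- rnorm(1e5)
  median(x)
}

# Serial version
res_serial <- lapply(1:100, estimate_once)

# Parallel version: the same call, with mclapply instead of lapply
# (mc.cores > 1 relies on forking and therefore on Linux / OS X)
res_parallel <- mclapply(1:100, estimate_once, mc.cores = 2)
```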

Resampling methods can easily be run as parallel processes. Monte Carlo techniques were originally designed for machines with a single processor. With today's high-performance computing possibilities, the calculations can often be distributed over many processors running in parallel (Kroese et al., 2014), and Monte Carlo techniques perform efficiently in this parallel processing framework. In R, for example, only very few modifications are needed to run a Monte Carlo approach in parallel. However, when data sets get large, parallel computing might only be faster than single-core computing on Linux or OS X, since both operating systems support forking, while Microsoft Windows does not support this to the same extent.
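On Windows, a socket cluster can be used instead of forking; a minimal sketch with parLapply from the parallel package, where the per-replicate function and the number of workers are hypothetical:

```r
library(parallel)

# Hypothetical per-replicate Monte Carlo estimator
one_replicate <- function(i) mean(rnorm(1e5))

# Socket cluster: also works on Windows, since it does not rely on forking
cl <- makeCluster(4)
res <- parLapply(cl, 1:1000, one_replicate)
stopCluster(cl)
```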

A related issue is the use of effective random number generation techniques for parallel computing. It must be ensured that the random number streams used by separate workers are independent; otherwise the workers may return identical results. To facilitate reproducible research, it is thus necessary to provide different initializations of the random number generator to all workers (see also Schmidberger et al., 2009). The package rlecuyer (Sevcikova and Rossini 2015), for example, supports this.
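Base R's parallel package ships the related L'Ecuyer-CMRG generator for this purpose; a minimal sketch, with an arbitrary seed and number of workers:

```r
library(parallel)

# Independent, reproducible random-number streams per worker
RNGkind("L'Ecuyer-CMRG")
set.seed(2024)  # arbitrary seed for reproducibility

# With forking (Linux / OS X), mclapply assigns each worker its own stream
res_fork <- mclapply(1:4, function(i) rnorm(3), mc.cores = 2, mc.set.seed = TRUE)

# With a socket cluster (also on Windows), clusterSetRNGStream
# distributes independent streams to all workers
cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 2024)
res_sock <- parLapply(cl, 1:4, function(i) rnorm(3))
stopCluster(cl)
```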