R is a language and environment for statistical computing and graphics. SparkR is an R package that provides a lightweight frontend for using Apache Spark from R. The goal of SparkR is to combine the flexibility and ease of use of the R environment with the scalability and fault tolerance of the Spark compute engine. Let us recap the Spark architecture before discussing how SparkR realizes this goal.
Apache Spark is a fast, general-purpose, fault-tolerant framework for interactive and iterative computations on large, distributed datasets. It supports a wide variety of data sources and storage layers. It provides unified data access, letting you combine different data formats and streaming data, and define complex operations using high-level, composable operators. You can develop applications interactively in the Scala, Python, or R shell (or in Java, which has no shell). You can deploy it on your home desktop, or you can run it on large clusters of thousands of...