The Apache Spark 1.4 release added SparkR, an R package built on top of Spark that allows data analysts and data scientists to analyze large datasets and run jobs interactively on Spark using the R language.
R is one of the most popular open source statistical programming languages, with a huge number (over 7,000) of community-supported packages for statistical analysis, machine learning, and data visualization. Interactive analytics in R has two main limits: the R runtime is single-threaded, and R can only process datasets that fit in a single computer's memory. SparkR, an R package developed at the AMPLab of the University of California, Berkeley, brings R to Spark's distributed computation engine, which enables us to run large-scale data analytics interactively from R.
This chapter is divided into the following topics:
Introducing R and SparkR
Getting started with SparkR
Using DataFrames with...