Using SparkR for EDA and data munging tasks
In this section, we will use Spark SQL and SparkR for preliminary exploration of our Datasets. The examples in this chapter use several publicly available Datasets to illustrate the operations and can be run in the SparkR shell.
The entry point into SparkR is the SparkSession. It connects the R program to a Spark cluster. If you are working in the SparkR shell, the SparkSession is already created for you.
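Outside the shell (for example, in a standalone R script or in RStudio), the session has to be created explicitly. The following is a minimal sketch, not a definitive setup: it assumes that the SPARK_HOME environment variable points to your Spark installation and that you want to run in local mode; the application name SparkR-EDA is only an illustrative placeholder.

```r
# Load the SparkR package that ships with Spark
# (assumes SPARK_HOME is set to the Spark installation directory).
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

# Create a SparkSession, or reuse one that already exists.
# "local[*]" runs Spark on this machine using all available cores;
# the appName is a hypothetical label chosen for this example.
spark <- sparkR.session(master = "local[*]", appName = "SparkR-EDA")

# When you are finished, release the resources with:
# sparkR.session.stop()
```

To connect to a real cluster instead, you would replace the master value with your cluster's master URL.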
Now start the SparkR shell, as shown:
Aurobindos-MacBook-Pro-2:spark-2.2.0-bin-hadoop2.7 aurobindosarkar$ ./bin/sparkR
You can install the required libraries, such as ggplot2, in your SparkR shell, as shown:
> install.packages('ggplot2', dependencies = TRUE)
Reading and writing Spark DataFrames
SparkR supports operating on a variety of data sources through the DataFrames interface. SparkR's DataFrames support a number of methods to read input, perform structured data analysis, and write DataFrames to distributed storage.
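As a rough sketch of that read, analyze, and write cycle, the following snippet round-trips a small DataFrame through Parquet. It assumes an active SparkSession running in local mode (as in the SparkR shell); the use of R's built-in faithful dataset and the temporary output path are choices made for this illustration, not part of the chapter's Datasets.

```r
# Assumes SparkR is loaded and a SparkSession is active (as in the SparkR shell).

# Convert R's built-in 'faithful' data.frame into a Spark DataFrame.
df <- as.DataFrame(faithful)

# Hypothetical output location under the R session's temporary directory.
out <- file.path(tempdir(), "faithful_parquet")

# Write the DataFrame as Parquet, replacing any earlier output at that path.
write.df(df, path = out, source = "parquet", mode = "overwrite")

# Read the Parquet files back into a new Spark DataFrame.
df2 <- read.df(out, source = "parquet")

# Inspect the result: schema, first rows, and row count.
printSchema(df2)
head(df2)
count(df2)
```

The same pattern applies to other built-in sources such as JSON and CSV by changing the source argument; on a cluster, the path would typically point to distributed storage rather than a local temporary directory.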
The read.df method can be used...