One of the first things I do upon creating a new data object is to run summary statistics. SparkR provides its own version of the R summary function, known as describe(). You can also use summary() itself; however, if you do this instead of using describe(), I would preface it with SparkR:: in order to specify which version of summary you are using:
head(SparkR::summary(out_sd))
The output appears in a slightly different format than if you ran summary() on a native R dataframe, but it contains the basic measures that you are looking for: count, mean, stddev, min, and max:
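As a minimal sketch, assuming an active SparkR session and that out_sd is a SparkDataFrame as above, describe() is the equivalent call; both it and SparkR::summary() return a small SparkDataFrame of statistics that head() pulls back to the driver:

```
library(SparkR)

# describe() computes count, mean, stddev, min, and max per column;
# the result is itself a SparkDataFrame, so head() collects it locally
stats <- describe(out_sd)
head(stats)

# equivalent call; the SparkR:: prefix avoids base R's summary() masking it
head(SparkR::summary(out_sd))
```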
We can also compare this summary with the summary of the original Pima Indians dataframe and see that the simulation has done a good job of estimating the means. The number of observations is approximately 1,000 times the original size, and the ratio of diabetic to nondiabetic patients has been preserved:
# compare with original dataset
summary(PimaIndiansDiabetes[,])