In this section, we try out some commonly used operations. First, we run them with traditional R/dplyr,
and then we show the equivalent operations using the SparkR API:
> # Open the R shell and NOT the SparkR shell
> library(dplyr, warn.conflicts=FALSE)  # Load dplyr first
> # Perform a common, useful operation
> iris %>%
+   group_by(Species) %>%
+   summarise(avg_length = mean(Sepal.Length),
+             avg_width = mean(Sepal.Width)) %>%
+   arrange(desc(avg_length))
Source: local data frame [3 x 3]

     Species avg_length avg_width
      (fctr)      (dbl)     (dbl)
1  virginica      6.588     2.974
2 versicolor      5.936     2.770
3     setosa      5.006     3.428
> # Remove from R environment
> detach("package:dplyr", unload=TRUE)
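To make the two stages of this pipeline explicit, here is a minimal sketch of the same computation in base R, without dplyr or SparkR. It is only an illustration: the variable names avg and avg_sorted are our own, and it assumes nothing beyond the built-in iris data set.

```r
# Base R only: no dplyr or SparkR needed (uses the built-in iris data set).

# Step 1: group by Species and take the mean of each measurement column.
avg <- aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species,
                 data = iris, FUN = mean)

# Rename the aggregated columns to match the dplyr example
# (avg_length and avg_width are the names used there).
names(avg) <- c("Species", "avg_length", "avg_width")

# Step 2: sort by average sepal length, largest first.
avg_sorted <- avg[order(-avg$avg_length), ]
print(avg_sorted)
```

The result should contain the same three rows, in the same order, as the dplyr output: first a grouped aggregation, then a descending sort.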
This operation is very similar to a SQL GROUP BY followed by an ORDER BY. Its equivalent implementation in SparkR is also very similar to the dplyr
example. Look at the following example...