Once you have extracted your sample, you can run ordinary R functions such as pairs to generate a scatterplot matrix, or use the reshape2 package along with ggplot2 to generate a correlation plot.
# Take the data frame we brought back from Spark with collect() and draw a basic scatterplot matrix
pairs(samp[,3:8], col=samp$outcome)
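Since pairs only draws the scatterplots, it can help to compute the numeric correlations alongside it. The following is a minimal sketch on synthetic data; the data frame samp_demo and its column names are hypothetical stand-ins for the collected sample, not the book's actual data.

```r
# Hypothetical stand-in for the collected sample; column names are assumptions.
set.seed(42)
n <- 200
samp_demo <- data.frame(
  age     = rnorm(n, mean = 40, sd = 10),
  glucose = rnorm(n, mean = 120, sd = 30),
  outcome = sample(0:1, n, replace = TRUE)
)
# Make bmi depend loosely on age so at least one pair is visibly correlated
samp_demo$bmi <- 20 + 0.1 * samp_demo$age + rnorm(n, sd = 3)

num_cols <- c("age", "glucose", "bmi")

# Scatterplot matrix colored by outcome; add 1 because color index 0 is not drawn
pairs(samp_demo[, num_cols], col = samp_demo$outcome + 1)

# Numeric correlation matrix for the same columns, rounded to two decimals
cormat_demo <- round(cor(samp_demo[, num_cols]), 2)
print(cormat_demo)
```

Restricting cor to the numeric columns avoids the error it raises when a data frame contains character or factor columns.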
Here is a more sophisticated visualization that uses ggplot to draw the correlation matrix, with shading indicating the degree of correlation between each pair of variables. Again, the point is to emphasize that you can perform analysis outside of Spark if your sample size is reasonable and the exact functionality you need is not available in the version of Spark you are running.
require(ggplot2)
library(reshape2)
cormatrix <- round(cor(samp),2)
cormatrix_melt <- melt(cormatrix)
head(cormatrix_melt)
ggplot(data = cormatrix_melt...
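For readers who want to try the shaded correlation plot end to end, here is one possible self-contained sketch. It is not the book's exact plotting call: it uses the built-in mtcars data set as a stand-in for the sample, and the choice of geom_tile with a diverging color scale is an assumption about how the shading might be done.

```r
library(ggplot2)
library(reshape2)

# Built-in mtcars stands in for the collected sample (an assumption for illustration)
cormatrix_demo <- round(cor(mtcars[, c("mpg", "disp", "hp", "wt")]), 2)

# melt() turns the matrix into long format with columns Var1, Var2, value
melt_demo <- melt(cormatrix_demo)

# One tile per variable pair, shaded by the strength and sign of the correlation
p <- ggplot(melt_demo, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1), name = "Correlation") +
  geom_text(aes(label = value), size = 3) +
  labs(x = NULL, y = NULL) +
  theme_minimal()

print(p)
```

Fixing the scale limits at -1 and 1 keeps the shading comparable across different samples, which may be useful if you redraw the plot after each new extract from Spark.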