This example uses the much larger diabetes dataset. Since most of the variables in this dataset are numeric, OneR
can bin all of them:
- First, read the Spark diabetes table using SQL, which has already been registered in a previous chapter.
- Collect a 15% random sample of the data and assign it to an R (not Spark!) dataframe named "local".
- Bin all of the available variables based upon their ability to predict the outcome and assign it to an R dataframe named "data":
library(OneR) df = sql("SELECT outcome, age, mass, triceps, pregnant, glucose, pressure, insulin, pedigree FROM global_temp.df_view") local = collect(sample(df, F,.15)) data <- optbin(local,outcome~.) summary(data)
- Run the
OneR
model using all of the variables to predict the outcome. Recall that the outcome is an indication of whether or not diabetes is present:
model <- OneR(data, outcome~., verbose = TRUE) summary(model) ...