This example uses the much larger diabetes dataset. Since most of the variables in this dataset are numeric, OneR can bin all of them:
- First, read the Spark diabetes table using SQL, which has already been registered in a previous chapter.
- Collect a 15% random sample of the data and assign it to an R (not Spark!) dataframe named "local".
- Bin all of the available variables based upon their ability to predict the outcome and assign it to an R dataframe named "data":
library(OneR)
df = sql("SELECT outcome, age, mass, triceps, pregnant,
glucose, pressure, insulin, pedigree
FROM global_temp.df_view")
local = collect(sample(df, F,.15))
data <- optbin(local,outcome~.)
summary(data)

- Run the OneR model using all of the variables to predict the outcome. Recall that the outcome...