While OneR is very good at determining simple classification rules, it is not able to construct full decision trees. However, we can extract a sample from Spark and route it to any R decision tree algorithm, such as rpart.
To illustrate this, let's first take a 50% sample of the stop and frisk dataframe. We also want to make sure that the amount of data we extract can be processed easily by base R, which holds its data in memory and is therefore limited by the RAM available on the machine.
- The code below will first extract a 50% sample from Spark and store it in a local R dataframe named `dflocal`.
- Then it will run an `str()` command to verify the row count and the metadata:
    dflocal <- collect(sample(df, FALSE, 0.50, 123))
    str(dflocal)
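The output indicates that there are 11,311 rows, which is roughly 50% of the 22,563 rows in the stop and frisk data. The count is not exactly half because Spark's sample() treats the fraction as approximate rather than as an exact row count.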
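Once the sample is local, it can be passed to rpart like any other R dataframe. The following is a minimal sketch of that step, not the model built on the stop and frisk data: it assumes the built-in iris dataset as a stand-in for dflocal, and the names dflocal_demo, n_rows, and size_bytes are hypothetical. Note that base R's sample() draws an exact number of rows, unlike the Spark sampling used above.

```r
library(rpart)  # recommended package that ships with standard R installs

# Stand-in for dflocal (assumption): iris is used only because it is built in.
set.seed(123)
dflocal_demo <- iris[sample(nrow(iris), size = floor(0.5 * nrow(iris))), ]

# Check the row count and the in-memory size before modelling.
n_rows <- nrow(dflocal_demo)
size_bytes <- as.numeric(object.size(dflocal_demo))
str(dflocal_demo)

# Fit a classification tree; Species is the target column of the stand-in data.
fit <- rpart(Species ~ ., data = dflocal_demo, method = "class")
print(fit)
```

With the real sample, the formula would name the outcome column of the stop and frisk data in place of Species, and object.size() gives a quick check that the extract fits comfortably in memory.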