Next, we will combine the training (grp=1) and testing (grp=0) datasets into one dataframe and manually calculate some accuracy statistics:
preds$error: this is the absolute difference between the outcome (0,1) and the prediction. Recall that for a binary regression model, the prediction represents the probability that the event (diabetes) will occur.
preds$errorsqr: this is the calculated squared error. This is done in order to remove the sign.
preds$correct: in order to classify each prediction as correct or not correct, we will compare the error to a .5 cutoff. If the error is small (<= .5) we will call the prediction correct; otherwise it will be considered not correct. This cutoff is somewhat arbitrary; it simply determines which category the prediction is placed in.
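The three statistics above can be sketched in a few lines of base R. This is only an illustration on made-up numbers: the column names outcome and prediction and the toy train and test data frames are assumptions, not names taken from the original analysis.

```r
# Toy stand-ins for the training and testing predictions (values are made up).
# 'outcome' and 'prediction' are hypothetical column names.
train <- data.frame(outcome = c(1, 0, 1, 0),
                    prediction = c(0.80, 0.30, 0.40, 0.50),
                    grp = 1)
test <- data.frame(outcome = c(0, 1, 1),
                   prediction = c(0.10, 0.70, 0.20),
                   grp = 0)

# Combine both sets into one data frame; grp records where each row came from.
preds <- rbind(train, test)

# Absolute difference between the observed outcome and the predicted probability.
preds$error <- abs(preds$outcome - preds$prediction)

# Squared error.
preds$errorsqr <- preds$error^2

# Classify as correct (1) when the error is less than or equal to .5, else 0.
preds$correct <- ifelse(preds$error <= 0.5, 1, 0)

print(preds)
```

With an absolute error, the .5 cutoff amounts to asking whether the predicted probability falls on the same side of .5 as the observed outcome.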
As a final step, we will once again separate the data back into test and training based upon the grp flag:
#classify 'correct' prediction if error is less than or equal to .5 preds...
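As a rough sketch of the separation step, the combined data frame can be split on the grp flag with subset(). The data frame below is a made-up example, and the names preds.train, preds.test, train.accuracy, and test.accuracy are hypothetical, chosen only for illustration.

```r
# Made-up combined predictions: grp = 1 is training, grp = 0 is testing.
preds <- data.frame(
  grp     = c(1, 1, 1, 1, 0, 0, 0),
  error   = c(0.2, 0.3, 0.6, 0.5, 0.1, 0.3, 0.8),
  correct = c(1, 1, 0, 1, 1, 1, 0)
)

# Separate the rows back into training and testing sets using the grp flag.
preds.train <- subset(preds, grp == 1)
preds.test  <- subset(preds, grp == 0)

# The share of correct predictions in each set is the mean of the 0/1 flag.
train.accuracy <- mean(preds.train$correct)
test.accuracy  <- mean(preds.test$correct)

print(c(train = train.accuracy, test = test.accuracy))
```

Comparing the two proportions side by side gives a quick check on whether the model performs noticeably worse on the testing data than on the data it was fitted to.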