At the beginning of this section, we will try to classify the corpus using algorithms we have already discussed (Naïve Bayes and k-NN). We will then briefly discuss two new algorithms: logistic regression and support vector machines.
We already know k-Nearest Neighbors, so we'll jump straight into the classification, trying three neighbors and then five:
library(class)  # knn() is in the class package
library(caret)  # confusionMatrix() is in the caret package
set.seed(975)
# The training rows serve both as the reference set and as the rows to classify
Class3n = knn(TrainDF[,-1], TrainDF[,-1], TrainDF[,1], k = 3)
Class5n = knn(TrainDF[,-1], TrainDF[,-1], TrainDF[,1], k = 5)
confusionMatrix(Class3n, as.factor(TrainDF$quality))
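As a side note, the call pattern above classifies the same rows that were used for training, so each row counts itself among its nearest neighbors. The following minimal sketch contrasts that with a held-out evaluation. It uses the built-in iris data as a stand-in for the corpus data frame, and the names idx, train_x, test_x, and so on are assumptions made for illustration only:

```r
library(class)  # knn()

set.seed(975)
# iris stands in for the corpus: columns 1-4 are numeric features, column 5 is the class
idx     <- sample(nrow(iris), size = 100)
train_x <- iris[idx, 1:4]
train_y <- iris$Species[idx]
test_x  <- iris[-idx, 1:4]
test_y  <- iris$Species[-idx]

# Same pattern as above: the training rows are also the rows being classified
pred_train <- knn(train_x, train_x, train_y, k = 3)

# Held-out variant: classify rows the model has not seen
pred_test <- knn(train_x, test_x, train_y, k = 3)

# Cross-tabulate predictions against the true classes of the held-out rows
print(table(Predicted = pred_test, Actual = test_y))

acc_train <- mean(pred_train == train_y)  # share of correct predictions on training rows
acc_test  <- mean(pred_test == test_y)    # share of correct predictions on held-out rows
```

Comparing acc_train with acc_test gives a rough idea of how optimistic a training-set evaluation may be.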
The confusion matrix for Class3n and the statistics that follow (the output is only partially reproduced) show that classification of the training data with three neighbors doesn't seem too bad: the accuracy is 0.74. The kappa value, however, is not good (it should be at least 0.60):
Confusion Matrix and Statistics...
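To see how a reasonable accuracy can coexist with a weak kappa, here is a minimal sketch that computes both from a confusion matrix in base R. The counts and the class labels "neg" and "pos" are made up for illustration and are not the output reproduced above; cohen_kappa is a hypothetical helper, not a function from caret:

```r
# Cohen's kappa from a square confusion matrix (rows = predicted, columns = actual)
cohen_kappa <- function(cm) {
  n <- sum(cm)
  observed <- sum(diag(cm)) / n                      # observed agreement (accuracy)
  expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
  (observed - expected) / (1 - expected)
}

# Hypothetical counts, chosen only so that the accuracy comes out at 0.74
cm <- matrix(c(40, 10,
               16, 34),
             nrow = 2, byrow = TRUE,
             dimnames = list(Predicted = c("neg", "pos"),
                             Actual    = c("neg", "pos")))

accuracy    <- sum(diag(cm)) / sum(cm)
kappa_value <- cohen_kappa(cm)

print(accuracy)     # 0.74
print(kappa_value)  # 0.48
```

In this made-up case, half of the agreement would be expected by chance alone, so an accuracy of 0.74 translates into a kappa of only 0.48.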