Evaluation
Once we have chosen a dataset, we will want to use it to train a classifier and see how well that classifier works. Assume that we have a dataset stored in the dataset
variable and a classifier stored in classifier
. The first thing we have to do is to split the dataset into two parts—one, stored in training
, to be used for training the classifier, and one, stored in testing
, for testing it. There are two obvious constraints on the way we do this split, as outlined here:
training
andtesting
must be disjoint. This is essential. If they are not, then there is a trivial classifier that will get everything 100% correct—namely, just remember all the examples you have seen. Even ignoring this trivial case, classifiers will generally perform better on datasets that they have been trained on than on unseen cases, but when a classifier is deployed in the field, the vast majority of cases will be unknown to it, so testing should always be done on unseen data...