In the first section of this chapter, we described Bayes' theorem. Recall that it is given by the following:

P(A|B) = P(B|A)P(A) / P(B)
Let's rewrite Bayes' theorem in terms that are more natural for a classification task:

P(y|x1, ..., xn) = P(y)P(x1, ..., xn|y) / P(x1, ..., xn)
In the preceding formula, y is the positive class, x1 is the first feature for the instance, and n is the number of features. The denominator, P(x1, ..., xn), is constant for all inputs, so we can omit it; the probability of observing a particular combination of features does not vary between test instances. This leaves two terms: the prior class probability, P(y), and the conditional probability, P(x1, ..., xn|y). Under the naive assumption that the features are conditionally independent given the class, P(x1, ..., xn|y) factors into the product of the individual P(xi|y) terms. Naive Bayes estimates these terms using maximum a posteriori estimation. P(y) is simply the frequency of each class in the training set. For categorical features, P(xi|y) is simply the frequency of the feature value in the training instances belonging to that class. Its estimate is given by the following formula:

P(xi|y) = N_{xi,y} / N_y
The numerator is the number of times that the feature appears in training samples of class...
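These frequency estimates can be sketched in a few lines of plain Python. The weather-style dataset and all names below are invented for illustration; this is a minimal sketch of the counting described above, not a production classifier (it omits smoothing for unseen feature values):

```python
from collections import Counter, defaultdict

# Toy categorical training set (invented for illustration):
# each instance is a tuple of feature values paired with a class label.
X = [
    ("sunny", "hot"),
    ("sunny", "mild"),
    ("rainy", "mild"),
    ("rainy", "hot"),
]
y = ["no", "no", "yes", "no"]

# P(y): the frequency of each class in the training set.
class_counts = Counter(y)
prior = {c: class_counts[c] / len(y) for c in class_counts}

# Counts for P(xi|y): feature_counts[(class, i)][value] is the number of
# times feature i takes that value among training instances of the class.
feature_counts = defaultdict(Counter)
for xs, c in zip(X, y):
    for i, v in enumerate(xs):
        feature_counts[(c, i)][v] += 1

def conditional(value, i, c):
    """Estimate P(x_i = value | y = c) as a relative frequency."""
    return feature_counts[(c, i)][value] / class_counts[c]

def posterior_scores(xs):
    """Unnormalized P(y) * product of P(x_i | y), per class."""
    scores = {}
    for c in prior:
        p = prior[c]
        for i, v in enumerate(xs):
            p *= conditional(v, i, c)
        scores[c] = p
    return scores
```

For the instance ("rainy", "mild"), the "yes" class wins: its prior is smaller (1/4 versus 3/4), but both conditional frequencies are 1 for "yes" and only 1/3 for "no", so the product favors "yes".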