-
Book Overview & Buying
-
Table Of Contents
Python Machine Learning by Example - Third Edition
By :
Beyond accuracy, there are several metrics we can use to gain more insight and to avoid class imbalance effects. These are as follows:
A confusion matrix summarizes testing instances by their predicted values and true values, presented as a contingency table:

Table 2.3: Contingency table for a confusion matrix
To illustrate this, we can compute the confusion matrix of our Naïve Bayes classifier. We use the confusion_matrix function from scikit-learn to compute it, but it is very easy to code it ourselves:
>>> from sklearn.metrics import confusion_matrix
>>> print(confusion_matrix(Y_test, prediction, labels=[0, 1]))
[[ 60 47]
[148 431]]
As you can see from the resulting confusion matrix, there are 47 false positive cases (where the model misinterprets a dislike as a like for a movie), and 148 false negative cases (where it fails to detect a like for a movie). Hence, classification accuracy is just the proportion of all true cases:

Precision measures the fraction of positive calls that are correct, which is
, and
in our case.
Recall, on the other hand, measures the fraction of true positives that are correctly identified, which is
and
in our case. Recall is also called the true positive rate.
The f1 score comprehensively includes both the precision and the recall, and equates to their harmonic mean:
. We tend to value the f1 score above precision or recall alone.
Let's compute these three measurements using corresponding functions from scikit-learn, as follows:
>>> from sklearn.metrics import precision_score, recall_score, f1_score
>>> precision_score(Y_test, prediction, pos_label=1)
0.9016736401673641
>>> recall_score(Y_test, prediction, pos_label=1)
0.7443868739205527
>>> f1_score(Y_test, prediction, pos_label=1)
0.815515610217597
On the other hand, the negative (dislike) class can also be viewed as positive, depending on the context. For example, assign the 0 class as pos_label and we have the following:
>>> f1_score(Y_test, prediction, pos_label=0)
0.38095238095238093
To obtain the precision, recall, and f1 score for each class, instead of exhausting all class labels in the three function calls as shown earlier, a quicker way is to call the classification_report function:
>>> from sklearn.metrics import classification_report
>>> report = classification_report(Y_test, prediction)
>>> print(report)
precision recall f1-score support
0.0 0.29 0.56 0.38 107
1.0 0.90 0.74 0.82 579
micro avg 0.72 0.72 0.72 686
macro avg 0.60 0.65 0.60 686
weighted avg 0.81 0.72 0.75 686
Here, weighted avg is the weighted average according to the proportions of the class.
The classification report provides a comprehensive view of how the classifier performs on each class. It is, as a result, useful in imbalanced classification, where we can easily obtain a high accuracy by simply classifying every sample as the dominant class, while the precision, recall, and f1 score measurements for the minority class, however, will be significantly low.
Precision, recall, and the f1 score are also applicable to multiclass classification, where we can simply treat a class we are interested in as a positive case, and any other classes as negative cases.
During the process of tweaking a binary classifier (that is, trying out different combinations of hyperparameters, for example, the smoothing factor in our Naïve Bayes classifier), it would be perfect if there was a set of parameters in which the highest averaged and class individual f1 scores are achieved at the same time. It is, however, usually not the case. Sometimes, a model has a higher average f1 score than another model, but a significantly low f1 score for a particular class; sometimes, two models have the same average f1 scores, but one has a higher f1 score for one class and a lower score for another class. In situations such as these, how can we judge which model works better? The area under the curve (AUC) of the receiver operating characteristic (ROC) is a consolidated measurement frequently used in binary classification.
The ROC curve is a plot of the true positive rate versus the false positive rate at various probability thresholds, ranging from 0 to 1. For a testing sample, if the probability of a positive class is greater than the threshold, then a positive class is assigned; otherwise, we use a negative class. To recap, the true positive rate is equivalent to recall, and the false positive rate is the fraction of negatives that are incorrectly identified as positive. Let's code and exhibit the ROC curve (under thresholds of 0.0, 0.1, 0.2, …, 1.0) of our model:
>>> pos_prob = prediction_prob[:, 1]
>>> thresholds = np.arange(0.0, 1.1, 0.05)
>>> true_pos, false_pos = [0]*len(thresholds), [0]*len(thresholds)
>>> for pred, y in zip(pos_prob, Y_test):
... for i, threshold in enumerate(thresholds):
... if pred >= threshold:
... # if truth and prediction are both 1
... if y == 1:
... true_pos[i] += 1
... # if truth is 0 while prediction is 1
... else:
... false_pos[i] += 1
... else:
... break
Then, let's calculate the true and false positive rates for all threshold settings (remember, there are 516.0 positive testing samples and 1191 negative ones):
>>> n_pos_test = (Y_test == 1).sum()
>>> n_neg_test = (Y_test == 0).sum()
>>> true_pos_rate = [tp / n_pos_test for tp in true_pos]
>>> false_pos_rate = [fp / n_neg_test for fp in false_pos]
Now, we can plot the ROC curve with Matplotlib:
>>> import matplotlib.pyplot as plt
>>> plt.figure()
>>> lw = 2
>>> plt.plot(false_pos_rate, true_pos_rate,
... color='darkorange', lw=lw)
>>> plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
>>> plt.xlim([0.0, 1.0])
>>> plt.ylim([0.0, 1.05])
>>> plt.xlabel('False Positive Rate')
>>> plt.ylabel('True Positive Rate')
>>> plt.title('Receiver Operating Characteristic')
>>> plt.legend(loc="lower right")
>>> plt.show()
Refer to Figure 2.8 for the resulting ROC curve:

Figure 2.8: ROC curve
In the graph, the dashed line is the baseline representing random guessing, where the true positive rate increases linearly with the false positive rate; its AUC is 0.5. The solid line is the ROC plot of our model, and its AUC is somewhat less than 1. In a perfect case, the true positive samples have a probability of 1, so that the ROC starts at the point with 100% true positive and 0% false positive. The AUC of such a perfect curve is 1. To compute the exact AUC of our model, we can resort to the roc_auc_score function of scikit-learn:
>>> from sklearn.metrics import roc_auc_score
>>> roc_auc_score(Y_test, pos_prob)
0.6857375752586637
What AUC value leads to the conclusion that a classifier is good? Unfortunately, there is no such "magic" number. We use the following rule of thumb as general guidelines: classification models achieving an AUC of 0.7 to 0.8 are considered acceptable, 0.8 to 0.9 are great, and anything above 0.9 are superb. Again, in our case, we are only using the very sparse movie rating data. Hence, an AUC of 0.69 is actually acceptable.
You have learned several classification metrics, and we will explore how to measure them properly and how to fine-tune our models in the next section.