Performance metrics


In the previous section, where we talked about the predictive modeling process, we delved into the importance of assessing a trained model's performance using training and test data sets. In this section, we will look at specific measures of performance that we will frequently encounter when describing the predictive accuracy of different models. It turns out that, depending on the type of problem, we will need slightly different ways of assessing performance. As we focus on supervised models in this book, we will look at how to assess regression models and classification models. For classification models, we will also discuss some additional metrics used for the binary classification task, which is a very important and frequently encountered type of problem.

Assessing regression models

In a regression scenario, let's recall that through our model we are building a function that is an estimate of a theoretical underlying target function f. The model's inputs are the values of our chosen input features. If we apply this function to every observation, xi, in our training data, which is labeled with the true value of the function, yi, we will obtain a set of pairs (yi, ŷi), where ŷi is the value our model predicts for xi. To make sure we are clear on this last point, the first entry is the actual value of the output variable in our training data for the ith observation, and the second entry is the predicted value for this particular observation, produced by applying our model to the feature values of this observation.

If our model has fit the data well, both values will be very close to each other in the training set. If this is also true for our test set, then we consider that our model is likely to perform well for future unseen observations. To quantify the notion that the predicted and correct values are close to each other for all the observations in a data set, we define a measure known as the Mean Square Error (MSE), as follows:
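
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2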

Here, n is the total number of observations in the data set. Consequently, this equation tells us to first compute the squared difference between the actual output value and the predicted value for every observation, i, in the test set, and then take the average of all these values by summing them up and dividing by the number of observations. Thus, it should be clear why this measure is called the mean square error. The lower this number, the lower the average error between the actual values of the output variable in our observations and our predictions, and therefore, the more accurate our model. We sometimes make reference to the Root Mean Square Error (RMSE), which is just the square root of the MSE, and the Sum of Squared Error (SSE), which is similar to the MSE but without the normalization that results from dividing by the number of training examples, n.

These quantities, when computed on the training data set, are valuable in the sense that a low number will indicate that we have trained a model sufficiently well. We know that we aren't expecting this to be zero in general, and we also cannot decide between models on the basis of these quantities because of the problem of overfitting. The key place to compute these measures is on the test data. In a majority of cases, a model's training data MSE (or equally, RMSE or SSE) will be lower than the corresponding measure computed on the test data. A model m1 that overfits the data compared to another model m2 can often be identified as such when model m1 produces a lower training MSE but a higher test MSE than model m2.
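
To make these definitions concrete, the following is a minimal R sketch of all three quantities, using two hypothetical numeric vectors, actual and predicted, that hold the true and predicted output values for a data set:

# Hypothetical vectors of true and predicted output values
actual    <- c(2.3, 4.1, 3.8, 5.0, 4.4)
predicted <- c(2.5, 3.9, 4.0, 4.6, 4.5)

sse  <- sum((actual - predicted) ^ 2)  # Sum of Squared Error
mse  <- sse / length(actual)           # Mean Square Error
rmse <- sqrt(mse)                      # Root Mean Square Error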

Assessing classification models

In regression models, the degree to which our predicted function incorrectly approximates an output, yi, for a particular observation, xi, is taken into account by the MSE. Specifically, large errors are squared, so a very large deviation on one data point can have a more significant impact than a few small deviations spread across several data points. It is precisely because we are dealing with a numerical output in regression that we can measure not only which observations we are predicting poorly, but also how far off we are.

For models that perform classification, we can again define an error rate, but here we can only talk about the number of misclassifications that were made by our model. Specifically, we have an error rate given by:
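
\text{error rate} = \frac{1}{n} \sum_{i=1}^{n} I(\hat{y}_i \neq y_i)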

This measure uses the indicator function to return the value of 1 when the predicted class is not the same as the labeled class. Thus, the error rate is computed by counting the number of times the class of the output variable is incorrectly predicted, and dividing this count by the number of observations in the data set. In this way, we can see that the error rate is actually the percentage of misclassified observations made by our model. It should be noted that this measure treats all types of misclassifications as equal. If the cost of some misclassifications is higher than others, then this measure can be adjusted by adding in weights that multiply each misclassification by an amount proportional to its cost.
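
As a minimal sketch of this computation in R, using hypothetical character vectors actual_labels and predicted_labels that hold the true and predicted classes:

# Hypothetical vectors of true and predicted class labels
actual_labels    <- c("a", "b", "b", "a", "b")
predicted_labels <- c("a", "b", "a", "a", "a")

# The comparison plays the role of the indicator function; taking the mean
# divides the number of misclassifications by the number of observations
error_rate <- mean(predicted_labels != actual_labels)  # 0.4 in this example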

If we want to diagnose the greatest source of error in a regression problem, we tend to look at the points for which we have the largest error between our predicted value and the actual value. When doing classification, it is often very useful to compute what is known as the confusion matrix. This is a matrix that shows all pairwise misclassifications that were made on our data. We shall now return to our iris species classification problem. In a previous section, we trained three kNN models; we'll now see how we can assess their performance. Like many classification models, kNN can return predictions either as final class labels or via a set of scores pertaining to each possible output class. Sometimes, as is the case here, these scores are actually probabilities that the model has assigned to every possible output. Regardless of whether the scores are actual probabilities, we can decide which output label to pick on the basis of these scores, typically by simply choosing the label with the highest score. In R, the most common function to make model predictions is the predict() function, which we will use with our kNN models:

> knn_predictions_prob <- predict(knn_model, iris_test, type = "prob")
> tail(knn_predictions_prob, n = 3)
      setosa versicolor virginica
[28,]      0        0.0       1.0
[29,]      0        0.4       0.6
[30,]      0        0.0       1.0

In the kNN model, we can assign output scores as direct probabilities by computing the proportion of the nearest neighbors that belong to each output label. In the three test examples shown, the virginica species has unit probability in two of them, but only 60 percent probability for the remaining example. The other 40 percent belongs to the versicolor species, so it seems that in the latter case, three out of five nearest neighbors were of the virginica species, whereas the other two were of the versicolor species. It is clear that we should be more confident about the two former classifications than the latter. We'll now compute class predictions for the three models on the test data:

> knn_predictions <- predict(knn_model, iris_test, type = "class")
> knn_predictions_z <- predict(knn_model_z, iris_test_z, type = "class")
> knn_predictions_pca <- predict(knn_model_pca, iris_test_pca, type = "class")

We can use the postResample() function from the caret package to display test set accuracy metrics for our models:

> postResample(knn_predictions, iris_test_labels)
 Accuracy     Kappa 
0.9333333 0.9000000 
> postResample(knn_predictions_z, iris_test_labels)
 Accuracy     Kappa 
0.9666667 0.9500000 
> postResample(knn_predictions_pca, iris_test_labels)
Accuracy    Kappa 
    0.90     0.85

Here, accuracy is one minus the error rate and is thus the percentage of correctly classified observations. We can see that all the models perform very similarly in terms of accuracy, with the model that uses Z-score normalization performing best. This difference is not significant given the small size of the test set. The Kappa statistic is defined as follows:
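
\kappa = \frac{\text{observed accuracy} - \text{expected accuracy}}{1 - \text{expected accuracy}}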

The Kappa statistic is designed to counterbalance the effect of random chance and takes values in the interval [-1, 1], where 1 indicates perfect accuracy, -1 indicates perfect inaccuracy, and 0 occurs when the accuracy is exactly what would be obtained by a random guesser. Note that a random guesser for a classification model guesses the most frequent class. In the case of our iris classification model, the three species are equally represented in the data, and so the expected accuracy is one third. The reader is encouraged to check that by using this value for the expected accuracy, we can obtain the observed values of the Kappa statistic from the accuracy values.
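
As a quick sketch of this check, using the accuracy of the first model (28 of the 30 test observations correctly classified) and an expected accuracy of one third:

observed <- 28 / 30   # accuracy reported by postResample() for the first model
expected <- 1 / 3     # accuracy of a random guesser when classes are balanced
kappa    <- (observed - expected) / (1 - expected)
kappa                 # 0.9, matching the Kappa value reported above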

We can also examine the specific misclassifications that our model makes, using a confusion matrix. This can simply be constructed by cross-tabulating the predictions with the correct output labels:

> table(knn_predictions, iris_test_labels) 
               iris_test_labels
knn_predictions setosa versicolor virginica
     setosa         10          0         0
     versicolor      0          9         1
     virginica       0          1         9

Tip

The caret package also has the very useful confusionMatrix() function, which automatically computes this table as well as several other performance metrics, the explanation of which can be found at http://topepo.github.io/caret/other.html.
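
For instance, the following call (storing the result in a hypothetical knn_confusion object) computes this table together with those additional metrics for the first kNN model:

> knn_confusion <- confusionMatrix(knn_predictions, iris_test_labels)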

In the preceding confusion matrix, we can see that the total number of correctly classified observations is 28, which is the sum of the numbers 10, 9, and 9 on the leading diagonal. The table shows us that the setosa species seems to be the easiest to predict with our model, as it is never confused with the other species. The versicolor and virginica species, however, can be confused with each other, and the model has misclassified one instance of each. Computing the confusion matrix is therefore a useful exercise; spotting class pairs that are frequently confused will guide us in improving our model, for example by looking for features that might help distinguish these classes.

Assessing binary classification models

A special case of classification, known as binary classification, occurs when we have exactly two classes. Here are some typical binary classification scenarios:

  • We want to classify incoming e-mails as spam or not spam using the e-mail's content and header

  • We want to classify a patient as having a disease or not using their symptoms and medical history

  • We want to classify a document from a large database of documents as being relevant to a search query, based on the words in the query and the words in the document

  • We want to classify a product from an assembly line as faulty or not

  • We want to predict whether a customer applying for credit at a bank will default on their payments, based on their credit score and financial situation

In a binary classification task, we usually refer to our two classes as the positive class and the negative class. By convention, the positive class corresponds to a special case that our model is trying to predict, and is often rarer than the negative class. From the preceding examples, we would use the positive class label for our spam e-mails, faulty assembly line products, defaulting customers, and so on. Now consider an example in the medical diagnosis domain, where we are trying to train a model to diagnose a disease that we know is only present in 1 in 10,000 of the population. We would assign the positive class to patients that have this disease. Notice that in such a scenario, the error rate alone is not an adequate measure of a model. For example, we can design the simplest of classifiers that will have an error rate of only 0.01 percent by predicting that every patient will be healthy, but such a classifier would be useless. We can come up with more useful metrics by examining the confusion matrix. Suppose that we had built a model to diagnose our rare disease and on a test sample of 100,000 patients, we obtained the following confusion matrix:

> table(actual,predicted)
          predicted
actual     negative positive
  negative    99900       78
  positive        9       13

The binary classification problem is so common that the cells of the binary confusion matrix have their own names. On the leading diagonal, which contains the correctly classified entries, we refer to the elements as the true negatives and true positives. In our case, we had 99900 true negatives and 13 true positives. When we misclassify an observation as belonging to the positive class when it actually belongs to the negative class, then we have a false positive, also known as a Type I error. A false negative or Type II error occurs when we misclassify a positive observation as belonging to the negative class. In our case, our model had 78 false positives and 9 false negatives.
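
To ground this terminology in code, the following sketch rebuilds the preceding confusion matrix directly as an R matrix (a hypothetical stand-in for the output of table()) and extracts the four cells by name:

# The confusion matrix laid out as in the preceding table
# (rows: actual class, columns: predicted class)
confusion <- matrix(c(99900, 9, 78, 13), nrow = 2,
                    dimnames = list(actual    = c("negative", "positive"),
                                    predicted = c("negative", "positive")))

TN <- confusion["negative", "negative"]   # true negatives:  99900
FP <- confusion["negative", "positive"]   # false positives: 78
FN <- confusion["positive", "negative"]   # false negatives: 9
TP <- confusion["positive", "positive"]   # true positives:  13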

We'll now introduce two very important measures in the context of binary classification: precision and recall. Precision is defined as the ratio of the number of correctly predicted instances of the positive class to the total number of predicted instances of the positive class. Using the labels from the preceding binary confusion matrix, precision is given by:
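
\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}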

Precision, thus, essentially measures how accurate we are in making predictions for the positive class. By definition, we can achieve 100 percent precision by never making any predictions for the positive class, as this way we are guaranteed to never make any mistakes. Recall, by contrast, is defined as the number of correct predictions for the positive class over all the members of the positive class in our data set. Once again, using the labels from the binary confusion matrix, we can see the definition of recall as:
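
\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}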

Recall measures our ability to identify all the positive class members from our data set. We can easily achieve maximum recall by always predicting the positive class for all our data points. We will make a lot of mistakes, but we will never have any false negatives. Notice that precision and recall form a tradeoff in our model performance. At one end, if we don't predict the positive class for any of our data points, we will have 0 recall but maximum precision. At the other end, if all our data points are predicted as belonging to the positive class (which, remember, is usually a rare class), we will have maximum recall but extremely low precision. Put differently, trying to reduce the Type I error leads to increasing the Type II error and vice-versa. This inverse relationship is often plotted for a particular problem on a precision-recall curve. By using an appropriate threshold parameter, we can often tune the performance of our model in such a way that we achieve a specific point on this precision-recall curve that is appropriate for our circumstances. For example, in some problem domains, we tend to be biased toward having a higher recall than a higher precision, because of the high cost of misclassifying an observation from the positive class into the negative class. As we often want to describe the performance of a model using a single number, we define a measure known as the F1 score, which combines precision and recall. Specifically, the F1 score is defined as the harmonic mean between precision and recall:
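
F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}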

The reader should verify that in our example confusion matrix, precision is 14.3 percent, recall is 59.1 percent, and the F1 score is 0.23.
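
The following short sketch carries out this verification using the counts from our example confusion matrix:

TP <- 13   # true positives
FP <- 78   # false positives
FN <- 9    # false negatives

precision <- TP / (TP + FP)                                 # 0.1428571, or 14.3 percent
recall    <- TP / (TP + FN)                                 # 0.5909091, or 59.1 percent
f1        <- 2 * precision * recall / (precision + recall)  # 0.2300885, roughly 0.23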