5.4 BALANCING THE TRAINING DATA SET
In some classification models, one of the target variable classes has a much lower relative frequency than the other classes. In such cases, balancing the training data set may be recommended. The purpose of balancing is to provide the classification algorithms with a rich selection of records for each category. In this way, the algorithms have a chance to learn about all types of records, not just those with a high frequency. For instance, suppose 1,000 of 100,000 credit card transactions are fraudulent. A classification model could achieve 99% accuracy simply by predicting “non‐fraudulent” for every transaction. Clearly, this model is useless: it never identifies a single fraudulent transaction, which is the only reason for building it.
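The arithmetic behind the credit card example can be checked with a short sketch. The labels below are synthetic and simply mirror the counts in the example; the variable names and the use of recall (the share of fraudulent records the model catches) as a second measure are illustrative assumptions, not part of the original discussion.

```python
# Synthetic labels mirroring the example: 1,000 fraudulent of 100,000 records.
n_total = 100_000
n_fraud = 1_000
labels = [1] * n_fraud + [0] * (n_total - n_fraud)  # 1 = fraudulent, 0 = not

# A "model" that predicts non-fraudulent for every transaction.
predictions = [0] * n_total

# Accuracy: share of all records predicted correctly.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / n_total

# Recall on the rare class: share of fraudulent records actually flagged.
fraud_caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = fraud_caught / n_fraud

print(f"accuracy = {accuracy:.2%}")
print(f"recall on fraudulent class = {recall:.2%}")
```

The accuracy looks excellent only because the non-fraudulent class dominates the data; the measure that matters for the rare class is zero.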
Instead, the analyst should balance the training set so that the proportion of fraudulent transactions is increased. This balancing is achieved through resampling a number of the fraudulent (rare) records.
Resampling is the process of sampling at random and with replacement from a data set...
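One way to sketch resampling of the rare records is shown below. It assumes the analyst has chosen a target proportion p of rare records (the 25% used here is purely hypothetical) and finds the number x of resampled records by solving (rare + x) / (records + x) = p. The function names `resamples_needed` and `balance`, the record layout, and the fixed seed are illustrative assumptions.

```python
import math
import random


def resamples_needed(n_records, n_rare, target_p):
    """Rare records to add so that they make up target_p of the data set.

    Solves (n_rare + x) / (n_records + x) = target_p for x.
    """
    x = (target_p * n_records - n_rare) / (1 - target_p)
    return max(0, math.ceil(x))


def balance(records, is_rare, target_p, seed=0):
    """Return records plus rare records drawn at random with replacement."""
    rare = [r for r in records if is_rare(r)]
    if not rare:
        raise ValueError("no rare records to resample")
    x = resamples_needed(len(records), len(rare), target_p)
    rng = random.Random(seed)        # fixed seed so the sketch is repeatable
    extra = rng.choices(rare, k=x)   # random sampling with replacement
    return records + extra


# Synthetic training set: (id, label) pairs, 1,000 fraudulent of 100,000.
training = [(i, "fraud" if i < 1_000 else "ok") for i in range(100_000)]

# Hypothetical target: fraudulent records should be 25% of the training set.
balanced = balance(training, lambda r: r[1] == "fraud", target_p=0.25)

n_fraud = sum(1 for r in balanced if r[1] == "fraud")
print(len(balanced), n_fraud, n_fraud / len(balanced))
```

Because the draw is with replacement, the same fraudulent record may appear several times in the balanced set. Only the training set should be treated this way; this sketch does not touch any test data.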