Dealing with unbalanced datasets
At this point, I hope you have realized why data preparation is probably the longest part of our work. We have learned about data transformation, missing data values, and outliers, but the list of problems goes on. Don't worry – bear with me and let's master this topic together!
Another well-known problem with ML models, particularly in binary classification, is unbalanced classes. We say that a dataset is unbalanced when most of its observations belong to the same class, that is, to the same value of the target variable.
This is very common in fraud identification systems, for example, where most events are regular operations and only a very small number are fraudulent. In this case, we can also say that fraud is a rare event.
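A quick way to see how unbalanced a target variable is is to compute the share of observations in each class. The snippet below is only a minimal sketch: the `class_distribution` helper and the toy `labels` list (990 regular operations and 10 fraudulent ones) are made up for illustration and do not come from any real dataset.

```python
from collections import Counter


def class_distribution(labels):
    """Return a dict mapping each class to its fraction of the observations."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical toy target: 0 = regular operation, 1 = fraudulent operation.
labels = [0] * 990 + [1] * 10

distribution = class_distribution(labels)

# The majority class has the largest share, the minority class the smallest.
majority_class = max(distribution, key=distribution.get)
minority_class = min(distribution, key=distribution.get)

for cls, share in sorted(distribution.items()):
    print(f"class {cls}: {share:.1%}")
```

In this toy example, the fraudulent class accounts for only 1% of the observations, which is the kind of rare event described above.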
There is no strict rule for deciding whether a dataset is unbalanced enough to be worth worrying about. Most challenge problems...