Machine Learning for Imbalanced Data

By: Kumar Abhishek, Dr. Mounir Abdelaziz

Overview of this book

Machine learning practitioners often encounter imbalanced datasets, in which one class has considerably fewer instances than the others. Many machine learning algorithms implicitly assume that classes are roughly balanced, so they tend to perform poorly on the minority class when that assumption fails. This guide shows you how to address class imbalance and improve model performance. Machine Learning for Imbalanced Data begins by introducing the challenges posed by imbalanced datasets and explaining why they matter. It then walks through techniques that improve the performance of classical machine learning models on imbalanced data, including various sampling and cost-sensitive learning methods. As you progress, you’ll explore similar and more advanced techniques for deep learning models, with PyTorch as the primary framework. Throughout the book, hands-on examples provide working, reproducible code that demonstrates the practical implementation of each technique. By the end of this book, you’ll be able to identify and address class imbalance and confidently apply techniques such as sampling, cost-sensitive learning, and threshold adjustment with both traditional machine learning and deep learning models.

Preface

Who this book is for

What this book covers

📌 Usage of techniques – In production tips

To get the most out of this book

Download the example code files

Conventions used

Get in touch

Share Your Thoughts

Download a free PDF copy of this book

Chapter 1: Introduction to Data Imbalance in Machine Learning

Technical requirements

Introduction to imbalanced datasets

Machine learning 101

Types of dataset and splits

Common evaluation metrics

Challenges and considerations when dealing with imbalanced data

When can we have an imbalance in datasets?

Why can imbalanced data be a challenge?

When to not worry about data imbalance

Introduction to the imbalanced-learn library

General rules to follow

Summary

Questions

References

Chapter 2: Oversampling Methods

Technical requirements

What is oversampling?

Random oversampling

SMOTE

SMOTE variants

ADASYN

Model performance comparison of various oversampling methods

Guidance for using various oversampling techniques

Oversampling in multi-class classification

Summary

Exercises

References

Chapter 3: Undersampling Methods

Technical requirements

Introducing undersampling

When to avoid undersampling the majority class

Removing examples uniformly

Strategies for removing noisy observations

Strategies for removing easy observations

Summary

Exercises

References

Chapter 4: Ensemble Methods

Technical requirements

Bagging techniques for imbalanced data

Boosting techniques for imbalanced data

Ensemble of ensembles

Model performance comparison

Summary

Questions

References

Chapter 5: Cost-Sensitive Learning

Technical requirements

The concept of Cost-Sensitive Learning

Understanding costs in practice

Cost-Sensitive Learning for logistic regression

Cost-Sensitive Learning for decision trees

Cost-Sensitive Learning using scikit-learn and XGBoost models

MetaCost – making any classification model cost-sensitive

Threshold adjustment

Summary

Questions

References

Chapter 6: Data Imbalance in Deep Learning

Technical requirements

A brief introduction to deep learning

Data imbalance in deep learning

Overview of deep learning techniques to handle data imbalance

Multi-label classification

Summary

Questions

References

Chapter 7: Data-Level Deep Learning Methods

Technical requirements

Preparing the data

Sampling techniques for deep learning models

Data-level techniques for text classification

Discussion of other data-level deep learning methods and their key ideas

Summary

Questions

References

Chapter 8: Algorithm-Level Deep Learning Techniques

Technical requirements

Motivation for algorithm-level techniques

Weighting techniques

Explicit loss function modification

Discussing other algorithm-based techniques

Summary

Questions

References

Chapter 9: Hybrid Deep Learning Methods

Technical requirements

Using graph machine learning for imbalanced data

Hard example mining

Minority class incremental rectification

Summary

Questions

References

Chapter 10: Model Calibration

Technical requirements

Introduction to model calibration

The influence of data balancing techniques on model calibration

Plotting calibration curves for a model trained on a real-world dataset

Model calibration techniques

The impact of calibration on a model’s performance

Summary

Questions

References

Assessments

Chapter 1 – Introduction to Data Imbalance in Machine Learning

Chapter 2 – Oversampling Methods

Chapter 3 – Undersampling Methods

Chapter 4 – Ensemble Methods

Chapter 5 – Cost-Sensitive Learning

Chapter 6 – Data Imbalance in Deep Learning

Chapter 7 – Data-Level Deep Learning Methods

Chapter 8 – Algorithm-Level Deep Learning Techniques

Chapter 9 – Hybrid Deep Learning Methods

Chapter 10 – Model Calibration

Index

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts

Download a free PDF copy of this book

Appendix: Machine Learning Pipeline in Production

Machine learning training pipeline

Inferencing (online or batch)

Preface

Hello and welcome! Machine Learning (ML) enables computers to learn from data using algorithms to make informed decisions, automate tasks, and extract valuable insights. One particular aspect that often garners attention is imbalanced data, where certain classes may have considerably fewer samples than others.

This book provides an in-depth guide to understanding and navigating the intricacies of skewed data. You will gain insights into best practices for managing imbalanced datasets in ML contexts.

While imbalanced data can present challenges, it’s important to understand that the techniques to address this imbalance are not universally applicable. Their relevance and necessity depend on various factors such as the domain, the data distribution, the performance metrics you’re optimizing, and the business objectives. Before adopting any techniques, it’s essential to establish a baseline. Even if you don’t currently face issues with imbalanced data, it can be beneficial to be aware of the challenges and solutions discussed in this book. Familiarizing yourself with these techniques will provide you with a comprehensive toolkit, preparing you for scenarios that you may not yet know you’ll encounter. If you do find that model performance is lacking, especially for underrepresented (minority) classes, the insights and strategies covered in the book can be instrumental in guiding effective improvements.
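To make the point about baselines concrete, here is a minimal sketch in plain Python. It is not taken from the book: the toy labels (95 majority, 5 minority) and the helper names `majority_baseline`, `accuracy`, and `recall` are hypothetical and used only for illustration. It shows how a baseline that always predicts the majority class can score high accuracy while never finding a single minority example.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label (a baseline that ignores all features)."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of true `positive` examples that were predicted as `positive`."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical toy labels: 95 majority-class (0) and 5 minority-class (1) examples.
y_true = [0] * 95 + [1] * 5

# The baseline predicts the majority label for every example.
pred_label = majority_baseline(y_true)
y_pred = [pred_label] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive=1)

print(f"accuracy: {baseline_accuracy:.2f}, minority recall: {minority_recall:.2f}")
```

Under these assumptions, any technique worth adopting should improve on this baseline for the metric you actually care about, which is usually something other than raw accuracy.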

As the domains of ML and artificial intelligence continue to grow, there will be an increasing demand for professionals who can adeptly handle various data challenges, including imbalance. This book aims to equip you with the knowledge and tools to be one of those sought-after experts.
