In this chapter, you will learn best practices for training classifiers on big data. The approach presented in the following pages is both scalable and generic, which makes it well suited to datasets with a huge number of observations. It also lets you cope with streaming datasets, that is, datasets whose observations are transmitted on the fly and are not all available at the same time. Furthermore, the classifier's predictive performance generally improves as more data is fed in during the training process.
In contrast to the classic approach seen so far in the book, batch learning, this new approach is called online learning. The core of online learning is the divide et impera (divide and conquer) principle: the data is split into mini-batches, and at each step one mini-batch serves as input to train and improve the classifier.
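To make the mini-batch idea concrete before going into detail, here is a minimal sketch in plain Python. It is a toy illustration under stated assumptions, not the implementation used later in the chapter: the data stream is synthetic, the classifier is a small logistic regression updated by gradient descent, and every name in it (`stream_minibatches`, `OnlineLogisticRegression`, `partial_fit`, `accuracy`) is hypothetical and chosen only for this example.

```python
import math
import random


def stream_minibatches(n_batches, batch_size, rng):
    """Yield mini-batches of (features, label) pairs, simulating a stream.

    Assumption: synthetic data where the label is 1 when x0 + x1 > 0.
    """
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
            y = 1 if x[0] + x[1] > 0 else 0
            batch.append((x, y))
        yield batch


class OnlineLogisticRegression:
    """Toy classifier that learns from one mini-batch at a time."""

    def __init__(self, n_features, learning_rate=0.5):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, x):
        return 1 if self.predict_proba(x) >= 0.5 else 0

    def partial_fit(self, batch):
        """Update the model in place with the average gradient of one batch."""
        grad_w = [0.0] * len(self.weights)
        grad_b = 0.0
        for x, y in batch:
            error = self.predict_proba(x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        n = len(batch)
        for j in range(len(self.weights)):
            self.weights[j] -= self.learning_rate * grad_w[j] / n
        self.bias -= self.learning_rate * grad_b / n


def accuracy(model, samples):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(model.predict(x) == y for x, y in samples) / len(samples)


rng = random.Random(0)  # fixed seed so the run is repeatable
model = OnlineLogisticRegression(n_features=2)

# Hold out one large batch for evaluation; it is never used for training.
test_set = next(stream_minibatches(1, 500, rng))
acc_before = accuracy(model, test_set)

# Only one mini-batch is held in memory at any time.
for batch in stream_minibatches(n_batches=200, batch_size=20, rng=rng):
    model.partial_fit(batch)

acc_after = accuracy(model, test_set)
print("accuracy before training:", acc_before)
print("accuracy after training:", acc_after)
```

The point of the sketch is the training loop: the model never sees the whole dataset at once, so memory use depends on the mini-batch size rather than on the total number of observations, and the same loop would work if the batches arrived from a live stream instead of a generator.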
In this chapter, we will first focus on batch learning and its limitations, and then introduce online learning. Finally...