
Ensemble Machine Learning Cookbook

By: Dipayan Sarkar, Vijayalakshmi Natarajan

Overview of this book

Ensemble modeling is an approach used to improve the performance of machine learning models. It combines two or more similar or dissimilar machine learning algorithms to deliver superior predictive power. This book will help you to implement popular machine learning algorithms to cover different paradigms of ensemble machine learning, such as boosting, bagging, and stacking. The Ensemble Machine Learning Cookbook will start by getting you acquainted with the basics of ensemble techniques and exploratory data analysis. You'll then learn to implement tasks related to statistical and machine learning algorithms to understand ensembles of multiple heterogeneous algorithms. It will also ensure that you don't miss out on key topics, such as resampling methods. As you progress, you'll get a better understanding of bagging, boosting, stacking, and working with the Random Forest algorithm using real-world examples. The book will highlight how these ensemble methods use multiple models to improve machine learning results, as compared to a single model. In the concluding chapters, you'll delve into advanced ensemble models using neural networks, natural language processing, and more. You'll also implement models for tasks such as fraud detection, text categorization, and sentiment analysis. By the end of this book, you'll be able to harness ensemble techniques and the working mechanisms of machine learning algorithms to build intelligent models using individual recipes.
Table of Contents (14 chapters)

Spam filtering using an ensemble of heterogeneous algorithms

We will use the SMS Spam Collection dataset from the UCI Machine Learning Repository to build a spam classifier. The classifier's job is to label each message as either spam or ham, and we can use a variety of classification algorithms for this task.

In this example, we opt for algorithms such as Naive Bayes, random forest, and support vector machines to train our models.
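A heterogeneous ensemble of this kind can be combined with a majority (or probability-averaged) vote. The sketch below uses scikit-learn's `VotingClassifier` with the three algorithm families named above; the synthetic data from `make_classification` is a stand-in for the SMS features the recipe builds later.

```python
# Sketch: a heterogeneous voting ensemble of Naive Bayes, random forest,
# and an SVM. Synthetic data stands in for the real SMS feature matrix.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=42)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ensemble = VotingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svm", SVC(probability=True, random_state=42)),
    ],
    voting="soft",  # average the predicted class probabilities
)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```

With `voting="soft"`, each model's predicted probabilities are averaged, which usually works better than a hard majority vote when all base models can output calibrated probabilities.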

We prepare our data using various data-cleaning and preparation mechanisms. To preprocess our data, we will perform the following sequence:

  1. Convert all text to lowercase
  2. Remove punctuation
  3. Remove stop words
  4. Perform stemming
  5. Tokenize the data
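The steps above can be sketched with the standard library alone. Note that the book itself uses NLTK for stemming and stop-word removal; the stop-word set and the suffix-stripping `stem` helper below are crude illustrative stand-ins.

```python
# Rough sketch of the preprocessing pipeline: lowercase, strip punctuation,
# drop stop words, stem, tokenize. Stop-word list and stemmer are toy
# stand-ins for NLTK's stopwords corpus and PorterStemmer.
import string

STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "you"}

def stem(word):
    # Crude suffix stripping; a real recipe would use nltk.stem.PorterStemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(message):
    message = message.lower()                       # 1. lowercase
    message = message.translate(                    # 2. remove punctuation
        str.maketrans("", "", string.punctuation))
    tokens = message.split()                        # split into words
    tokens = [w for w in tokens if w not in STOP_WORDS]  # 3. stop words
    return [stem(w) for w in tokens]                # 4-5. stem and tokenize

print(preprocess("WINNER!! You have won a FREE prize, claim now!"))
```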

We also weight our features using term frequency-inverse document frequency (TF-IDF), which scores a word by how often it appears in a message, offset by how common the word is across all messages. TF is calculated as:

TF = (No. of times a word appears in a document) / (Total no. of words in the document)