Book Image

The Supervised Learning Workshop - Second Edition

By : Blaine Bateman, Ashish Ranjan Jha, Benjamin Johnston, Ishita Mathur
Book Image

The Supervised Learning Workshop - Second Edition

By: Blaine Bateman, Ashish Ranjan Jha, Benjamin Johnston, Ishita Mathur

Overview of this book

Would you like to understand how and why machine learning techniques and data analytics are spearheading enterprises globally? From analyzing bioinformatics to predicting climate change, machine learning plays an increasingly pivotal role in our society. Although the real-world applications may seem complex, this book simplifies supervised learning for beginners with a step-by-step interactive approach. Working with real-time datasets, you’ll learn how supervised learning, when used with Python, can produce efficient predictive models. Starting with the fundamentals of supervised learning, you’ll quickly move to understand how to automate manual tasks and the process of assessing date using Jupyter and Python libraries like pandas. Next, you’ll use data exploration and visualization techniques to develop powerful supervised learning models, before understanding how to distinguish variables and represent their relationships using scatter plots, heatmaps, and box plots. After using regression and classification models on real-time datasets to predict future outcomes, you’ll grasp advanced ensemble techniques such as boosting and random forests. Finally, you’ll learn the importance of model evaluation in supervised learning and study metrics to evaluate regression and classification tasks. By the end of this book, you’ll have the skills you need to work on your real-life supervised learning Python projects.
Table of Contents (9 chapters)

Bagging

The term bagging is derived from a technique called bootstrap aggregation. In order to implement a successful predictive model, it's important to know in what situation we could benefit from using bootstrapping methods to build ensemble models. Such models are used extensively both in industry as well as academia.

One such application would be that these models can be used for the quality assessment of Wikipedia articles. Features such as article_length, number_of_references, number_of_headings, and number_of_images are used to build a classifier that classifies Wikipedia articles into low- or high-quality articles. Out of the several models that were tried for this task, the random forest model – a well-known bagging-based ensemble classifier that we will discuss in our next section – outperforms all other models such as SVM, logistic regression, and even neural networks, with the best precision and recall scores of 87.3% and 87.2%, respectively. This...