The Data Science Workshop - Second Edition

By: Anthony So, Thomas V. Joseph, Robert Thas John, Andrew Worsley, Dr. Samuel Asare
Overview of this book

Where there’s data, there’s insight. With so much data being generated, there is immense scope to extract meaningful information that’ll boost business productivity and profitability. By learning to convert raw data into game-changing insights, you’ll open new career paths and opportunities. The Data Science Workshop begins by introducing different types of projects and showing you how to incorporate machine learning algorithms in them. You’ll learn to select a relevant metric and even assess the performance of your model. To tune the hyperparameters of an algorithm and improve its accuracy, you’ll get hands-on with approaches such as grid search and random search. Next, you’ll learn dimensionality reduction techniques to easily handle many variables at once, before exploring how to use model ensembling techniques and create new features to enhance model performance. In a bid to help you automatically create new features that improve your model, the book demonstrates how to use the automated feature engineering tool. You’ll also understand how to use the orchestration and scheduling workflow to deploy machine learning models in batch. By the end of this book, you’ll have the skills to start working on data science projects confidently.
12. Feature Engineering

RandomForest Variable Importance

Chapter 4, Multiclass Classification with RandomForest, introduced you to a very powerful tree-based algorithm: RandomForest. It is one of the most popular algorithms in the industry, not only because it achieves very good predictive results but also because it provides several tools for interpreting its predictions, such as variable importance.
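As a minimal sketch (using the scikit-learn API with a synthetic dataset rather than the book's exercise data), a fitted RandomForestClassifier exposes variable importance through its feature_importances_ attribute:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Generate a small synthetic classification dataset (placeholder for the book's data)
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=42)

# Fit a RandomForest with a fixed random_state for reproducibility
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X, y)

# feature_importances_ holds one importance score per feature; the scores sum to 1
for index, importance in enumerate(rf_model.feature_importances_):
    print(f"Feature {index}: {importance:.4f}")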

Remember from Chapter 4, Multiclass Classification with RandomForest, that RandomForest builds multiple independent trees and then averages their results to make a final prediction. We also learned that it creates nodes in each tree by finding the best split that will clearly separate the observations into two groups. RandomForest uses different measures to find the best split. In sklearn, you can use either the Gini or entropy measure for classification tasks and MSE or MAE for regression. Without going into the details of each of them, these measures calculate the level of impurity of a given split...
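For illustration, here is a sketch of how the split measure is selected through the criterion parameter when instantiating the sklearn estimators. Note that recent scikit-learn versions renamed the regression criteria from "mse"/"mae" to "squared_error"/"absolute_error"; the names below assume a recent version.

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: Gini impurity (the default) or entropy
clf_gini = RandomForestClassifier(criterion="gini", random_state=42)
clf_entropy = RandomForestClassifier(criterion="entropy", random_state=42)

# Regression: MSE or MAE. In older scikit-learn versions these criteria were
# named "mse" and "mae"; newer versions use the names below.
reg_mse = RandomForestRegressor(criterion="squared_error", random_state=42)
reg_mae = RandomForestRegressor(criterion="absolute_error", random_state=42)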