Python Feature Engineering Cookbook - Second Edition

By: Soledad Galli

Overview of this book

Feature engineering, the process of transforming variables and creating features, is time-consuming, but it helps your machine learning models perform well. This second edition of Python Feature Engineering Cookbook will take the struggle out of feature engineering by showing you how to use open-source Python libraries to accelerate the process through practical, hands-on recipes. This updated edition begins by addressing fundamental data challenges such as missing data and categorical values, before moving on to strategies for dealing with skewed distributions and outliers. The concluding chapters show you how to develop new features from various types of data, including text, time series, and relational databases. With the help of numerous open-source Python libraries, you'll learn how to implement each feature engineering method in a performant, reproducible, and elegant manner. By the end of this book, you will have the tools and expertise needed to confidently build reproducible, end-to-end feature engineering pipelines that can be deployed into production.
Table of Contents (14 chapters)

Combining features with mathematical functions

New features can be created by combining existing variables with mathematical and statistical functions. At the beginning of this chapter, we mentioned that we can calculate the total debt by summing up the debt across individual financial products, as follows:

total debt = car loan debt + credit card debt + mortgage debt
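
The sum above can be sketched with pandas as a row-wise addition across the debt columns. This is only a minimal illustration: the DataFrame, its column names (`car_loan_debt`, `credit_card_debt`, `mortgage_debt`), and the values are hypothetical and not taken from the book's dataset.

```python
import pandas as pd

# Hypothetical customer data; column names and values are illustrative only.
df = pd.DataFrame({
    "car_loan_debt": [5000, 0, 12000],
    "credit_card_debt": [1200, 300, 0],
    "mortgage_debt": [150000, 0, 80000],
})

debt_cols = ["car_loan_debt", "credit_card_debt", "mortgage_debt"]

# Row-wise sum (axis=1) across the debt columns: one total per customer.
df["total_debt"] = df[debt_cols].sum(axis=1)

print(df["total_debt"].tolist())  # [156200, 300, 92000]
```

Selecting the columns explicitly through `debt_cols` keeps the new `total_debt` column from being included if the sum is recomputed later.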

We can also derive other insightful features using alternative statistical operations. For example, we can determine the maximum debt of a customer across financial products or the average time users have spent on a web page:

maximum debt = max(car loan balance, credit card balance, mortgage balance)

average time on page = mean(time spent user 1, time spent user 2, time spent user 3)
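
These two examples aggregate in different directions: the maximum debt combines several columns of the same customer, whereas the average time on page combines the values of several users. A rough pandas sketch of both follows, assuming hypothetical tables and column names (`debts`, `visits`, `seconds_on_page`, and so on) invented for illustration.

```python
import pandas as pd

# Hypothetical balances per customer (one row per customer).
debts = pd.DataFrame({
    "car_loan_balance": [5000, 0, 12000],
    "credit_card_balance": [1200, 300, 0],
    "mortgage_balance": [150000, 0, 80000],
})

# Row-wise maximum across financial products: one value per customer.
debts["maximum_debt"] = debts.max(axis=1)

# Hypothetical page visits (one row per user and page).
visits = pd.DataFrame({
    "page": ["home", "home", "home", "pricing", "pricing"],
    "user": [1, 2, 3, 1, 2],
    "seconds_on_page": [30.0, 45.0, 60.0, 20.0, 40.0],
})

# Mean across users, computed separately for each page.
avg_time_on_page = visits.groupby("page")["seconds_on_page"].mean()

print(debts["maximum_debt"].tolist())  # [150000, 300, 80000]
print(avg_time_on_page.to_dict())      # {'home': 45.0, 'pricing': 30.0}
```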

In principle, we can use any mathematical or statistical operation to create new features, for example, the sum, product, mean, standard deviation, maximum, or minimum. In this recipe, we will implement these...
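
As a rough illustration of applying several such operations at once, the sketch below uses pandas' `DataFrame.agg` with a list of function names and `axis="columns"`. The variables `x1`, `x2`, and `x3` and their values are hypothetical, and this is one possible approach under those assumptions, not necessarily the one the recipe follows.

```python
import pandas as pd

# Hypothetical numerical variables to combine; names and values are made up.
data = pd.DataFrame({
    "x1": [1.0, 2.0],
    "x2": [3.0, 4.0],
    "x3": [5.0, 6.0],
})

features = ["x1", "x2", "x3"]
operations = ["sum", "prod", "mean", "std", "max", "min"]

# Apply every operation across the selected columns of each row.
# The result has one row per observation and one column per operation.
new_features = data[features].agg(operations, axis="columns")

# Attach the new features to the original data with a descriptive prefix.
data = data.join(new_features.add_prefix("x_"))

print(data.columns.tolist())
```

Note that `std` here is the sample standard deviation (pandas uses `ddof=1` by default).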