
Python Feature Engineering Cookbook - Second Edition

By: Soledad Galli

Overview of this book

Feature engineering, the process of transforming variables and creating features, albeit time-consuming, ensures that your machine learning models perform seamlessly. This second edition of Python Feature Engineering Cookbook will take the struggle out of feature engineering by showing you how to use open source Python libraries to accelerate the process via a plethora of practical, hands-on recipes. This updated edition begins by addressing fundamental data challenges such as missing data and categorical values, before moving on to strategies for dealing with skewed distributions and outliers. The concluding chapters show you how to develop new features from various types of data, including text, time series, and relational databases. With the help of numerous open source Python libraries, you'll learn how to implement each feature engineering method in a performant, reproducible, and elegant manner. By the end of this Python book, you will have the tools and expertise needed to confidently build end-to-end and reproducible feature engineering pipelines that can be deployed into production.
Table of Contents (14 chapters)

What this book covers

Chapter 1, Imputing Missing Data, discusses various techniques to fill in missing values with estimates of missing data that are suitable for numerical and categorical features.
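As a flavor of what this chapter covers, here is a minimal pandas sketch of two common imputation strategies (the data and column names are illustrative, not from the book):

```python
import pandas as pd
import numpy as np

# Toy data with missing values in a numerical and a categorical column.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["London", "Paris", None, "Paris"],
})

# Numerical feature: replace missing values with the mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical feature: replace missing values with the most frequent category.
df["city"] = df["city"].fillna(df["city"].mode()[0])
```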

Chapter 2, Encoding Categorical Variables, introduces various widely used techniques to transform categorical variables into numbers. It starts by describing commonly used methods such as one-hot and ordinal encoding, then it moves on to domain-specific methods such as weight of evidence, and finally, it shows you how to encode variables with high cardinality.
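One-hot encoding, the first method the chapter describes, can be sketched with pandas (the `colour` column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue", "red"]})

# One-hot encoding: one binary indicator column per category.
dummies = pd.get_dummies(df["colour"], prefix="colour")
```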

Chapter 3, Transforming Numerical Variables, explains when we need to transform variables for use in machine learning models and then discusses common transformations and their suitability, based on variable characteristics.
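A typical transformation of this kind is the logarithm, which compresses a right-skewed variable; a minimal sketch with an invented `income` column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [10, 100, 1000, 10000]})

# The log transformation compresses a right-skewed variable
# so that large values no longer dominate the scale.
df["income_log"] = np.log10(df["income"])
```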

Chapter 4, Performing Variable Discretization, introduces discretization and when it is useful, and then moves on to describe various discretization methods and their advantages and limitations. It covers the basic equal-width and equal-frequency discretization procedures, as well as discretization using decision trees and k-means.
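Equal-width discretization, the simplest of these procedures, splits the variable's range into intervals of equal size; a pandas sketch (the values and labels are illustrative):

```python
import pandas as pd

values = pd.Series([1, 7, 5, 4, 6, 3])

# Equal-width discretization: split the range (1 to 7) into 3 bins of equal width.
bins = pd.cut(values, bins=3, labels=["low", "mid", "high"])
```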

Chapter 5, Working with Outliers, shows commonly used methods to handle outliers in your variables. You will learn how to detect outliers, how to cap variables at a given arbitrary value, and how to remove outliers altogether.
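One common capping approach uses the interquartile-range (IQR) proximity rule; here is a minimal sketch with made-up data:

```python
import pandas as pd

s = pd.Series([2, 3, 3, 4, 100])

# IQR proximity rule: values beyond 1.5 * IQR from the quartiles are outliers.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap the variable at the fences instead of dropping rows.
capped = s.clip(lower=lower, upper=upper)
```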

Chapter 6, Extracting Features from Date and Time, describes how to create features from dates and time variables. It covers how to extract date and time components from features, as well as how to combine datetime variables and how to work with different time zones.
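Extracting date and time components is straightforward with the pandas `.dt` accessor; a sketch with invented timestamps:

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-03-15 09:30", "2021-12-01 18:45"]),
})

# Pull out individual components as new features.
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["dayofweek"] = df["timestamp"].dt.dayofweek  # Monday = 0
df["hour"] = df["timestamp"].dt.hour
```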

Chapter 7, Performing Feature Scaling, covers methods to put the variables on a similar scale. It discusses standardization, how to scale to maximum and minimum values, and how to perform more robust forms of variable scaling.
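Two of the scaling methods mentioned here can be written out directly (a sketch on made-up values):

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0])

# Standardization: subtract the mean and divide by the standard deviation.
standardized = (s - s.mean()) / s.std()

# Min-max scaling: map the variable onto the [0, 1] interval.
minmax = (s - s.min()) / (s.max() - s.min())
```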

Chapter 8, Creating New Features, describes multiple methods with which we can combine existing variables to create new features. It shows the use of mathematical operations and also decision trees to create variables from two or more existing features.
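As an example of combining variables with mathematical operations, a ratio of two existing features can itself become a feature (the columns here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"height_m": [1.6, 1.8], "weight_kg": [64.0, 81.0]})

# A new feature built from a ratio of two existing variables,
# here the body mass index: weight / height squared.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
```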

Chapter 9, Extracting Features from Relational Data with Featuretools, introduces relational datasets and then moves on to explain how we can create features at different data aggregation levels, utilizing Featuretools. You will learn how to automatically create dozens of features from numerical and categorical variables, datetime, and text.
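The kind of aggregation Featuretools automates can be illustrated with plain pandas; this is not the Featuretools API, just the underlying idea on a hypothetical transactions table:

```python
import pandas as pd

# A transactions table with many rows per customer.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 30.0, 5.0, 5.0, 20.0],
})

# Aggregate to the customer level: one row per customer, several new features.
customer_feats = (
    transactions.groupby("customer_id")["amount"].agg(["mean", "max", "count"])
)
```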

Chapter 10, Creating Features from Time Series with tsfresh, discusses how to automatically create several hundreds of features from time series data, for use in supervised classification or regression. You will learn how to automatically create and select relevant features from your time series with tsfresh.
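The statistics tsfresh computes at scale are of the kind sketched below with pandas; again, this shows the idea, not the tsfresh API (the long-format layout and column names are assumptions):

```python
import pandas as pd

# Long-format time series: one reading per row, identified by a series id.
ts = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 10.0, 10.0, 10.0],
})

# Summary features per series: mean, spread, minimum, and maximum.
feats = ts.groupby("id")["value"].agg(["mean", "std", "min", "max"])
```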

Chapter 11, Extracting Features from Text Variables, covers simple methods to clean and extract value from short pieces of text. You will learn how to count words, sentences, characters, and lexical diversity. You will discover how to clean your text pieces and how to create feature matrices by counting words.
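Word and character counts, the simplest of these text features, can be computed with pandas string methods (the review texts are invented):

```python
import pandas as pd

reviews = pd.Series([
    "Great product, works well",
    "Terrible. Broke after one day of use",
])

# Simple counts as features: words and characters per text piece.
word_counts = reviews.str.split().str.len()
char_counts = reviews.str.len()
```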