Chapter 4: Performing Variable Discretization | Python Feature Engineering Cookbook

Python Feature Engineering Cookbook - Third Edition

By: Galli

Overview of this book

This third edition of the Python Feature Engineering Cookbook helps you streamline data preprocessing and feature engineering in your machine learning projects. It addresses common challenges, such as imputing missing values and encoding categorical variables, using practical solutions and open source Python libraries. You’ll learn advanced techniques for transforming numerical variables, discretizing variables, and dealing with outliers. Each chapter offers step-by-step instructions and real-world examples that help you understand when and how to apply each transformation. The book also explores feature extraction from complex data types such as dates, times, and text. You’ll see how to create new features through mathematical operations and decision trees, and how to use tools such as Featuretools and tsfresh to extract features from relational data and time series. By the end, you’ll be ready to build reproducible feature engineering pipelines that can be deployed into production, improving both your preprocessing workflows and the performance of your machine learning models.

Preface

Who this book is for

What this book covers

To get the most out of this book

Conventions used

Sections

Get in touch

Share Your Thoughts

Download a free PDF copy of this book

Chapter 1: Imputing Missing Data

Technical requirements

Removing observations with missing data

Performing mean or median imputation

Imputing categorical variables

Replacing missing values with an arbitrary number

Finding extreme values for imputation

Marking imputed values

Implementing forward and backward fill

Carrying out interpolation

Performing multivariate imputation by chained equations

Estimating missing data with nearest neighbors

Chapter 2: Encoding Categorical Variables

Technical requirements

Creating binary variables through one-hot encoding

Performing one-hot encoding of frequent categories

Replacing categories with counts or the frequency of observations

Replacing categories with ordinal numbers

Performing ordinal encoding based on the target value

Implementing target mean encoding

Encoding with Weight of Evidence

Grouping rare or infrequent categories

Performing binary encoding

Chapter 3: Transforming Numerical Variables

Transforming variables with the logarithm function

Transforming variables with the reciprocal function

Using the square root to transform variables

Using power transformations

Performing Box-Cox transformations

Performing Yeo-Johnson transformations

Chapter 4: Performing Variable Discretization

Technical requirements

Performing equal-width discretization

Implementing equal-frequency discretization

Discretizing the variable into arbitrary intervals

Performing discretization with k-means clustering

Implementing feature binarization

Using decision trees for discretization

Chapter 5: Working with Outliers

Technical requirements

Visualizing outliers with boxplots and the inter-quartile proximity rule

Finding outliers using the mean and standard deviation

Using the median absolute deviation to find outliers

Removing outliers

Bringing outliers back within acceptable limits

Applying winsorization

Chapter 6: Extracting Features from Date and Time Variables

Technical requirements

Extracting features from dates with pandas

Extracting features from time with pandas

Capturing the elapsed time between datetime variables

Working with time in different time zones

Automating the datetime feature extraction with Feature-engine

Chapter 7: Performing Feature Scaling

Technical requirements

Standardizing the features

Scaling to the maximum and minimum values

Scaling with the median and quantiles

Performing mean normalization

Implementing maximum absolute scaling

Scaling to vector unit length

Chapter 8: Creating New Features

Technical requirements

Combining features with mathematical functions

Comparing features to reference variables

Performing polynomial expansion

Combining features with decision trees

Creating periodic features from cyclical variables

Creating spline features

Chapter 9: Extracting Features from Relational Data with Featuretools

Technical requirements

Setting up an entity set and creating features automatically

Creating features with general and cumulative operations

Combining numerical features

Extracting features from date and time

Extracting features from text

Creating features with aggregation primitives

Chapter 10: Creating Features from a Time Series with tsfresh

Technical requirements

Extracting hundreds of features automatically from a time series

Automatically creating and selecting predictive features from time-series data

Extracting different features from different time series

Creating a subset of features identified through feature selection

Embedding feature creation into a scikit-learn pipeline

Chapter 11: Extracting Features from Text Variables

Technical requirements

Counting characters, words, and vocabulary

Estimating text complexity by counting sentences

Creating features with bag-of-words and n-grams

Implementing term frequency-inverse document frequency

Cleaning and stemming text variables

Index

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts

Download a free PDF copy of this book

Implementing feature binarization
