Python Feature Engineering Cookbook - Second Edition

By: Soledad Galli
Grouping rare or infrequent categories

Rare categories are those present in only a small fraction of the observations. There is no strict rule for what counts as a small fraction, but typically, any category present in less than 5% of the observations can be considered rare.

Infrequent labels often appear only in the train set or only in the test set, making algorithms prone to overfitting or unable to score an observation. In addition, when encoding categories to numbers, we only create mappings for the categories observed in the train set, so we won't know how to encode new labels. To avoid these complications, we can group infrequent categories into a single category called Rare or Other.

In this recipe, we will group infrequent categories using pandas and Feature-engine.

How to do it...

First, let’s import the necessary Python libraries and get the dataset ready:

  1. Import the necessary Python libraries, functions, and classes:
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from feature_engine.encoding import RareLabelEncoder
  2. Let’s load the dataset and divide it into train and test sets:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  3. Let’s capture the fraction of observations per category in A7 in a variable:
    freqs = X_train["A7"].value_counts(normalize=True)

After executing print(freqs), we can see the fraction of observations per category of A7, expressed as decimals:

v	0.573499
h	0.209110
ff	0.084886
bb	0.080745
z	0.014493
dd	0.010352
j	0.010352
Missing	0.008282
n	0.006211
o	0.002070
Name: A7, dtype: float64

If we consider those labels present in less than 5% of the observations as rare, then z, dd, j, Missing, n, and o are rare categories.
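
To double-check, we can extract the rare categories directly from the freqs series computed in step 3; here's a minimal sketch:

    # categories present in 5% or less of the observations
    rare_cat = [x for x in freqs.loc[freqs <= 0.05].index]

Executing print(rare_cat) returns ['z', 'dd', 'j', 'Missing', 'n', 'o'].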

  4. Let’s create a list containing the names of the categories present in more than 5% of the observations:
    frequent_cat = [
        x for x in freqs.loc[freqs > 0.05].index.values]

If we execute print(frequent_cat), we will see the frequent categories of A7:

['v', 'h', 'ff', 'bb']
  5. Let’s replace rare labels – that is, those present in <= 5% of the observations – with the "Rare" string:
    X_train["A7"] = np.where(
        X_train["A7"].isin(frequent_cat),
        X_train["A7"], "Rare"
    )
    X_test["A7"] = np.where(
        X_test["A7"].isin(frequent_cat),
        X_test["A7"], "Rare"
    )
  6. Let’s determine the percentage of observations in the encoded variable:
    X_train["A7"].value_counts(normalize=True)

We can see that the infrequent labels have now been re-grouped into the Rare category:

v       0.573499
h       0.209110
ff      0.084886
bb      0.080745
Rare    0.051760
Name: A7, dtype: float64

Now, let’s group rare labels using Feature-engine. First, we must divide the dataset into train and test sets, as we did in step 2.
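
Since we modified A7 in place in step 5, we can rerun the split from step 2 to start from unmodified data:

    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )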

  7. Let’s create a rare label encoder that groups categories present in less than 5% of the observations, provided that the categorical variable has more than four distinct values:
    rare_encoder = RareLabelEncoder(tol=0.05, n_categories=4)
  8. Let’s fit the encoder so that it finds the categorical variables and then learns their most frequent categories:
    rare_encoder.fit(X_train)

Tip

Upon fitting, the transformer will raise warnings indicating that many categorical variables have fewer than four categories, so their values will not be grouped. The transformer is just letting you know that this is happening.
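
If these warnings clutter your output, one way to silence them during fitting is Python's standard warnings module; a quick sketch:

    import warnings

    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        rare_encoder.fit(X_train)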

We can display the frequent categories per variable by executing rare_encoder.encoder_dict_, as well as the variables that will be encoded by executing rare_encoder.variables_.
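
For example, the following sketch prints the variables that will be encoded and the frequent categories learned for A7 (the exact values depend on your train/test split):

    # variables the encoder will transform
    print(rare_encoder.variables_)

    # frequent categories learned for A7; all other values map to "Rare"
    print(rare_encoder.encoder_dict_["A7"])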

  9. Finally, let’s group rare labels in the train and test sets:
    X_train_enc = rare_encoder.transform(X_train)
    X_test_enc = rare_encoder.transform(X_test)
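
As a quick sanity check, we can inspect the encoded variable again, as we did in step 6:

    # infrequent labels should now appear as "Rare"
    X_train_enc["A7"].value_counts(normalize=True)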

Now that we have grouped rare labels, we are ready to encode the categorical variables, as we’ve done in other recipes in this chapter.

How it works...

In this recipe, we grouped infrequent categories using pandas and Feature-engine.

We determined the fraction of observations per category of the A7 variable using pandas’ value_counts(), setting the normalize parameter to True. Using a list comprehension, we captured the names of the categories present in more than 5% of the observations. Finally, using NumPy’s where(), we searched each row of A7 and checked whether its value was one of the frequent categories in the list with pandas’ isin() method; if so, the value was kept; otherwise, it was replaced with "Rare".
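
The same pandas logic extends naturally to several variables at once; here's a minimal sketch, where cat_vars is a hypothetical list of categorical column names:

    # hypothetical list of categorical columns to group
    cat_vars = ["A6", "A7"]

    for var in cat_vars:
        # learn the frequent categories from the train set only
        freqs = X_train[var].value_counts(normalize=True)
        frequent_cat = [x for x in freqs.loc[freqs > 0.05].index]
        # replace everything else with "Rare" in both sets
        X_train[var] = np.where(
            X_train[var].isin(frequent_cat), X_train[var], "Rare")
        X_test[var] = np.where(
            X_test[var].isin(frequent_cat), X_test[var], "Rare")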

We automated the preceding steps for multiple categorical variables using Feature-engine’s RareLabelEncoder(). By setting tol to 0.05, we retained categories present in more than 5% of the observations. By setting n_categories to 4, we only grouped rare categories in variables with more than four unique values. With the fit() method, the transformer identified the categorical variables and then learned and stored their frequent categories. With the transform() method, the transformer replaced infrequent categories with the "Rare" string.
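
As a final sketch, RareLabelEncoder() also accepts a variables parameter to restrict the grouping to specific columns, and a replace_with parameter to change the replacement string; restricting the encoder to A7 here is just an illustration:

    # group rare labels only in A7, replacing them with "Other"
    rare_encoder = RareLabelEncoder(
        tol=0.05,
        n_categories=4,
        variables=["A7"],
        replace_with="Other",
    )
    X_train_enc = rare_encoder.fit_transform(X_train)
    X_test_enc = rare_encoder.transform(X_test)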