Python Feature Engineering Cookbook - Second Edition

By: Soledad Galli
Grouping rare or infrequent categories

Rare categories are those present in only a small fraction of the observations. There is no strict rule for what counts as a small fraction, but typically, any category present in less than 5% of the observations can be considered rare.

Infrequent labels often appear only in the train set or only in the test set, making algorithms prone to overfitting or unable to score an observation. In addition, when encoding categories to numbers, we only create mappings for the categories observed in the train set, so we won't know how to encode new labels. To avoid these complications, we can group infrequent categories into a single category called Rare or Other.

In this recipe, we will group infrequent categories using pandas and Feature-engine.

How to do it...

First, let’s import the necessary Python libraries and get the dataset ready:

  1. Import the necessary Python libraries, functions, and classes:
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from feature_engine.encoding import RareLabelEncoder
  2. Let’s load the dataset and divide it into train and test sets:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  3. Let’s capture the fraction of observations per category in A7 in a variable:
    freqs = X_train["A7"].value_counts(normalize=True)

After executing print(freqs), we can see the fraction of observations per category of A7, expressed as decimals:

v	0.573499
h	0.209110
ff	0.084886
bb	0.080745
z	0.014493
dd	0.010352
j	0.010352
Missing	0.008282
n	0.006211
o	0.002070
Name: A7, dtype: float64

If we consider those labels present in less than 5% of the observations as rare, then z, dd, j, Missing, n, and o are rare categories.
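
To double-check, we can extract the rare categories directly from the freqs series computed in step 3; here's a minimal sketch:

    # categories present in 5% or less of the observations
    rare_cat = [x for x in freqs.loc[freqs <= 0.05].index]

Executing print(rare_cat) returns ['z', 'dd', 'j', 'Missing', 'n', 'o'].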

  4. Let’s create a list containing the names of the categories present in more than 5% of the observations:
    frequent_cat = [
        x for x in freqs.loc[freqs > 0.05].index.values]

If we execute print(frequent_cat), we will see the frequent categories of A7:

['v', 'h', 'ff', 'bb']
  5. Let’s replace rare labels – that is, those present in <= 5% of the observations – with the "Rare" string:
    X_train["A7"] = np.where(
        X_train["A7"].isin(frequent_cat),
        X_train["A7"], "Rare"
    )
    X_test["A7"] = np.where(
        X_test["A7"].isin(frequent_cat),
        X_test["A7"], "Rare"
    )
  6. Let’s determine the percentage of observations in the encoded variable:
    X_train["A7"].value_counts(normalize=True)

We can see that the infrequent labels have now been re-grouped into the Rare category:

v       0.573499
h       0.209110
ff      0.084886
bb      0.080745
Rare    0.051760
Name: A7, dtype: float64

Now, let’s group rare labels using Feature-engine. First, we must divide the dataset into train and test sets, as we did in step 2.
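
Since we modified A7 in place in step 5, we can rerun the split from step 2 to start from unmodified data:

    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )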

  7. Let’s create a rare label encoder that groups categories present in less than 5% of the observations, provided that the categorical variable has more than four distinct values:
    rare_encoder = RareLabelEncoder(tol=0.05, n_categories=4)
  8. Let’s fit the encoder so that it finds the categorical variables and then learns their most frequent categories:
    rare_encoder.fit(X_train)

Tip

Upon fitting, the transformer will raise warnings indicating that many categorical variables have fewer than four categories, so their values will not be grouped. The transformer is just letting you know that this is happening.
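
If these warnings clutter your output, one way to silence them during fitting is Python's standard warnings module; a quick sketch:

    import warnings

    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        rare_encoder.fit(X_train)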

We can display the frequent categories per variable by executing rare_encoder.encoder_dict_, as well as the variables that will be encoded by executing rare_encoder.variables_.
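
For example, the following sketch prints the variables that will be encoded and the frequent categories learned for A7 (the exact values depend on your train/test split):

    # variables the encoder will transform
    print(rare_encoder.variables_)

    # frequent categories learned for A7; all other values map to "Rare"
    print(rare_encoder.encoder_dict_["A7"])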

  9. Finally, let’s group rare labels in the train and test sets:
    X_train_enc = rare_encoder.transform(X_train)
    X_test_enc = rare_encoder.transform(X_test)
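
As a quick sanity check, we can inspect the encoded variable again, as we did in step 6:

    # infrequent labels should now appear as "Rare"
    X_train_enc["A7"].value_counts(normalize=True)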

Now that we have grouped rare labels, we are ready to encode the categorical variables, as we’ve done in other recipes in this chapter.

How it works...

In this recipe, we grouped infrequent categories using pandas and Feature-engine.

We determined the fraction of observations per category of the A7 variable using pandas’ value_counts(), setting the normalize parameter to True. Using a list comprehension, we captured the names of the categories present in more than 5% of the observations. Finally, using NumPy’s where(), we searched each row of A7 and checked whether its value was one of the frequent categories in the list with pandas’ isin() method; if so, the value was kept; otherwise, it was replaced with "Rare".
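
The same pandas logic extends naturally to several variables at once; here's a minimal sketch, where cat_vars is a hypothetical list of categorical column names:

    # hypothetical list of categorical columns to group
    cat_vars = ["A6", "A7"]

    for var in cat_vars:
        # learn the frequent categories from the train set only
        freqs = X_train[var].value_counts(normalize=True)
        frequent_cat = [x for x in freqs.loc[freqs > 0.05].index]
        # replace everything else with "Rare" in both sets
        X_train[var] = np.where(
            X_train[var].isin(frequent_cat), X_train[var], "Rare")
        X_test[var] = np.where(
            X_test[var].isin(frequent_cat), X_test[var], "Rare")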

We automated the preceding steps for multiple categorical variables using Feature-engine’s RareLabelEncoder(). By setting tol to 0.05, we retained categories present in more than 5% of the observations. By setting n_categories to 4, we only grouped rare categories in variables with more than four unique values. With the fit() method, the transformer identified the categorical variables and then learned and stored their frequent categories. With the transform() method, the transformer replaced infrequent categories with the "Rare" string.
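
As a final sketch, RareLabelEncoder() also accepts a variables parameter to restrict the grouping to specific columns, and a replace_with parameter to change the replacement string; restricting the encoder to A7 here is just an illustration:

    # group rare labels only in A7, replacing them with "Other"
    rare_encoder = RareLabelEncoder(
        tol=0.05,
        n_categories=4,
        variables=["A7"],
        replace_with="Other",
    )
    X_train_enc = rare_encoder.fit_transform(X_train)
    X_test_enc = rare_encoder.transform(X_test)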