Python Feature Engineering Cookbook - Second Edition

By: Soledad Galli

Overview of this book

Feature engineering, the process of transforming variables and creating features, albeit time-consuming, ensures that your machine learning models perform seamlessly. This second edition of Python Feature Engineering Cookbook will take the struggle out of feature engineering by showing you how to use open source Python libraries to accelerate the process via a plethora of practical, hands-on recipes. This updated edition begins by addressing fundamental data challenges such as missing data and categorical values, before moving on to strategies for dealing with skewed distributions and outliers. The concluding chapters show you how to develop new features from various types of data, including text, time series, and relational databases. With the help of numerous open source Python libraries, you'll learn how to implement each feature engineering method in a performant, reproducible, and elegant manner. By the end of this Python book, you will have the tools and expertise needed to confidently build end-to-end and reproducible feature engineering pipelines that can be deployed into production.

Performing one-hot encoding of frequent categories

One-hot encoding represents each category of a categorical variable with a binary variable. Hence, one-hot encoding of variables with high cardinality, or of datasets with multiple categorical features, can expand the feature space dramatically. This, in turn, may increase the computational cost of using machine learning models or deteriorate their performance. To reduce the number of binary variables, we can perform one-hot encoding of the most frequent categories only. One-hot encoding the top categories is equivalent to treating the remaining, less frequent categories as a single, unique category.
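
To get a sense of the expansion, the following sketch (using a hypothetical, skewed 50-category column, not the recipe's dataset) compares the number of columns produced by full one-hot encoding against encoding only the top 5 categories:

    import numpy as np
    import pandas as pd

    # Hypothetical column with 50 distinct categories and skewed frequencies
    rng = np.random.default_rng(0)
    probs = np.linspace(10, 1, 50)
    cities = rng.choice(
        [f"c{i}" for i in range(50)], size=500, p=probs / probs.sum())
    df = pd.DataFrame({"city": cities})

    # Full one-hot encoding adds one binary column per observed category
    print(pd.get_dummies(df["city"]).shape[1])  # up to 50 columns

    # Encoding only the 5 most frequent categories caps the expansion
    top_5 = df["city"].value_counts().head(5).index
    print(len(top_5))  # 5 columns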

In this recipe, we will implement one-hot encoding of the most popular categories using pandas and Feature-engine.

How to do it...

First, let’s import the necessary Python libraries and get the dataset ready:

  1. Import the required Python libraries, functions, and classes:
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from feature_engine.encoding import OneHotEncoder
  2. Let’s load the dataset and divide it into train and test sets:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(
            labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )

Tip

The most frequent categories need to be determined in the train set. This is to avoid data leakage.
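
The following sketch contrasts the leaky and the correct way of determining the top categories (the top_5 list is built properly in step 5):

    # WRONG: frequencies computed on the full dataset leak test set information
    # top_5 = data["A6"].value_counts().head(5).index.tolist()

    # RIGHT: learn the frequent categories from the training split only,
    # then reuse the same list to encode both X_train and X_test
    top_5 = X_train["A6"].value_counts().head(5).index.tolist()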

  3. Let’s inspect the unique categories of the A6 variable:
    X_train["A6"].unique()

The unique values of A6 are displayed in the following output:

array(['c', 'q', 'w', 'ff', 'm', 'i', 'e', 'cc', 'x', 'd', 'k', 'j', 'Missing', 'aa', 'r'], dtype=object)
  4. Let’s count the number of observations per category of A6, sort them in decreasing order, and then display the five most frequent categories:
    X_train["A6"].value_counts().sort_values(
        ascending=False).head(5)

We can see the five most frequent categories and the number of observations per category in the following output:

c     93
q     56
w     48
i     41
ff    38
Name: A6, dtype: int64
  5. Now, let’s capture the most frequent categories of A6 in a list by using the code in step 4 inside a list comprehension:
    top_5 = [
        x for x in X_train["A6"].value_counts().sort_values(
            ascending=False).head(5).index
    ]
  6. Now, let’s add a binary variable per top category to the train and test sets:
    for label in top_5:
        X_train[f"A6_{label}"] = np.where(
            X_train["A6"] == label, 1, 0)
        X_test[f"A6_{label}"] = np.where(
            X_test["A6"] == label, 1, 0)
  7. Let’s display the first 10 rows of the original A6 variable and its encoded counterparts in the train set:
    X_train[["A6"] + [f"A6_{label}" for label in top_5]].head(10)

In the output of step 7, we can see the A6 variable, followed by the binary variables:

     A6  A6_c  A6_q  A6_w  A6_i  A6_ff
596   c     1     0     0     0      0
303   q     0     1     0     0      0
204   w     0     0     1     0      0
351  ff     0     0     0     0      1
118   m     0     0     0     0      0
247   q     0     1     0     0      0
652   i     0     0     0     1      0
513   e     0     0     0     0      0
230  cc     0     0     0     0      0
250   e     0     0     0     0      0
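
As a side note, steps 5 and 6 can be written more compactly: value_counts() already sorts counts in descending order, so the sort_values() call is redundant, and a boolean comparison cast to int replaces np.where(). A sketch of an equivalent variant:

    # value_counts() sorts in descending order by default
    top_5 = X_train["A6"].value_counts().head(5).index.tolist()

    # A boolean comparison cast to int yields the same binary columns
    for label in top_5:
        X_train[f"A6_{label}"] = (X_train["A6"] == label).astype(int)
        X_test[f"A6_{label}"] = (X_test["A6"] == label).astype(int)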

We can automate one-hot encoding of frequent categories with Feature-engine. First, let’s load and divide the dataset, as we did in step 2.

  8. Let’s set up the one-hot encoder to encode the five most frequent categories of the A6 and A7 variables:
    ohe_enc = OneHotEncoder(
        top_categories=5,
        variables=["A6", "A7"]
    )

Tip

Feature-engine’s OneHotEncoder() will encode all categorical variables in the dataset by default unless we specify the variables to encode, as we did in step 8.
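
For instance, in the following sketch (hypothetical, not part of this recipe), the encoder picks up every categorical variable automatically; after fitting, the selected columns should be stored in the variables_ attribute:

    # Without the variables argument, all object or categorical
    # columns in the DataFrame are encoded
    ohe_all = OneHotEncoder(top_categories=5)
    ohe_all.fit(X_train)
    print(ohe_all.variables_)  # the columns the encoder selected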

  9. Let’s fit the encoder to the train set so that it learns and stores the most frequent categories of A6 and A7:
    ohe_enc.fit(X_train)

Note

The number of frequent categories to encode is arbitrarily determined by the user.

  10. Finally, let’s encode A6 and A7 in the train and test sets:
    X_train_enc = ohe_enc.transform(X_train)
    X_test_enc = ohe_enc.transform(X_test)

You can view the new binary variables in the DataFrame by executing X_train_enc.head(). You can also find the top five categories learned by the encoder by executing ohe_enc.encoder_dict_.

Note

Feature-engine replaces the original variable with the binary ones returned by one-hot encoding, leaving the dataset ready to use in machine learning.
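
Because Feature-engine transformers follow the scikit-learn fit/transform API, the encoder can also be placed inside a scikit-learn Pipeline. Here is a minimal sketch; the LogisticRegression model is an illustrative choice, and it assumes the remaining variables are numeric and free of missing values:

    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    # Encoding and modeling combined in a single object
    pipe = Pipeline([
        ("ohe", OneHotEncoder(top_categories=5, variables=["A6", "A7"])),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)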

How it works...

In this recipe, we performed one-hot encoding of the five most popular categories using pandas, NumPy, and Feature-engine.

In the first part of this recipe, we worked with the A6 categorical variable. We inspected its unique categories with pandas unique(). Next, we counted the number of observations per category using pandas value_counts(), which returned a pandas Series with the categories as the index and the number of observations as values. We then sorted the categories from most to fewest observations with pandas sort_values(), and reduced the Series to the five most popular categories with pandas head(). Then, we used this Series in a list comprehension to capture the names of the most frequent categories. After that, we looped over each category and, with NumPy’s where() function, created binary variables that take the value 1 if the observation shows the category, or 0 otherwise.

To perform one-hot encoding of the five most popular categories of the A6 and A7 variables with Feature-engine, we used OneHotEncoder(), setting the top_categories argument to 5 and passing the variable names in a list to the variables argument. With fit(), the encoder learned the top categories from the train set and stored them in its encoder_dict_ attribute. Then, with transform(), OneHotEncoder() replaced the original variables with the set of binary ones.
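
To verify what the encoder learned, we can inspect its encoder_dict_ attribute; for A6, it should hold the five categories we found in step 4 (A7’s entry is elided here):

    print(ohe_enc.encoder_dict_)
    # {'A6': ['c', 'q', 'w', 'i', 'ff'], 'A7': [...]}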

There’s more...

This recipe is based on the winning solution of the KDD 2009 cup, Winning the KDD Cup Orange Challenge with Ensemble Selection (http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf), where the authors limited one-hot encoding to the 10 most frequent categories of each variable.
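
As a related alternative, scikit-learn’s own OneHotEncoder can group infrequent categories (from version 1.1 onward) through its max_categories and handle_unknown parameters. Below is a sketch, assuming scikit-learn >= 1.2 for the sparse_output argument; note that the limit includes the extra column that aggregates the infrequent categories:

    from sklearn.preprocessing import OneHotEncoder as SkOneHotEncoder

    # 6 output columns per variable: the 5 most frequent categories
    # plus one column grouping all the infrequent ones
    sk_ohe = SkOneHotEncoder(
        max_categories=6,
        handle_unknown="infrequent_if_exist",
        sparse_output=False,
    )
    sk_ohe.fit(X_train[["A6", "A7"]])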