Python Feature Engineering Cookbook - Second Edition

By: Soledad Galli

Overview of this book

Feature engineering, the process of transforming variables and creating features, albeit time-consuming, ensures that your machine learning models perform seamlessly. This second edition of Python Feature Engineering Cookbook will take the struggle out of feature engineering by showing you how to use open source Python libraries to accelerate the process via a plethora of practical, hands-on recipes. This updated edition begins by addressing fundamental data challenges such as missing data and categorical values, before moving on to strategies for dealing with skewed distributions and outliers. The concluding chapters show you how to develop new features from various types of data, including text, time series, and relational databases. With the help of numerous open source Python libraries, you'll learn how to implement each feature engineering method in a performant, reproducible, and elegant manner. By the end of this Python book, you will have the tools and expertise needed to confidently build end-to-end and reproducible feature engineering pipelines that can be deployed into production.

Performing ordinal encoding based on the target value

In the previous recipe, we replaced categories with integers that were assigned arbitrarily. We can also assign the integers based on the target values. To do this, first, we must calculate the mean value of the target per category. Next, we must order the categories from the one with the lowest to the one with the highest target mean value. Finally, we must assign integers to the ordered categories, assigning 0 to the first category and k-1 to the last, where k is the number of distinct categories.

This encoding method creates a monotonic relationship between the categorical variable and the response and therefore makes the variables more adequate for use in linear models.

In this recipe, we will encode categories ordered by the target mean value using pandas and Feature-engine.

How to do it...

First, let’s import the necessary Python libraries and get the dataset ready:

  1. Import the required Python libraries, functions, and classes:
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
  2. Let’s load the dataset and divide it into train and test sets:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  3. Let’s determine the mean target value per category in A7, then sort the categories from the one with the lowest to the one with the highest mean target value:
    y_train.groupby(X_train["A7"]).mean().sort_values()

The following is the output of the preceding command:

A7
o          0.000000
ff         0.146341
j          0.200000
dd         0.400000
v          0.418773
bb         0.512821
h          0.603960
n          0.666667
z          0.714286
Missing    1.000000
Name: target, dtype: float64
  4. Now, let’s repeat the computation in step 3, but this time, let’s retain the ordered category names:
    ordered_labels = y_train.groupby(
        X_train["A7"]).mean().sort_values().index

To display the output of the preceding command, we can execute print(ordered_labels):

Index(['o', 'ff', 'j', 'dd', 'v', 'bb', 'h', 'n', 'z', 'Missing'], dtype='object', name='A7')
  5. Let’s create a dictionary of category-to-integer pairs, using the ordered list we created in step 4:
    ordinal_mapping = {
        k: i for i, k in enumerate(
            ordered_labels, 0)
    }

We can visualize the result of the preceding code by executing print(ordinal_mapping):

{'o': 0, 'ff': 1, 'j': 2, 'dd': 3, 'v': 4, 'bb': 5, 'h': 6, 'n': 7, 'z': 8, 'Missing': 9}
  6. Let’s use the dictionary we created in step 5 to replace the categories in A7 in the train and test sets, returning the encoded features as new columns:
    X_train["A7_enc"] = X_train["A7"].map(ordinal_mapping)
    X_test["A7_enc"] = X_test["A7"].map(ordinal_mapping)

Tip

Note that if the test set contains a category not present in the train set, the preceding code will introduce np.nan.
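To make this concrete, here is a minimal sketch of that behavior and one possible safeguard. The mapping and data below are toy values, not the recipe’s dataset, and sending unseen categories to a dedicated integer is just one option among several (you could also drop or impute them):

```python
import pandas as pd

# Hypothetical mapping learned on a train set
ordinal_mapping = {"a": 0, "b": 1, "c": 2}

# A test series containing "d", a category absent from the mapping
s_test = pd.Series(["a", "d", "c"])

encoded = s_test.map(ordinal_mapping)
print(encoded.isna().sum())  # 1 -> "d" was mapped to NaN

# One option: send unseen categories to a dedicated integer
encoded = encoded.fillna(len(ordinal_mapping)).astype(int)
print(encoded.tolist())  # [0, 3, 2]
```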

To better understand the monotonic relationship concept, let’s plot the relationship of the categories of the A7 variable with the target before and after the encoding.

  7. Let’s plot the mean target response per category of the A7 variable:
    y_train.groupby(X_train["A7"]).mean().plot()
    plt.title("Relationship between A7 and the target")
    plt.ylabel("Mean of target")
    plt.show()

We can see the non-monotonic relationship between categories of A7 and the target in the following plot:

Figure 2.7 – Relationship between the categories of A7 and the target


  8. Let’s plot the mean target value per category in the encoded variable:
    y_train.groupby(X_train["A7_enc"]).mean().plot()
    plt.title("Relationship between A7 and the target")
    plt.ylabel("Mean of target")
    plt.show()

The encoded variable shows a monotonic relationship with the target – the higher the mean target value, the higher the digit assigned to the category:

Figure 2.8 – Relationship between A7 and the target after the encoding


Now, let’s perform ordered ordinal encoding using Feature-engine. First, we must divide the dataset into train and test sets, as we did in step 2.

  9. Let’s import the encoder:
    from feature_engine.encoding import OrdinalEncoder
  10. Next, let’s set up the encoder so that it assigns integers by following the target value to all categorical variables in the dataset:
    ordinal_enc = OrdinalEncoder(
        encoding_method="ordered",
        variables=None)

Tip

OrdinalEncoder() will find and encode all categorical variables automatically. Alternatively, we can indicate which variables to encode by passing their names in a list to the variables argument.

  11. Let’s fit the encoder to the train set so that it finds the categorical variables, and then stores the category and integer mappings:
    ordinal_enc.fit(X_train, y_train)

Tip

When fitting the encoder, we need to pass the train set and the target, as we do with many scikit-learn predictors.

  12. Finally, let’s replace the categories with numbers in the train and test sets:
    X_train_enc = ordinal_enc.transform(X_train)
    X_test_enc = ordinal_enc.transform(X_test)

Tip

A list of the categorical variables is stored in the variables_ attribute of OrdinalEncoder(), and the dictionaries with the category-to-integer mappings are stored in the encoder_dict_ attribute.

Go ahead and check the monotonic relationship between other encoded categorical variables and the target by using the code in step 7 and changing the variable name in the groupby() method.
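Instead of eyeballing the plots, you can also test for monotonicity programmatically. The following is a small sketch with toy stand-ins for an encoded column and a binary target; with the recipe’s data you would use X_train_enc and y_train instead:

```python
import pandas as pd

# Toy stand-ins for an encoded variable and a binary target
X_enc = pd.Series([0, 0, 1, 1, 2, 2], name="A7_enc")
y = pd.Series([0, 0, 0, 1, 1, 1], name="target")

# The mean target per encoded value should never decrease
means = y.groupby(X_enc).mean()
print(means.is_monotonic_increasing)  # True
```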

How it works...

In this recipe, we replaced the categories with integers according to the target mean.

In the first part of this recipe, we worked with the A7 categorical variable. With pandas groupby(), we grouped the data based on the categories of A7, and with pandas mean(), we determined the mean value of the target for each category of A7. Next, we ordered the categories with pandas sort_values(), from the ones with the lowest to the ones with the highest target mean response. The output of this operation was a pandas Series, with the categories as the index and the target means as the values. With the Series' index attribute, we captured the ordered categories; then, with a Python dictionary comprehension, we created a dictionary of category-to-integer pairs. Finally, we used this dictionary to replace the categories with integers in the train and test sets using pandas map().
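The pandas steps described above can be gathered into a small helper function. This is just a sketch with names of our choosing and a toy column and target; with the recipe’s data you would call it with X_train["A7"] and y_train:

```python
import pandas as pd

def ordered_ordinal_mapping(X_col, y):
    """Map each category to an integer, ordered by the target mean."""
    ordered_labels = y.groupby(X_col).mean().sort_values().index
    return {category: i for i, category in enumerate(ordered_labels)}

# Toy example: target means are a=1.0, b=0.0, c=0.5
X_col = pd.Series(["a", "a", "b", "b", "c", "c"])
y = pd.Series([1, 1, 0, 0, 0, 1])
print(ordered_ordinal_mapping(X_col, y))  # {'b': 0, 'c': 1, 'a': 2}
```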

Then, we plotted the relationship of the original and encoded variables with the target to visualize the monotonic relationship after the transformation. We determined the mean target value per category of A7 using pandas groupby(), followed by pandas mean(), as described in the preceding paragraph. We then used pandas plot() to chart category versus mean target value, and added a title and a y-axis label with Matplotlib's title() and ylabel() functions.

To perform the encoding with Feature-engine, we used OrdinalEncoder() and indicated "ordered" in the encoding_method argument. We left the argument variables set to None so that the encoder automatically detects all categorical variables in the dataset. With the fit() method, the encoder found the categorical variables to encode and assigned digits to their categories, according to the target mean value. The variables to encode and dictionaries with category-to-digit pairs were stored in the variables_ and encoder_dict_ attributes, respectively. Finally, using the transform() method, the transformer replaced the categories with digits in the train and test sets, returning pandas DataFrames.

See also

For an implementation of this recipe with Category Encoders, visit this book’s GitHub repository.