Python Feature Engineering Cookbook - Second Edition

By: Soledad Galli

Overview of this book

Feature engineering, the process of transforming variables and creating features, albeit time-consuming, ensures that your machine learning models perform seamlessly. This second edition of Python Feature Engineering Cookbook will take the struggle out of feature engineering by showing you how to use open source Python libraries to accelerate the process via a plethora of practical, hands-on recipes. This updated edition begins by addressing fundamental data challenges such as missing data and categorical values, before moving on to strategies for dealing with skewed distributions and outliers. The concluding chapters show you how to develop new features from various types of data, including text, time series, and relational databases. With the help of numerous open source Python libraries, you'll learn how to implement each feature engineering method in a performant, reproducible, and elegant manner. By the end of this Python book, you will have the tools and expertise needed to confidently build end-to-end and reproducible feature engineering pipelines that can be deployed into production.

Implementing target mean encoding

Mean encoding, or target encoding, maps each category to an estimate of the target variable. If the target is binary, the numerical mapping is the posterior probability of the target conditioned on the value of the category. If the target is continuous, the numerical representation is the expected value of the target given the value of the category.

In its simplest form, the numerical representation for each category is given by the mean value of the target variable for that category group. For example, suppose we have a City variable with the categories London, Manchester, and Bristol, and we want to predict default (the target takes values 0 and 1). If the default rate for London is 30%, we replace London with 0.3; if the default rate for Manchester is 20%, we replace Manchester with 0.2; and so on. If the target is continuous – say we want to predict income – then we would replace London, Manchester, and Bristol with the mean income earned in each city.
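
To make this concrete, here is a minimal sketch of the City example using made-up figures:

    import pandas as pd

    # Hypothetical data: 1 indicates default, 0 indicates no default.
    df = pd.DataFrame({
        "City": ["London", "London", "London", "Manchester", "Bristol"],
        "default": [1, 0, 0, 0, 1],
    })

    # The mean of the target per category is the encoding value.
    print(df.groupby("City")["default"].mean())
    # London is replaced with 0.33, Manchester with 0.0, and Bristol with 1.0.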

In mathematical terms, if the target is binary, the replacement value, S, is determined like so:
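
S = ni(Y=1) / ni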

Here, the numerator is the number of observations with a target value of 1 for category i and the denominator is the number of observations with a category value of i.

If the target is continuous, S is determined by the following formula:
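
S = ( Σ y in category i ) / ni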

Here, the numerator is the sum of the target across observations in category i and ni is the total number of observations in category i.

These formulas provide a good approximation of the target estimate if there is a sufficiently large number of observations with each category value – in other words, if ni is large. However, in most datasets, some categories appear in only a few observations. In these cases, target estimates derived from the preceding formulas can be unreliable.

To mitigate poor estimates returned for rare categories, the target estimates can be determined as a mixture of two probabilities: those returned by the preceding formulas and the prior probability of the target based on the entire training set. The two probabilities are blended using a weighting factor, which is a function of the category group size:
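
S = 𝛌 × ( ni(Y=1) / ni ) + (1 − 𝛌) × ( ny / N )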

In this formula, ny is the total number of cases where the target takes a value of 1, N is the size of the train set, and 𝛌 is the weighting factor.

When the category group is large, 𝛌 approaches 1, so more weight is given to the first term of the equation. When the category group size is small, 𝛌 tends to 0, so the estimate is mostly driven by the second term of the equation – that is, the target’s prior probability. In other words, if the group size is small, knowing the value of the category does not tell us anything about the value of the target.

The weighting factor, 𝛌, is a function of the category group size and of two parameters, k and f; the smoothing parameter, f, controls the rate of transition between the first and second terms of the preceding equation:
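
𝛌 = 1 / ( 1 + e^( −(ni − k) / f ) )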

Here, k is half of the minimal size for which we fully trust the first term of the equation. The f parameter is selected by the user either arbitrarily or with optimization.
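
To see how the blend behaves, here is a minimal sketch in Python; all the numbers below are made up purely for illustration:

    import numpy as np

    # Hypothetical values for a single category group.
    n_i = 10                   # observations in the category group
    n_i_pos = 7                # observations in the group with target = 1
    n_y, n_total = 300, 1000   # positives and observations in the whole train set
    k, f = 20, 10              # user-defined parameters of the weighting factor

    # Weighting factor: tends to 1 for large groups and to 0 for small ones.
    lam = 1 / (1 + np.exp(-(n_i - k) / f))

    # Blend of the group's posterior probability and the target's prior.
    encoding = lam * (n_i_pos / n_i) + (1 - lam) * (n_y / n_total)
    print(round(lam, 3), round(encoding, 3))  # roughly 0.269 and 0.408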

Tip

Mean encoding was designed to encode highly cardinal categorical variables without expanding the feature space. For more details, check out the following article: Micci-Barreca D. A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. ACM SIGKDD Explorations Newsletter, 2001.

In this recipe, we will perform mean encoding using pandas, Feature-engine, and Category Encoders.

How to do it...

In the first part of this recipe, we will replace categories with the target mean value, regardless of the number of observations per category. We will use pandas and Feature-engine to do this. In the second part of this recipe, we will introduce the weighting factor using Category Encoders. Let’s begin with this recipe:

  1. Import pandas and the data split function:
    import pandas as pd
    from sklearn.model_selection import train_test_split
  2. Let’s load the dataset and divide it into train and test sets:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  3. Let’s determine the mean target value per category of the A7 variable and then store them in a dictionary:
    mapping = y_train.groupby(X_train["A7"]).mean().to_dict()

We can display the content of the dictionary by executing print(mapping):

{'Missing': 1.0,
 'bb': 0.5128205128205128,
 'dd': 0.4,
 'ff': 0.14634146341463414,
 'h': 0.6039603960396039,
 'j': 0.2,
 'n': 0.6666666666666666,
 'o': 0.0,
 'v': 0.4187725631768953,
 'z': 0.7142857142857143}
  4. Let’s replace the categories with the mean target value, using the dictionary we created in step 3, in the train and test sets:
    X_train["A7"] = X_train["A7"].map(mapping)
    X_test["A7"] = X_test["A7"].map(mapping)

You can inspect the encoded A7 variable by executing X_train["A7"].head().

Now, let’s perform target encoding with Feature-engine. First, we must split the data, as we did in step 2.

  5. Let’s import the encoder:
    from feature_engine.encoding import MeanEncoder
  6. Let’s set up the target mean encoder to encode all categorical variables:
    mean_enc = MeanEncoder(variables=None)

Tip

MeanEncoder() will find and encode all categorical variables by default. Alternatively, we can indicate the variables to encode by passing their names in a list to the variables argument.

  7. Let’s fit the transformer to the train set so that it learns and stores the mean target value per category per variable. Note that we need to pass both the train set and target to fit the encoder:
    mean_enc.fit(X_train, y_train)
  8. Finally, let’s encode the train and test sets:
    X_train_enc = mean_enc.transform(X_train)
    X_test_enc = mean_enc.transform(X_test)

Tip

The category-to-number pairs are stored as a dictionary of dictionaries in the encoder_dict_ attribute. To display the stored parameters, execute mean_enc.encoder_dict_.

Feature-engine returns pandas DataFrames with the categorical variables encoded, ready to use in machine learning models.

To wrap up, let’s implement mean encoding with Category Encoders, blending the probabilities.

  9. Let’s import the encoder:
    from category_encoders.target_encoder import TargetEncoder
  10. Let’s set up the encoder so that it encodes all categorical variables, using blended probabilities when there are fewer than 25 observations in the category group:
    mean_enc = TargetEncoder(
        cols=None, min_samples_leaf=25,
        smoothing=1.0
    )

Tip

TargetEncoder() finds categorical variables automatically by default. Alternatively, we can indicate the variables to encode by passing their names in a list to the cols argument. The smoothing parameter controls the blend of the prior and posterior probability. Higher values decrease the contribution of the posterior probability to the encoding.

  11. Let’s fit the transformer to the train set so that it learns and stores the numerical representations for each category:
    mean_enc.fit(X_train, y_train)

Note

The min_samples_leaf parameter refers to the minimum number of observations per category that a group should have to solely use the posterior probability. It is the equivalent of k in our weighting factor formula. In the original article, k was set to half of min_samples_leaf. Category Encoders exposes this value, and thus we can optimize it with cross-validation.

  12. Finally, let’s encode the train and test sets:
    X_train_enc = mean_enc.transform(X_train)
    X_test_enc = mean_enc.transform(X_test)

Category Encoders returns pandas DataFrames by default, where the original categorical variable values are replaced by their numerical representation. You can inspect the results by executing X_train_enc.head().

How it works…

In this recipe, we replaced the categories with the mean target value using pandas, Feature-engine, and Category Encoders.

With pandas groupby(), using the A7 categorical variable, followed by pandas mean() over the target variable, we created a pandas Series with the categories as indices and the target mean as values. With pandas to_dict(), we converted this Series into a dictionary. Finally, we used this dictionary to replace the categories in the train and test sets using pandas map().

To perform the encoding with Feature-engine, we used MeanEncoder(). With fit(), the transformer found and stored the categorical variables and the mean target value per category. With transform(), categories were replaced with numbers in the train and test sets, returning pandas DataFrames.

Finally, we used TargetEncoder() from Category Encoders to replace categories with a blend of the prior and posterior probability estimates of the target. We set min_samples_leaf to 25, which meant that if a category group had 25 observations or more, the posterior probability was used for the encoding; otherwise, a blend of the probabilities was used. With fit(), the transformer found the categorical variables and the numerical representation of the categories, while with transform(), the categories were replaced with numbers, returning pandas DataFrames with the encoded values.

There’s more…

There is an alternative way to return better target estimates when the category groups are small. The replacement value for each category is determined as follows:
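
S = ( ni(Y=1) + pY × m ) / ( ni + m )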

Here, ni(Y=1) is the number of observations with a target value of 1 in category i and ni is the number of observations with category i. The target prior is given by pY and m is the weighting factor. With this adjustment, the only parameter that we have to set is the weight, m. If m is large, then more importance is given to the target’s prior probability. This adjustment affects the target estimates for all categories, but mostly those with fewer observations because, in such cases, m could be much larger than ni in the formula’s denominator.
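
For instance, consider a hypothetical rare category with only two observations, one of which has a target value of 1, and a target prior of 0.3; with m set to 10, the estimate is pulled towards the prior:

    # Made-up numbers for illustration only.
    n_i, n_i_pos, p_y, m = 2, 1, 0.3, 10
    encoding = (n_i_pos + p_y * m) / (n_i + m)
    print(round(encoding, 3))  # 0.333 instead of the raw category mean of 0.5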

For an implementation of this encoding using MEstimateEncoder(), visit this book’s GitHub repository.
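
As a rough sketch only (not the repository’s code), and assuming the same train and test sets from step 2, the encoder from the Category Encoders library could be used like this; the value of m here is arbitrary:

    from category_encoders import MEstimateEncoder

    # m is the weight given to the target's prior probability.
    mean_enc = MEstimateEncoder(cols=None, m=5.0)
    mean_enc.fit(X_train, y_train)
    X_train_enc = mean_enc.transform(X_train)
    X_test_enc = mean_enc.transform(X_test)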