# Implementing target mean encoding

Mean encoding, or target encoding, maps each category to a probability estimate of the target attribute. If the target is binary, the numerical mapping is the posterior probability of the target conditioned on the value of the category. If the target is continuous, the numerical representation is the expected value of the target given the value of the category.

In its simplest form, the numerical representation for each category is given by the mean value of the target variable for a particular category group. For example, if we have a **City** variable, with the categories of **London**, **Manchester**, and **Bristol**, and we want to predict the default rate (the target takes values 0 and 1); if the default rate for **London** is 30%, we replace **London** with 0.3; if the default rate for **Manchester** is 20%, we replace **Manchester** with 0.2; and so on. If the target is continuous – say we want to predict income – then we would replace London, Manchester, and Bristol with the mean income earned in each city.
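To make this concrete, here is a minimal sketch of plain mean encoding with pandas. The city and default values are made up purely for illustration:

```python
import pandas as pd

# Hypothetical data: city of residence and a binary default flag.
df = pd.DataFrame({
    "City": ["London", "London", "Manchester", "Bristol", "Bristol", "Bristol"],
    "default": [1, 0, 1, 0, 0, 1],
})

# The mean target value per category is the encoding.
mapping = df.groupby("City")["default"].mean()
df["City_enc"] = df["City"].map(mapping)
print(mapping.to_dict())
# {'Bristol': 0.3333333333333333, 'London': 0.5, 'Manchester': 1.0}
```

Each city is replaced by its observed default rate, exactly as described above.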

In mathematical terms, if the target is binary, the replacement value, $S_i$, is determined like so:

$$S_i = \frac{n_{i(Y=1)}}{n_i}$$

Here, the numerator, $n_{i(Y=1)}$, is the number of observations with a target value of 1 for category *i*, and the denominator, $n_i$, is the number of observations with a category value of *i*.

If the target is continuous, $S_i$ is determined by the following formula:

$$S_i = \frac{\sum_{j \in i} y_j}{n_i}$$

Here, the numerator is the sum of the target values across the observations in category *i*, and $n_i$ is the total number of observations in category *i*.

These formulas provide a good approximation of the target estimate if there is a sufficiently large number of observations for each category value – in other words, if $n_i$ is large. However, in most datasets, categorical variables will have some category values that are present in only a few observations. In these cases, target estimates derived from the preceding formulas can be unreliable.

To mitigate poor estimates returned for rare categories, the target estimates can be determined as a mixture of two probabilities: the one returned by the preceding formulas and the prior probability of the target based on the entire training set. The two probabilities are *blended* using a weighting factor, which is a function of the category group size:

$$S_i = \lambda(n_i)\,\frac{n_{i(Y=1)}}{n_i} + \bigl(1 - \lambda(n_i)\bigr)\,\frac{n_Y}{N}$$

In this formula, $n_Y$ is the total number of cases where the target takes a value of 1, $N$ is the size of the train set, and $\lambda$ is the weighting factor.

When the category group is large, $\lambda$ approximates 1, so more weight is given to the first term of the equation. When the category group size is small, $\lambda$ tends to 0, so the estimate is mostly driven by the second term of the equation – that is, the target's prior probability. In other words, if the group size is small, knowing the value of the category does not tell us anything about the value of the target.

The weighting factor, $\lambda$, is a function of the group size, $n_i$, and a smoothing parameter, $f$, that controls the rate of transition between the first and second terms of the preceding equation:

$$\lambda(n_i) = \frac{1}{1 + e^{-(n_i - k)/f}}$$

Here, $k$ is half of the minimal group size for which we *fully trust* the first term of the equation. The $f$ parameter is selected by the user, either arbitrarily or via optimization.
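The behavior of the weighting factor can be sketched as a simple sigmoid in Python; the `k` and `f` values below are chosen arbitrarily for illustration:

```python
import math

def weighting_factor(n_i, k=25, f=5):
    """Sigmoid weight from Micci-Barreca (2001): 1 / (1 + exp(-(n_i - k) / f))."""
    return 1 / (1 + math.exp(-(n_i - k) / f))

# Small groups get a weight near 0, so the prior probability dominates;
# large groups get a weight near 1, so the category mean dominates.
print(weighting_factor(2))    # ≈ 0.01
print(weighting_factor(25))   # 0.5, the transition midpoint at n_i = k
print(weighting_factor(100))  # ≈ 1.0
```

Increasing `f` makes the transition between the two regimes more gradual.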

Tip

Mean encoding was designed to encode highly cardinal categorical variables without expanding the feature space. For more details, check out the following article: Micci-Barreca D. *A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems*. ACM SIGKDD Explorations Newsletter, 2001.

In this recipe, we will perform mean encoding using pandas, Feature-engine, and Category Encoders.

## How to do it...

In the first part of this recipe, we will replace categories with the target mean value, regardless of the number of observations per category. We will use pandas and Feature-engine to do this. In the second part of this recipe, we will introduce the weighting factor using Category Encoders. Let’s begin with this recipe:

1. Import `pandas` and the data split function:

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split
    ```

2. Let’s load the dataset and divide it into train and test sets:

    ```python
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
    ```

3. Let’s determine the mean target value per category of the `A7` variable and store the values in a dictionary:

    ```python
    mapping = y_train.groupby(X_train["A7"]).mean().to_dict()
    ```

We can display the contents of the dictionary by executing `print(mapping)`:

```
{'Missing': 1.0, 'bb': 0.5128205128205128, 'dd': 0.4, 'ff': 0.14634146341463414, 'h': 0.6039603960396039, 'j': 0.2, 'n': 0.6666666666666666, 'o': 0.0, 'v': 0.4187725631768953, 'z': 0.7142857142857143}
```

4. Let’s replace the categories with the mean target value in the train and test sets, using the dictionary we created in *step 3*:

    ```python
    X_train["A7"] = X_train["A7"].map(mapping)
    X_test["A7"] = X_test["A7"].map(mapping)
    ```

You can inspect the encoded `A7` variable by executing `X_train["A7"].head()`.

Now, let’s perform target encoding with Feature-engine. First, we must split the data, as we did in *step 2*.

5. Let’s import the encoder:

    ```python
    from feature_engine.encoding import MeanEncoder
    ```

6. Let’s set up the target mean encoder to encode all categorical variables:

    ```python
    mean_enc = MeanEncoder(variables=None)
    ```

Tip

`MeanEncoder()` will find and encode all categorical variables by default. Alternatively, we can indicate which variables to encode by passing their names in a list to the `variables` argument.

7. Let’s fit the transformer to the train set so that it learns and stores the mean target value per category per variable. Note that we need to pass both the train set and the target to fit the encoder:

    ```python
    mean_enc.fit(X_train, y_train)
    ```

8. Finally, let’s encode the train and test sets:

    ```python
    X_train_enc = mean_enc.transform(X_train)
    X_test_enc = mean_enc.transform(X_test)
    ```

Tip

The category-to-number pairs are stored as a dictionary of dictionaries in the `encoder_dict_` attribute. To display the stored parameters, execute `mean_enc.encoder_dict_`.

Feature-engine returns pandas DataFrames in which the categorical variables have been replaced by numbers, ready to use in machine learning models.

To wrap up, let’s implement mean encoding with Category Encoders, blending the probabilities.

9. Let’s import the encoder:

    ```python
    from category_encoders.target_encoder import TargetEncoder
    ```

10. Let’s set up the encoder so that it encodes all categorical variables, using blended probabilities when there are fewer than 25 observations in the category group:

    ```python
    mean_enc = TargetEncoder(
        cols=None, min_samples_leaf=25, smoothing=1.0
    )
    ```

Tip

`TargetEncoder()` finds the categorical variables automatically by default. Alternatively, we can indicate which variables to encode by passing their names in a list to the `cols` argument. The `smoothing` parameter controls the blend of the prior and posterior probabilities. Higher values decrease the contribution of the posterior probability to the encoding.

11. Let’s fit the transformer to the train set so that it learns and stores the numerical representations for each category:

    ```python
    mean_enc.fit(X_train, y_train)
    ```

Note

The `min_samples_leaf` parameter refers to the minimum number of observations per category that a group should have for the posterior probability to be used on its own. It is the equivalent of $k$ in our weighting factor formula. In the original article, $k$ was set to half of `min_samples_leaf`. Category Encoders exposes this value, so we can optimize it with cross-validation.

12. Finally, let’s encode the train and test sets:

    ```python
    X_train_enc = mean_enc.transform(X_train)
    X_test_enc = mean_enc.transform(X_test)
    ```

Category Encoders returns pandas DataFrames by default, where the original categorical variable values are replaced by their numerical representations. You can inspect the result by executing `X_train_enc.head()`.

## How it works…

In this recipe, we replaced the categories with the mean target value using pandas, Feature-engine, and Category Encoders.

With pandas `groupby()` over the `A7` categorical variable, followed by pandas `mean()` over the target variable, we created a pandas Series with the categories as indices and the target means as values. With pandas `to_dict()`, we converted this Series into a dictionary. Finally, we used this dictionary to replace the categories in the train and test sets with pandas `map()`.

To perform the encoding with Feature-engine, we used `MeanEncoder()`. With `fit()`, the transformer found the categorical variables and stored the mean target value per category. With `transform()`, the categories were replaced with numbers in the train and test sets, returning pandas DataFrames.

Finally, we used `TargetEncoder()` from Category Encoders to replace categories with a blend of the prior and posterior probability estimates of the target. We set `min_samples_leaf` to 25, which meant that if a category group had 25 observations or more, the encoding was driven mostly by the posterior probability; otherwise, more weight was given to the prior. With `fit()`, the transformer found the categorical variables and learned the numerical representation of each category, while with `transform()`, the categories were replaced with numbers, returning pandas DataFrames with the encoded values.

## There’s more…

There is an alternative way to return *better* target estimates when the category groups are small. The replacement value for each category is determined as follows:

$$S_i = \frac{n_{i(Y=1)} + p_Y \times m}{n_i + m}$$

Here, $n_{i(Y=1)}$ is the number of observations with a target value of 1 in category *i*, and $n_i$ is the number of observations with category *i*. The target's prior probability is given by $p_Y$, and $m$ is the weighting factor. With this adjustment, the only parameter that we have to set is the weight, $m$. If $m$ is large, more importance is given to the target's prior probability. This adjustment affects the target estimates for all categories, but mostly those with fewer observations, because in such cases $m$ can be much larger than $n_i$ in the formula's denominator.
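This adjustment can be sketched in a few lines of pandas; the data and the value of `m` below are arbitrary, for illustration only:

```python
import pandas as pd

# Toy data: one frequent category ("a") and one rare category ("b").
df = pd.DataFrame({
    "cat": ["a", "a", "a", "a", "b"],
    "target": [1, 1, 0, 1, 1],
})

m = 10                           # weight given to the target's prior
prior = df["target"].mean()      # p_Y, the prior probability (0.8 here)
counts = df.groupby("cat")["target"].agg(["sum", "count"])

# S_i = (n_i(Y=1) + p_Y * m) / (n_i + m)
encoding = (counts["sum"] + prior * m) / (counts["count"] + m)
print(encoding.to_dict())
```

Note how the rare category `b` has a raw target mean of 1.0 but only one observation, so its encoding (9/11 ≈ 0.82) is pulled toward the prior of 0.8.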

For an implementation of this encoding using `MEstimateEncoder()`, visit this book’s GitHub repository.