# Performing ordinal encoding based on the target value

In the previous recipe, we replaced categories with integers that were assigned arbitrarily. We can also assign integers to the categories based on the target values. To do this, first, we must calculate the mean value of the target per category. Next, we must order the categories from the one with the lowest to the one with the highest target mean value. Finally, we must assign digits to the ordered categories, from 0 for the first category up to *k-1* for the last, where *k* is the number of distinct categories.

This encoding method creates a monotonic relationship between the categorical variable and the response, and therefore makes the variable better suited for use in linear models.

In this recipe, we will encode categories following the target mean value using pandas and Feature-engine.

## How to do it...

First, let’s import the necessary Python libraries and get the dataset ready:

1. Import the required Python libraries, functions, and classes:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
```

2. Let’s load the dataset and divide it into train and test sets:

```python
data = pd.read_csv("credit_approval_uci.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=["target"], axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
```

3. Let’s determine the mean target value per category in `A7`, then sort the categories from that with the lowest to that with the highest target value:

```python
y_train.groupby(X_train["A7"]).mean().sort_values()
```

The following is the output of the preceding command:

```
A7
o          0.000000
ff         0.146341
j          0.200000
dd         0.400000
v          0.418773
bb         0.512821
h          0.603960
n          0.666667
z          0.714286
Missing    1.000000
Name: target, dtype: float64
```

4. Now, let’s repeat the computation in *step 3*, but this time, let’s retain the ordered category names:

```python
ordered_labels = y_train.groupby(
    X_train["A7"]).mean().sort_values().index
```

To display the output of the preceding command, we can execute `print(ordered_labels)`:

```
Index(['o', 'ff', 'j', 'dd', 'v', 'bb', 'h', 'n', 'z', 'Missing'],
      dtype='object', name='A7')
```

5. Let’s create a dictionary of category-to-integer pairs, using the ordered list we created in *step 4*:

```python
ordinal_mapping = {
    k: i for i, k in enumerate(
        ordered_labels, 0)
}
```

We can visualize the result of the preceding code by executing `print(ordinal_mapping)`:

```
{'o': 0, 'ff': 1, 'j': 2, 'dd': 3, 'v': 4, 'bb': 5, 'h': 6, 'n': 7, 'z': 8, 'Missing': 9}
```

6. Let’s use the dictionary we created in *step 5* to replace the categories in `A7` in the train and test sets, returning the encoded features as new columns:

```python
X_train["A7_enc"] = X_train["A7"].map(ordinal_mapping)
X_test["A7_enc"] = X_test["A7"].map(ordinal_mapping)
```

Tip

Note that if the test set contains a category not present in the train set, the preceding code will introduce `np.nan`.
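If unseen categories are a concern, one possible workaround (not part of the recipe itself) is to impute the resulting `np.nan` values with a new integer, for example, one past the largest learned code. The following is a minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical mapping learned from a train set containing only "a" and "b".
ordinal_mapping = {"a": 0, "b": 1}

# "c" did not appear during training, so map() produces NaN for it.
test_col = pd.Series(["a", "b", "c"])
encoded = test_col.map(ordinal_mapping)

# One possible fallback: assign unseen categories the next available integer.
encoded = encoded.fillna(len(ordinal_mapping)).astype(int)
print(encoded.tolist())  # [0, 1, 2]
```

Whether a dedicated "unseen" code is appropriate depends on the model; grouping rare and unseen categories before encoding is another common choice.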

To better understand the monotonic relationship concept, let’s plot the relationship of the categories of the `A7` variable with the target before and after the encoding.

7. Let’s plot the mean target response per category of the `A7` variable:

```python
y_train.groupby(X_train["A7"]).mean().plot()
plt.title("Relationship between A7 and the target")
plt.ylabel("Mean of target")
plt.show()
```

We can see the non-monotonic relationship between the categories of `A7` and the target in the following plot:

Figure 2.7 – Relationship between the categories of A7 and the target

8. Let’s plot the mean target value per category in the encoded variable:

```python
y_train.groupby(X_train["A7_enc"]).mean().plot()
plt.title("Relationship between A7 and the target")
plt.ylabel("Mean of target")
plt.show()
```

The encoded variable shows a monotonic relationship with the target – the higher the mean target value, the higher the digit assigned to the category:

Figure 2.8 – Relationship between A7 and the target after the encoding

Now, let’s perform ordered ordinal encoding using Feature-engine. First, we must divide the dataset into train and test sets, as we did in *step 2*.

9. Let’s import the encoder:

```python
from feature_engine.encoding import OrdinalEncoder
```

10. Next, let’s set up the encoder so that it assigns integers by following the target value to all categorical variables in the dataset:

```python
ordinal_enc = OrdinalEncoder(
    encoding_method="ordered", variables=None)
```

Tip

`OrdinalEncoder()` will find and encode all categorical variables automatically. Alternatively, we can indicate which variables to encode by passing their names in a list to the `variables` argument.

11. Let’s fit the encoder to the train set so that it finds the categorical variables, and then stores the category-to-integer mappings:

```python
ordinal_enc.fit(X_train, y_train)
```

Tip

When fitting the encoder, we need to pass the train set and the target, like with many scikit-learn predictor classes.

12. Finally, let’s replace the categories with numbers in the train and test sets:

```python
X_train_enc = ordinal_enc.transform(X_train)
X_test_enc = ordinal_enc.transform(X_test)
```

Tip

A list of the categorical variables is stored in the `variables_` attribute of `OrdinalEncoder()`, and the dictionaries with the category-to-integer mappings in the `encoder_dict_` attribute.

Go ahead and check the monotonic relationship between other encoded categorical variables and the target by using the code in *step 7* and changing the variable name in the `groupby()` method.

## How it works...

In this recipe, we replaced the categories with integers according to the target mean.

In the first part of this recipe, we worked with the `A7` categorical variable. With pandas `groupby()`, we grouped the data based on the categories of `A7`, and with pandas `mean()`, we determined the mean value of the target for each of the categories of `A7`. Next, we ordered the categories with pandas `sort_values()` from the ones with the lowest to the ones with the highest target mean response. The output of this operation was a pandas Series, with the categories as indices and the target mean as values. With pandas `index`, we captured the ordered categories in an array; then, with a Python dictionary comprehension, we created a dictionary of category-to-integer pairs. Finally, we used this dictionary to replace the categories with integers using pandas `map()` in the train and test sets.
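The manual pipeline described above can be condensed into a few lines. The following is a self-contained sketch on a small made-up dataset (the column, categories, and target values are illustrative, not the recipe's credit data):

```python
import pandas as pd

# Toy data: one categorical column and a binary target (illustrative only).
X = pd.DataFrame({"color": ["red", "blue", "red", "green", "blue", "green"]})
y = pd.Series([1, 0, 1, 0, 1, 0])

# Mean target per category, sorted ascending -> ordered category index.
ordered_labels = y.groupby(X["color"]).mean().sort_values().index

# Dictionary comprehension: category -> position in the ordered index.
ordinal_mapping = {k: i for i, k in enumerate(ordered_labels)}

# Replace categories with their learned integer codes.
X["color_enc"] = X["color"].map(ordinal_mapping)
print(ordinal_mapping)  # {'green': 0, 'blue': 1, 'red': 2}
```

Here `green` has the lowest target mean (0.0) and `red` the highest (1.0), so they receive the smallest and largest codes, respectively, exactly as in *steps 3* to *6*.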

Then, we plotted the relationship of the original and encoded variables with the target to visualize the monotonic relationship after the transformation. We determined the mean target value per category of `A7` using pandas `groupby()`, followed by pandas `mean()`, as described in the preceding paragraph. We followed up with pandas `plot()` to create a plot of category versus target mean value. We added a title and *y*-axis label with Matplotlib’s `title()` and `ylabel()` functions.

To perform the encoding with Feature-engine, we used `OrdinalEncoder()` and indicated `"ordered"` in the `encoding_method` argument. We left the `variables` argument set to `None` so that the encoder automatically detects all categorical variables in the dataset. With the `fit()` method, the encoder found the categorical variables to encode and assigned digits to their categories, according to the target mean value. The variables to encode and the dictionaries with category-to-digit pairs were stored in the `variables_` and `encoder_dict_` attributes, respectively. Finally, using the `transform()` method, the transformer replaced the categories with digits in the train and test sets, returning pandas DataFrames.
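For intuition, the fit/transform mechanics just described can be approximated in plain pandas. This is a simplified sketch on made-up data, not Feature-engine's actual implementation (it omits input validation, variable detection logic, and unseen-category handling):

```python
import pandas as pd

# Made-up train data with two categorical columns and a binary target.
X_train = pd.DataFrame({
    "A": ["x", "x", "y", "y", "z", "z"],
    "B": ["p", "q", "p", "q", "p", "q"],
})
y_train = pd.Series([0, 0, 0, 1, 1, 1])

# "fit": learn one category-to-integer mapping per column,
# ordering categories by their mean target value.
encoder_dict_ = {}
for col in X_train.columns:
    ordered = y_train.groupby(X_train[col]).mean().sort_values().index
    encoder_dict_[col] = {cat: i for i, cat in enumerate(ordered)}

# "transform": replace categories with the learned integers, column by column.
X_train_enc = X_train.apply(lambda s: s.map(encoder_dict_[s.name]))
print(encoder_dict_["A"])  # {'x': 0, 'y': 1, 'z': 2}
```

Storing one mapping per column in a dictionary keyed by column name mirrors the role of the `encoder_dict_` attribute, which is why inspecting that attribute after fitting is a quick way to audit what the encoder learned.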

## See also

For an implementation of this recipe with Category Encoders, visit this book’s GitHub repository.