# Replacing categories with ordinal numbers

Ordinal encoding consists of replacing the categories with digits from 1 to *k* (or 0 to *k-1*, depending on the implementation), where *k* is the number of distinct categories of the variable. The numbers are assigned arbitrarily. Ordinal encoding is better suited for non-linear machine learning models, which can navigate through the arbitrarily assigned digits to find patterns that relate to the target.

In this recipe, we will perform ordinal encoding using pandas, scikit-learn, and Feature-engine.

## How to do it...

First, let’s import the necessary Python libraries and prepare the dataset:

1. Import `pandas` and the `train_test_split` function:

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split
    ```

2. Let’s load the dataset and divide it into train and test sets:

    ```python
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
    ```

3. To encode the `A7` variable, let’s make a dictionary of category-to-integer pairs:

    ```python
    ordinal_mapping = {
        k: i for i, k in enumerate(X_train["A7"].unique(), 0)
    }
    ```

If we execute `print(ordinal_mapping)`, we will see the digits that will replace each category:

```
{'v': 0, 'ff': 1, 'h': 2, 'dd': 3, 'z': 4, 'bb': 5, 'j': 6, 'Missing': 7, 'n': 8, 'o': 9}
```

4. Now, let’s replace the categories with numbers in the original variables:

    ```python
    X_train["A7"] = X_train["A7"].map(ordinal_mapping)
    X_test["A7"] = X_test["A7"].map(ordinal_mapping)
    ```

With `print(X_train["A7"].head(10))`, we can see the result of the preceding operation, where the original categories were replaced by numbers:

```
596    0
303    0
204    0
351    1
118    0
247    2
652    0
513    3
230    0
250    4
Name: A7, dtype: int64
```
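One caveat is worth flagging here: pandas `map()` returns `NaN` for any value that is absent from the mapping dictionary, so a category that appears in the test set but not in the train set is silently converted to a missing value. A minimal sketch with made-up categories:

```python
import pandas as pd

# Build the mapping from the "train" categories only.
train = pd.Series(["v", "ff", "v", "h"])
mapping = {k: i for i, k in enumerate(train.unique())}

# "j" was never seen during training, so map() yields NaN for it,
# and the result is cast to float to accommodate the missing value.
test = pd.Series(["v", "j"])
encoded = test.map(mapping)
print(encoded.tolist())  # [0.0, nan]
```

If unseen categories are possible, it is worth checking the encoded test set for `NaN` before feeding it to a model.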

Next, let’s carry out ordinal encoding using scikit-learn. First, we need to divide the data into train and test sets, as we did in *step 2*.

5. Let’s import the required classes:

    ```python
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.compose import ColumnTransformer
    ```

Tip

Do not confuse `OrdinalEncoder()` with `LabelEncoder()` from scikit-learn. The former is intended to encode predictive features, whereas the latter is intended to modify the target variable.
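To make the distinction concrete, here is a small sketch with toy data: `LabelEncoder()` operates on a one-dimensional target vector, while `OrdinalEncoder()` expects a two-dimensional array of features:

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder works on a 1-D target vector.
y = ["approved", "rejected", "approved"]
print(LabelEncoder().fit_transform(y))  # approved -> 0, rejected -> 1

# OrdinalEncoder expects a 2-D array of features (rows x columns).
X = [["v"], ["ff"], ["v"]]
print(OrdinalEncoder().fit_transform(X))  # ff -> 0.0, v -> 1.0
```

Note that both classes assign integers in alphabetical order of the categories, unlike the arbitrary, order-of-appearance mapping we built with pandas.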

6. Let’s set up the encoder:

    ```python
    enc = OrdinalEncoder()
    ```

Note

Scikit-learn’s `OrdinalEncoder()` will encode the entire dataset. To encode only a selection of variables, we need to use scikit-learn’s `ColumnTransformer()`.

7. Let’s make a list containing the categorical variables to encode:

    ```python
    vars_categorical = X_train.select_dtypes(
        include="O").columns.to_list()
    ```

8. Let’s make a list containing the remaining variables:

    ```python
    vars_remainder = X_train.select_dtypes(
        exclude="O").columns.to_list()
    ```

9. Now, let’s set up `ColumnTransformer()` to encode the categorical variables. By setting the `remainder` parameter to `"passthrough"`, we make `ColumnTransformer()` concatenate the variables that are not encoded at the back of the encoded features:

    ```python
    ct = ColumnTransformer(
        [("encoder", enc, vars_categorical)],
        remainder="passthrough",
    )
    ```

10. Let’s fit the encoder to the train set so that it creates and stores the category-to-digit representations:

    ```python
    ct.fit(X_train)
    ```

By executing `ct.named_transformers_["encoder"].categories_`, you can visualize the unique categories per variable.

11. Now, let’s encode the categorical variables in the train and test sets:

    ```python
    X_train_enc = ct.transform(X_train)
    X_test_enc = ct.transform(X_test)
    ```

Remember that scikit-learn returns a NumPy array.

12. Let’s transform the arrays into pandas DataFrames by adding the column names:

    ```python
    X_train_enc = pd.DataFrame(
        X_train_enc,
        columns=vars_categorical + vars_remainder)
    X_test_enc = pd.DataFrame(
        X_test_enc,
        columns=vars_categorical + vars_remainder)
    ```

Note

With `ColumnTransformer()`, the variables that were not encoded are returned to the right of the DataFrame, following the encoded variables. You can visualize the output of *step 12* with `X_train_enc.head()`.

Now, let’s do ordinal encoding with Feature-engine. First, we must divide the dataset, as we did in *step 2*.

13. Let’s import the encoder:

    ```python
    from feature_engine.encoding import OrdinalEncoder
    ```

14. Let’s set up the encoder so that it replaces categories with arbitrary integers in the categorical variables specified in *step 7*:

    ```python
    enc = OrdinalEncoder(
        encoding_method="arbitrary",
        variables=vars_categorical,
    )
    ```

Note

Feature-engine’s `OrdinalEncoder()` automatically finds and encodes all categorical variables if the `variables` parameter is left set to `None`. Alternatively, it will encode the variables indicated in the list. In addition, Feature-engine’s `OrdinalEncoder()` can assign the integers according to the target mean value (see the *Performing ordinal encoding based on the target value* recipe).

15. Let’s fit the encoder to the train set so that it learns and stores the category-to-integer mappings:

    ```python
    enc.fit(X_train)
    ```

Tip

The category-to-integer mappings are stored in the `encoder_dict_` attribute and can be accessed by executing `enc.encoder_dict_`.

16. Finally, let’s encode the categorical variables in the train and test sets:

    ```python
    X_train_enc = enc.transform(X_train)
    X_test_enc = enc.transform(X_test)
    ```

Feature-engine returns pandas DataFrames where the values of the original variables are replaced with numbers, leaving the DataFrame ready to use in machine learning models.

## How it works...

In this recipe, we replaced categories with integers assigned arbitrarily.

With pandas `unique()`, we returned the unique values of the `A7` variable, and using Python’s dictionary comprehension syntax, we created a dictionary of key-value pairs, where each key was one of the `A7` variable’s unique categories, and each value was the digit that would replace the category. Finally, we used pandas `map()` to replace the strings in `A7` with the integers.
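These mechanics can be reproduced with a toy series (the categories here are made up for illustration):

```python
import pandas as pd

s = pd.Series(["v", "ff", "v", "h"])

# unique() preserves order of appearance, so the first category seen
# gets 0, the second gets 1, and so on.
mapping = {k: i for i, k in enumerate(s.unique())}
print(mapping)  # {'v': 0, 'ff': 1, 'h': 2}

# map() replaces each string with its assigned integer.
print(s.map(mapping).tolist())  # [0, 1, 0, 2]
```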

Next, we carried out ordinal encoding using scikit-learn’s `OrdinalEncoder()` and used `ColumnTransformer()` to select the columns to encode. With the `fit()` method, the transformer created the category-to-integer mappings based on the categories in the train set. With the `transform()` method, the categories were replaced with integers, returning a NumPy array. `ColumnTransformer()` sliced the DataFrame into the categorical variables to encode, and then concatenated the remaining variables to the right of the encoded features.

To perform ordinal encoding with Feature-engine, we used `OrdinalEncoder()`, indicating that the integers should be assigned arbitrarily in `encoding_method` and passing a list with the variables to encode in the `variables` argument. With the `fit()` method, the encoder assigned integers to each variable’s categories, which were stored in the `encoder_dict_` attribute. These mappings were then used by the `transform()` method to replace the categories in the train and test sets, returning DataFrames.

## There’s more...

You can also carry out ordinal encoding with `OrdinalEncoder()` from Category Encoders.

The transformers from Feature-engine and Category Encoders can automatically identify and encode categorical variables – that is, those of the object or categorical type. They also allow us to encode only a subset of the variables.

scikit-learn’s transformer, by contrast, encodes all variables in the dataset. To encode just a subset, we need an additional class, `ColumnTransformer()`, to slice the data before the transformation.

Feature-engine and Category Encoders return pandas DataFrames, whereas scikit-learn returns NumPy arrays.

Finally, each class has additional functionality. For example, with scikit-learn, we can encode only a subset of the categories, whereas Feature-engine allows us to replace categories with integers that are assigned based on the target mean value. On the other hand, Category Encoders can automatically handle missing data and offers alternative options to work with unseen categories.
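As an illustration of unseen-category handling, scikit-learn’s `OrdinalEncoder()` can also be configured to map categories not seen during `fit()` to a sentinel value, through its `handle_unknown` and `unknown_value` parameters. A sketch with toy data:

```python
from sklearn.preprocessing import OrdinalEncoder

# Map categories unseen during fit() to -1 instead of raising an error.
enc = OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
enc.fit([["v"], ["ff"], ["h"]])

# "zz" was not present during fit, so it is encoded as -1.
print(enc.transform([["v"], ["zz"]]).tolist())  # [[2.0], [-1.0]]
```

A sentinel like `-1` keeps the pipeline from failing at prediction time, although the model will treat it as just another ordinal value.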