# Performing one-hot encoding of frequent categories

One-hot encoding represents each variable’s category with a binary variable. Hence, one-hot encoding of highly cardinal variables or datasets with multiple categorical features can expand the feature space dramatically. This, in turn, may increase the computational cost of using machine learning models or deteriorate their performance. To reduce the number of binary variables, we can perform one-hot encoding of the most frequent categories. One-hot encoding the top categories is equivalent to treating the remaining, less frequent categories as a single, unique category.
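To make this trade-off concrete, here is a minimal sketch (using a made-up series, not the recipe’s dataset) showing that encoding only the top categories shrinks the feature space, and that rows whose category is not in the top group get all-zero rows, that is, they behave as a single “other” category:

```python
import pandas as pd

# A toy variable with 5 categories; 'a' and 'b' are the frequent ones
s = pd.Series(["a"] * 5 + ["b"] * 3 + ["c", "d", "e"])

full = pd.get_dummies(s)  # one binary column per category: 5 columns
top = s.value_counts().head(2).index  # the 2 most frequent: 'a', 'b'
reduced = full[top]  # keep only the top categories: 2 columns

# Rows with 'c', 'd', or 'e' are all zeros in `reduced`:
# they are implicitly grouped into one "other" category.
```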

In this recipe, we will implement one-hot encoding of the most popular categories using pandas and Feature-engine.

## How to do it...

First, let’s import the necessary Python libraries and get the dataset ready:

1. Import the required Python libraries, functions, and classes:

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from feature_engine.encoding import OneHotEncoder
```

2. Let’s load the dataset and divide it into train and test sets:

```python
data = pd.read_csv("credit_approval_uci.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=["target"], axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
```

Tip

The most frequent categories must be determined using the train set only. This is to avoid data leakage.

3. Let’s inspect the unique categories of the `A6` variable:

```python
X_train["A6"].unique()
```

The unique values of `A6` are displayed in the following output:

```
array(['c', 'q', 'w', 'ff', 'm', 'i', 'e', 'cc', 'x', 'd', 'k', 'j',
       'Missing', 'aa', 'r'], dtype=object)
```

4. Let’s count the number of observations per category of `A6`, sort the categories in decreasing order, and then display the five most frequent ones:

```python
X_train["A6"].value_counts().sort_values(
    ascending=False).head(5)
```

We can see the five most frequent categories and the number of observations per category in the following output:

```
c     93
q     56
w     48
i     41
ff    38
Name: A6, dtype: int64
```

5. Now, let’s capture the most frequent categories of `A6` in a list by using the code from *step 4* inside a list comprehension:

```python
top_5 = [
    x for x in X_train["A6"].value_counts().sort_values(
        ascending=False).head(5).index
]
```
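As an aside, the list comprehension above only materializes the index of the resulting Series; `Index.tolist()` achieves the same result more directly. A minimal sketch with a toy series, since the recipe’s dataset isn’t loaded here:

```python
import pandas as pd

s = pd.Series(["c", "c", "c", "q", "q", "w", "ff", "i"])

# value_counts() already returns counts sorted in descending order,
# so an explicit sort_values(ascending=False) is optional
top_3 = s.value_counts().head(3).index.tolist()
```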

6. Now, let’s add a binary variable per top category to the train and test sets:

```python
for label in top_5:
    X_train[f"A6_{label}"] = np.where(
        X_train["A6"] == label, 1, 0)
    X_test[f"A6_{label}"] = np.where(
        X_test["A6"] == label, 1, 0)
```

7. Let’s display the top `10` rows of the original and encoded variable, `A6`, in the train set:

```python
X_train[["A6"] + [f"A6_{label}" for label in top_5]].head(10)
```

In the output of *step 7*, we can see the `A6` variable, followed by the binary variables:

```
     A6  A6_c  A6_q  A6_w  A6_i  A6_ff
596   c     1     0     0     0      0
303   q     0     1     0     0      0
204   w     0     0     1     0      0
351  ff     0     0     0     0      1
118   m     0     0     0     0      0
247   q     0     1     0     0      0
652   i     0     0     0     1      0
513   e     0     0     0     0      0
230  cc     0     0     0     0      0
250   e     0     0     0     0      0
```

We can automate one-hot encoding of frequent categories with Feature-engine. First, let’s load and divide the dataset, as we did in *step 2*.

8. Let’s set up the one-hot encoder to encode the five most frequent categories of the `A6` and `A7` variables:

```python
ohe_enc = OneHotEncoder(
    top_categories=5,
    variables=["A6", "A7"],
)
```

Tip

Feature-engine’s `OneHotEncoder()` will encode all categorical variables in the dataset by default unless we specify the variables to encode, as we did in *step 8*.

9. Let’s fit the encoder to the train set so that it learns and stores the most frequent categories of `A6` and `A7`:

```python
ohe_enc.fit(X_train)
```

Note

The number of frequent categories to encode is arbitrarily determined by the user.

10. Finally, let’s encode `A6` and `A7` in the train and test sets:

```python
X_train_enc = ohe_enc.transform(X_train)
X_test_enc = ohe_enc.transform(X_test)
```

You can view the new binary variables in the DataFrame by executing `X_train_enc.head()`. You can also find the top five categories learned by the encoder by executing `ohe_enc.encoder_dict_`.

Note

Feature-engine replaces the original variable with the binary ones returned by one-hot encoding, leaving the dataset ready to use in machine learning.

## How it works...

In this recipe, we performed one-hot encoding of the five most popular categories using pandas, NumPy, and Feature-engine.

In the first part of this recipe, we worked with the `A6` categorical variable. We inspected its unique categories with pandas `unique()`. Next, we counted the number of observations per category using pandas `value_counts()`, which returned a pandas Series with the categories as the index and the number of observations as values. We then sorted the categories from the one with the most to the one with the fewest observations using pandas `sort_values()`, and reduced the Series to the five most popular categories with pandas `head()`. We used this Series in a list comprehension to capture the names of the most frequent categories. After that, we looped over each category and, with NumPy’s `where()` function, created binary variables that take the value `1` if the observation shows the category, or `0` otherwise.
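The pandas and NumPy steps described above can be wrapped into a small helper function; the following is a sketch (the function name and the toy data are ours, not from the recipe):

```python
import numpy as np
import pandas as pd

def encode_top(train, test, variable, n=5):
    """One-hot encode the n most frequent categories of `variable`,
    learning those categories from the train set only to avoid leakage."""
    top = train[variable].value_counts().head(n).index
    for df in (train, test):
        for label in top:
            df[f"{variable}_{label}"] = np.where(
                df[variable] == label, 1, 0)
    return train, test

# Toy example: 'c' and 'q' are the two most frequent categories
train = pd.DataFrame({"A6": ["c", "c", "c", "q", "q", "w"]})
test = pd.DataFrame({"A6": ["c", "x"]})
train, test = encode_top(train, test, "A6", n=2)
```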

To perform one-hot encoding of the five most popular categories of the `A6` and `A7` variables with Feature-engine, we used `OneHotEncoder()`, indicating `5` in the `top_categories` argument and passing the variable names in a list to the `variables` argument. With `fit()`, the encoder learned the top categories from the train set and stored them in its `encoder_dict_` attribute. Then, with `transform()`, `OneHotEncoder()` replaced the original variables with the set of binary ones.

## There’s more...

This recipe is based on the winning solution of the KDD 2009 cup, *Winning the KDD Cup Orange Challenge with Ensemble Selection* (http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf), where the authors limited one-hot encoding to the 10 most frequent categories of each variable.