Python Feature Engineering Cookbook - Second Edition

By: Galli
Overview of this book

Feature engineering, the process of transforming variables and creating features, albeit time-consuming, ensures that your machine learning models perform seamlessly. This second edition of Python Feature Engineering Cookbook will take the struggle out of feature engineering by showing you how to use open source Python libraries to accelerate the process via a plethora of practical, hands-on recipes. This updated edition begins by addressing fundamental data challenges such as missing data and categorical values, before moving on to strategies for dealing with skewed distributions and outliers. The concluding chapters show you how to develop new features from various types of data, including text, time series, and relational databases. With the help of numerous open source Python libraries, you'll learn how to implement each feature engineering method in a performant, reproducible, and elegant manner. By the end of this Python book, you will have the tools and expertise needed to confidently build end-to-end and reproducible feature engineering pipelines that can be deployed into production.

Performing binary encoding

Binary encoding is a categorical encoding technique that uses binary code – that is, a sequence of zeros and ones – to represent the different categories of the variable. How does it work? First, the categories are arbitrarily replaced with ordinal numbers, as shown in the intermediate step of the following table. Then, those numbers are converted into binary code. For example, integer 1 can be represented as the sequence 01, integer 2 as 10, integer 3 as 11, and integer 0 as 00. The digits in the two positions of the binary string become the columns, which are the encoded representations of the original variable:

Figure 2.9 – Table showing the steps required for binary encoding of the color variable
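
To make the two steps in the figure concrete, here is a minimal sketch that reproduces them by hand with pandas; the color variable and its category values are hypothetical, chosen only for illustration:

    import pandas as pd

    # A hypothetical color variable with four categories.
    color = pd.Series(["blue", "green", "red", "yellow", "green"], name="color")

    # Step 1: arbitrarily replace each category with an ordinal integer.
    ordinal = color.astype("category").cat.codes  # blue=0, green=1, red=2, yellow=3

    # Step 2: write each integer in binary; each bit position becomes a column.
    n_bits = 2  # four categories fit in log2(4) = 2 binary digits
    encoded = pd.DataFrame(
        {f"color_{i}": (ordinal // 2**i) % 2 for i in reversed(range(n_bits))}
    )
    print(pd.concat([color, encoded], axis=1))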

Binary encoding encodes the data in fewer dimensions than one-hot encoding. In our example, the Color variable would be encoded into k-1 binary variables by one-hot encoding – that is, three variables – but with binary encoding, we can represent the variable with only two features. More generally, the number of binary features needed to encode a variable is log2(number of distinct categories), rounded up to the nearest integer; in our example, log2(4) = 2 binary features.

Binary encoding is an alternative to one-hot encoding in which we do not lose information about the variable, yet we obtain fewer features after the encoding. This is particularly useful for variables with high cardinality. For example, if a variable contains 128 unique categories, with one-hot encoding we would need 127 features to encode the variable, whereas with binary encoding we would only need 7 (log2(128) = 7). Thus, this encoding prevents the feature space from exploding. Binary-encoded features are also suitable for linear models. On the downside, the derived binary features lack human interpretability, so if we need to interpret the decisions made by our models, this encoding method may not be a suitable option.
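
As a quick arithmetic check of these claims, the following standard-library snippet compares the number of features each method needs for the category counts mentioned above:

    import math

    # One-hot encoding needs k-1 features; binary encoding needs log2(k),
    # rounded up to the nearest integer.
    for k in (4, 10, 128):
        print(f"{k} categories: one-hot = {k - 1}, binary = {math.ceil(math.log2(k))}")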

In this recipe, we will learn how to perform binary encoding using Category Encoders.

How to do it...

First, let’s import the necessary Python libraries and get the dataset ready:

  1. Import the required Python library, function, and class:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from category_encoders.binary import BinaryEncoder
  2. Let’s load the dataset and divide it into train and test sets:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  3. Let’s inspect the unique categories in A7:
    X_train["A7"].unique()

In the following output, we can see that A7 has 10 different categories:

array(['v', 'ff', 'h', 'dd', 'z', 'bb', 'j', 'Missing', 'n', 'o'], dtype=object)
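
Alternatively, we can count the categories directly; pandas’ nunique() returns the number of unique values:

    X_train["A7"].nunique()  # returns 10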
  4. Let’s create a binary encoder to encode A7:
    encoder = BinaryEncoder(cols=["A7"], drop_invariant=True)

Tip

BinaryEncoder(), like other encoders from the Category Encoders package, allows us to select which variables to encode: we simply pass the column names in a list to the cols argument. Setting drop_invariant=True additionally drops any resulting column with zero variance.
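
For example (the choice of A6 here is purely illustrative), we could encode two variables at once:

    encoder = BinaryEncoder(cols=["A6", "A7"], drop_invariant=True)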

  5. Let’s fit the transformer to the train set so that it determines how many binary variables it needs and creates the category-to-binary code mappings:
    encoder.fit(X_train)
  6. Finally, let’s encode A7 in the train and test sets:
    X_train_enc = encoder.transform(X_train)
    X_test_enc = encoder.transform(X_test)

We can display the top rows of the transformed train set by executing print(X_train_enc.head()), which returns the following output:

Figure 2.10 – DataFrame with the variables after binary encoding

Binary encoding returned four binary variables for A7, which are A7_0, A7_1, A7_2, and A7_3, instead of the nine that would have been returned by one-hot encoding.
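
As a quick sanity check (assuming the fitted encoder and the transformed sets from the previous steps), we can confirm that the new columns replaced the original variable:

    # List the encoded columns and confirm the original A7 column is gone.
    print([col for col in X_train_enc.columns if col.startswith("A7_")])
    print("A7" in X_train_enc.columns)  # expected: False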

How it works...

In this recipe, we performed binary encoding using the Category Encoders package. First, we loaded the dataset and divided it into train and test sets using train_test_split() from scikit-learn. Next, we used BinaryEncoder() to encode the A7 variable. With the fit() method, BinaryEncoder() created a mapping from each category to its binary code, and with the transform() method, it encoded the A7 variable in both the train and test sets.
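
To see the fit/transform split in isolation, here is a self-contained toy illustration; the data is invented here and is not the book’s dataset:

    import pandas as pd
    from category_encoders.binary import BinaryEncoder

    toy_train = pd.DataFrame({"A7": ["v", "ff", "h", "dd", "z"]})
    toy_test = pd.DataFrame({"A7": ["z", "v"]})

    enc = BinaryEncoder(cols=["A7"])
    enc.fit(toy_train)               # learns the category-to-binary mapping
    print(enc.transform(toy_train))  # encodes the train set with that mapping
    print(enc.transform(toy_test))   # the test set reuses the mapping learned from train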

Tip

With one-hot encoding, we would have created nine binary variables (k-1 = 10 unique categories - 1 = 9) to encode all of the information in A7. With binary encoding, we can represent the variable in fewer dimensions: log2(10) ≈ 3.32, which rounds up to 4; that is, we only need four binary variables.

See also

For more information about BinaryEncoder(), visit https://contrib.scikit-learn.org/category_encoders/binary.html.

For a nice example of the output of binary encoding, check out the following resource: https://stats.stackexchange.com/questions/325263/binary-encoding-vs-one-hot-encoding.

For a comparative study of categorical encoding techniques for neural network classifiers, visit https://www.researchgate.net/publication/320465713_A_Comparative_Study_of_Categorical_Variable_Encoding_Techniques_for_Neural_Network_Classifiers.
