Python Feature Engineering Cookbook - Second Edition

By: Soledad Galli

Overview of this book

Feature engineering, the process of transforming variables and creating features, is time-consuming, but it is essential for your machine learning models to perform well. This second edition of Python Feature Engineering Cookbook takes the struggle out of feature engineering by showing you how to use open source Python libraries to accelerate the process through a plethora of practical, hands-on recipes. It begins by addressing fundamental data challenges such as missing data and categorical values, before moving on to strategies for dealing with skewed distributions and outliers. The concluding chapters show you how to develop new features from various types of data, including text, time series, and relational databases. With the help of numerous open source Python libraries, you'll learn how to implement each feature engineering method in a performant, reproducible, and elegant manner. By the end of this Python book, you will have the tools and expertise needed to confidently build end-to-end and reproducible feature engineering pipelines that can be deployed into production.

Technical requirements

In this chapter, we will use the pandas, NumPy, and Matplotlib Python libraries, as well as scikit-learn and Feature-engine. For guidelines on how to obtain these libraries, visit the Technical requirements section of Chapter 1, Imputing Missing Data.
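If you have not installed these libraries yet, one convenient way, assuming the standard PyPI package names, is to install them all at once with pip (see Chapter 1 for detailed version guidance):

pip install pandas numpy matplotlib scikit-learn feature-engine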

We will also use the open-source Category Encoders Python library, which can be installed using pip:

pip install category_encoders

To learn more about Category Encoders, visit the following link: https://contrib.scikit-learn.org/category_encoders/.
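As a quick orientation (a sketch, not one of this chapter's recipes), the encoders in Category Encoders follow the familiar scikit-learn fit/transform API. The toy DataFrame below is made up purely for illustration:

    import pandas as pd
    import category_encoders as ce

    # A made-up DataFrame with a single categorical column.
    df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

    # Encoders expose the scikit-learn fit/transform interface.
    encoder = ce.OneHotEncoder(cols=["color"], use_cat_names=True)
    encoded = encoder.fit_transform(df)
    print(encoded.head())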

We will also use the Credit Approval dataset, which is available in the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/credit+approval.

To prepare the dataset, follow these steps:

  1. Visit http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/ and click on crx.data to download the data:
Figure 2.1 – The index directory for the Credit Approval dataset

  2. Save crx.data to the folder where you will run the following commands.
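Alternatively, if you prefer to fetch the file programmatically, pandas can read it straight from the UCI server. This sketch assumes the file is still hosted at the path shown, which may change over time:

    import pandas as pd

    # Read crx.data directly from the UCI repository and store a local copy.
    url = (
        "http://archive.ics.uci.edu/ml/machine-learning-databases/"
        "credit-screening/crx.data"
    )
    data = pd.read_csv(url, header=None)
    data.to_csv("crx.data", index=False, header=False)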

After downloading the data, open up a Jupyter Notebook and run the following commands.

  1. Import the required libraries:
    import random
    import numpy as np
    import pandas as pd
  2. Load the data:
    data = pd.read_csv("crx.data", header=None)
  3. Create a list containing the variable names:
    varnames = [f"A{s}" for s in range(1, 17)]
  4. Add the variable names to the DataFrame:
    data.columns = varnames
  5. Replace the question marks in the dataset with NumPy NaN values:
    data = data.replace("?", np.nan)
  6. Cast some numerical variables as float data types:
    data["A2"] = data["A2"].astype("float")
    data["A14"] = data["A14"].astype("float")
  7. Encode the target variable as binary:
    data["A16"] = data["A16"].map({"+": 1, "-": 0})
  8. Rename the target variable:
    data.rename(columns={"A16": "target"}, inplace=True)
  9. Make lists that contain categorical and numerical variables:
    cat_cols = [
        c for c in data.columns if data[c].dtype == "O"]
    num_cols = [
        c for c in data.columns if data[c].dtype != "O"]
  10. Fill in the missing data:
    data[num_cols] = data[num_cols].fillna(0)
    data[cat_cols] = data[cat_cols].fillna("Missing")
  11. Save the prepared data:
    data.to_csv("credit_approval_uci.csv", index=False)
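As a quick sanity check (not part of the recipe itself), you can reload the stored file and confirm that the preparation worked:

    # Reload the prepared dataset and verify the result.
    df = pd.read_csv("credit_approval_uci.csv")
    print(df.shape)  # the Credit Approval data has 690 rows and 16 columns
    print(df.isnull().sum().sum())  # should be 0 after the fillna steps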

You can find a Jupyter Notebook that contains these commands in this book’s GitHub repository at https://github.com/PacktPublishing/Python-Feature-Engineering-Cookbook-Second-Edition/blob/main/ch02-categorical-encoding/donwload-prepare-store-credit-approval-dataset.ipynb.

Note

Some libraries require that you have already imputed missing data, for which you can use any of the recipes from Chapter 1, Imputing Missing Data.
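For example, a minimal sketch using scikit-learn's SimpleImputer, one of several options covered in Chapter 1, could look like this:

    from sklearn.impute import SimpleImputer

    # Replace missing categorical values with the most frequent category.
    imputer = SimpleImputer(strategy="most_frequent")
    data[cat_cols] = imputer.fit_transform(data[cat_cols])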