Python Feature Engineering Cookbook

By: Soledad Galli

Overview of this book

Feature engineering is invaluable for developing and enriching your machine learning models. In this cookbook, you will work with the best tools to streamline your feature engineering pipelines and techniques, and to simplify and improve the quality of your code. Using Python libraries such as pandas, scikit-learn, Featuretools, and Feature-engine, you’ll learn how to work with both continuous and discrete datasets and be able to transform features from unstructured datasets. You will develop the skills necessary to select the best features as well as the most suitable extraction techniques. This book covers Python recipes that will help you automate feature engineering to simplify complex processes. You’ll also get to grips with different feature engineering strategies, such as the Box-Cox transform, power transform, and log transform, across machine learning, reinforcement learning, and natural language processing (NLP) domains. By the end of this book, you’ll have discovered tips and practical solutions to all of your feature engineering problems.

Preface

Who this book is for

What this book covers

To get the most out of this book

Sections

Get in touch

Foreseeing Variable Problems When Building ML Models

Technical requirements

Identifying numerical and categorical variables

Quantifying missing data

Determining cardinality in categorical variables

Pinpointing rare categories in categorical variables

Identifying a linear relationship

Identifying a normal distribution

Distinguishing variable distribution

Highlighting outliers

Comparing feature magnitude

Imputing Missing Data

Technical requirements

Removing observations with missing data

Performing mean or median imputation

Implementing mode or frequent category imputation

Replacing missing values with an arbitrary number

Capturing missing values in a bespoke category

Replacing missing values with a value at the end of the distribution

Implementing random sample imputation

Adding a missing value indicator variable

Performing multivariate imputation by chained equations

Assembling an imputation pipeline with scikit-learn

Assembling an imputation pipeline with Feature-engine

Encoding Categorical Variables

Technical requirements

Creating binary variables through one-hot encoding

Performing one-hot encoding of frequent categories

Replacing categories with ordinal numbers

Replacing categories with counts or frequency of observations

Encoding with integers in an ordered manner

Encoding with the mean of the target

Encoding with the Weight of Evidence

Grouping rare or infrequent categories

Performing binary encoding

Performing feature hashing

Transforming Numerical Variables

Technical requirements

Transforming variables with the logarithm

Transforming variables with the reciprocal function

Using square and cube root to transform variables

Using power transformations on numerical variables

Performing Box-Cox transformation on numerical variables

Performing Yeo-Johnson transformation on numerical variables

Performing Variable Discretization

Technical requirements

Dividing the variable into intervals of equal width

Sorting the variable values in intervals of equal frequency

Performing discretization followed by categorical encoding

Allocating the variable values in arbitrary intervals

Performing discretization with k-means clustering

Using decision trees for discretization

Working with Outliers

Technical requirements

Trimming outliers from the dataset

Performing winsorization

Capping the variable at arbitrary maximum and minimum values

Performing zero-coding – capping the variable at zero

Deriving Features from Dates and Time Variables

Technical requirements

Extracting date and time parts from a datetime variable

Deriving representations of the year and month

Creating representations of day and week

Extracting time parts from a time variable

Capturing the elapsed time between datetime variables

Working with time in different time zones

Performing Feature Scaling

Technical requirements

Standardizing the features

Performing mean normalization

Scaling to the maximum and minimum values

Implementing maximum absolute scaling

Scaling with the median and quantiles

Scaling to vector unit length

Applying Mathematical Computations to Features

Technical requirements

Combining multiple features with statistical operations

Combining pairs of features with mathematical functions

Performing polynomial expansion

Deriving new features with decision trees

Carrying out PCA

Creating Features with Transactional and Time Series Data

Technical requirements

Aggregating transactions with mathematical operations

Aggregating transactions in a time window

Determining the number of local maxima and minima

Deriving time elapsed between time-stamped events

Creating features from transactions with Featuretools

Extracting Features from Text Variables

Technical requirements

Counting characters, words, and vocabulary

Estimating text complexity by counting sentences

Creating features with bag-of-words and n-grams

Implementing term frequency-inverse document frequency

Cleaning and stemming text variables

Other Books You May Enjoy

Leave a review - let other readers know what you think

Quantifying missing data

Missing data refers to the absence of a value for observations and is a common occurrence in most datasets. Most estimators in scikit-learn, the open source Python library for machine learning, do not accept missing values as input, so we typically need to replace these values with numbers before training a model. To select a missing data imputation technique, it is important to know how much information is missing in each variable. In this recipe, we will learn how to identify and quantify missing data using pandas and how to plot the percentage of missing data per variable.

Getting ready

In this recipe, we will use the KDD-CUP-98 dataset from the UCI Machine Learning Repository. To download this dataset, follow the instructions in the Technical requirements section of this chapter.

How to do it...

First, let's import the required Python libraries:

import pandas as pd
import matplotlib.pyplot as plt

Let's load a few variables from the dataset into a pandas dataframe and inspect the first five rows:

cols = [
    'AGE', 'NUMCHLD', 'INCOME', 'WEALTH1', 'MBCRAFT', 'MBGARDEN',
    'MBBOOKS', 'MBCOLECT', 'MAGFAML', 'MAGFEM', 'MAGMALE',
]

data = pd.read_csv('cup98LRN.txt', usecols=cols)
data.head()

After loading the dataset, running head() from a Jupyter Notebook displays the first five rows of the dataframe, with one column for each of the variables we selected.

Let's calculate the number of missing values in each variable:

data.isnull().sum()

The number of missing values per variable can be seen in the following output:

AGE         23665
NUMCHLD     83026
INCOME      21286
WEALTH1     44732
MBCRAFT     52854
MBGARDEN    52854
MBBOOKS     52854
MBCOLECT    52914
MAGFAML     52854
MAGFEM      52854
MAGMALE     52854
dtype: int64

Let's quantify the percentage of missing values in each variable:

data.isnull().mean()

The percentages of missing values per variable can be seen in the following output, expressed as decimals:

AGE         0.248030
NUMCHLD     0.870184
INCOME      0.223096
WEALTH1     0.468830
MBCRAFT     0.553955
MBGARDEN    0.553955
MBBOOKS     0.553955
MBCOLECT    0.554584
MAGFAML     0.553955
MAGFEM      0.553955
MAGMALE     0.553955
dtype: float64
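The counts and percentages from the two previous steps can also be gathered into a single table, which makes it easier to spot the most affected variables. The following is a minimal sketch rather than part of the original recipe: it uses a small made-up dataframe (the column names mirror the recipe, but the values are invented and are not taken from KDD-CUP-98), and the helper name missing_summary is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: the values are invented for illustration only.
toy = pd.DataFrame({
    "AGE": [60.0, np.nan, 46.0, 70.0],
    "NUMCHLD": [np.nan, np.nan, np.nan, 1.0],
    "INCOME": [3.0, 5.0, 6.0, 1.0],
})


def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    mask = df.isnull()
    summary = pd.DataFrame({
        "n_missing": mask.sum(),           # number of missing values
        "pct_missing": mask.mean() * 100,  # fraction converted to a percentage
    })
    # Show the most affected variables first.
    return summary.sort_values("pct_missing", ascending=False)


summary = missing_summary(toy)
print(summary)
```

Applied to the dataframe from this recipe, the same helper could be called as missing_summary(data).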

Finally, let's make a bar plot with the percentage of missing values per variable:

data.isnull().mean().plot.bar(figsize=(12,6))
plt.ylabel('Percentage of missing values')
plt.xlabel('Variables')
plt.title('Quantifying missing data')

The preceding code block returns a bar plot that displays the percentage of missing data per variable, with one bar per variable.

We can change the figure size using the figsize argument of pandas plot.bar(), and we can add x and y labels and a title with the plt.xlabel(), plt.ylabel(), and plt.title() functions from Matplotlib to make the plot easier to read.
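As a possible variation, the values can be multiplied by 100 and sorted before plotting, so that the y axis shows true percentages and the most incomplete variables appear first. This is a sketch under stated assumptions: the dataframe is a small invented example rather than KDD-CUP-98, and the Agg backend is selected only so that the code also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a Jupyter Notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data: the values are invented for illustration only.
toy = pd.DataFrame({
    "AGE": [60.0, np.nan, 46.0, 70.0],
    "NUMCHLD": [np.nan, np.nan, np.nan, 1.0],
    "INCOME": [3.0, 5.0, 6.0, 1.0],
})

# Percentage of missing values per variable, largest first.
pct = toy.isnull().mean().mul(100).sort_values(ascending=False)

# plot.bar() returns a Matplotlib Axes object that we can label directly.
ax = pct.plot.bar(figsize=(6, 3))
ax.set_ylabel("Missing values (%)")
ax.set_xlabel("Variables")
ax.set_title("Quantifying missing data")
plt.tight_layout()
```

Labelling through the returned Axes object is equivalent to calling plt.xlabel(), plt.ylabel(), and plt.title() right after the plot is created.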

How it works...

In this recipe, we quantified and displayed the number and percentage of missing values in a publicly available dataset.

To load data from the txt file into a dataframe, we used the pandas read_csv() function. To load only certain columns from the original data, we created a list with the column names and passed this list to the usecols argument of read_csv(). Then, we used the head() method to display the top five rows of the dataframe, along with the variable names and some of their values.
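To see the effect of usecols without downloading the dataset, we can read a small in-memory text buffer instead of a file. This is only an illustrative sketch: the buffer contents, including the extra STATE column, are invented and simply stand in for a delimited text file.

```python
import io

import pandas as pd

# A tiny in-memory stand-in for a delimited text file (contents are invented).
raw = io.StringIO(
    "AGE,NUMCHLD,INCOME,STATE\n"
    "60,,3,IL\n"
    ",1,,CA\n"
    "46,,6,NC\n"
)

# Only the listed columns are parsed; the others are skipped.
cols = ["AGE", "INCOME"]
data = pd.read_csv(raw, usecols=cols)

# Empty fields are read as missing values (NaN).
print(data.head())
```

Restricting the columns at read time keeps the dataframe small, which helps with wide files where only a few variables are of interest.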

To identify missing observations, we used pandas isnull(). This created a boolean vector per variable, with each vector indicating whether the value was missing (True) or not (False) for each row of the dataset. Then, we used the pandas sum() and mean() methods to operate over these boolean vectors and calculate the total number or the percentage of missing values, respectively. The sum() method sums the True values of the boolean vectors to find the total number of missing values, whereas the mean() method takes the average of these values and returns the percentage of missing data, expressed as decimals.
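The relationship between the boolean vectors and the two summaries can be checked on a tiny example. The following sketch uses an invented four-row dataframe; it shows that the mean of each boolean vector equals the count of True values divided by the number of rows.

```python
import numpy as np
import pandas as pd

# Toy data: the values are invented for illustration only.
toy = pd.DataFrame({
    "AGE": [60.0, np.nan, 46.0, np.nan],
    "INCOME": [3.0, 5.0, np.nan, 1.0],
})

# One boolean vector per variable: True where the value is missing.
mask = toy.isnull()
print(mask)

# True counts as 1 and False as 0, so sum() counts the missing values...
n_missing = mask.sum()

# ...and mean() divides that count by the number of rows.
frac_missing = mask.mean()

print(n_missing)
print(frac_missing)
```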

To display the percentages of the missing values in a bar plot, we used pandas isnull() and mean(), followed by plot.bar(), and modified the plot by adding axis labels and a title with the xlabel(), ylabel(), and title() Matplotlib functions.
