Python Feature Engineering Cookbook

By: Soledad Galli

Overview of this book

Feature engineering is invaluable for developing and enriching your machine learning models. In this cookbook, you will work with tools that streamline your feature engineering pipelines and techniques, simplify your code, and improve its quality. Using Python libraries such as pandas, scikit-learn, Featuretools, and Feature-engine, you’ll learn how to work with both continuous and discrete datasets and how to transform features from unstructured datasets. You will develop the skills necessary to select the best features as well as the most suitable extraction techniques. This book covers Python recipes that help you automate feature engineering to simplify complex processes. You’ll also get to grips with different feature engineering strategies, such as the Box-Cox transform, power transform, and log transform, across machine learning, reinforcement learning, and natural language processing (NLP) domains. By the end of this book, you’ll have discovered tips and practical solutions to common feature engineering problems.

Preface

Who this book is for

What this book covers

To get the most out of this book

Sections

Get in touch

Foreseeing Variable Problems When Building ML Models

Technical requirements

Identifying numerical and categorical variables

Quantifying missing data

Determining cardinality in categorical variables

Pinpointing rare categories in categorical variables

Identifying a linear relationship

Identifying a normal distribution

Distinguishing variable distribution

Highlighting outliers

Comparing feature magnitude

Imputing Missing Data

Technical requirements

Removing observations with missing data

Performing mean or median imputation

Implementing mode or frequent category imputation

Replacing missing values with an arbitrary number

Capturing missing values in a bespoke category

Replacing missing values with a value at the end of the distribution

Implementing random sample imputation

Adding a missing value indicator variable

Performing multivariate imputation by chained equations

Assembling an imputation pipeline with scikit-learn

Assembling an imputation pipeline with Feature-engine

Encoding Categorical Variables

Technical requirements

Creating binary variables through one-hot encoding

Performing one-hot encoding of frequent categories

Replacing categories with ordinal numbers

Replacing categories with counts or frequency of observations

Encoding with integers in an ordered manner

Encoding with the mean of the target

Encoding with the Weight of Evidence

Grouping rare or infrequent categories

Performing binary encoding

Performing feature hashing

Transforming Numerical Variables

Technical requirements

Transforming variables with the logarithm

Transforming variables with the reciprocal function

Using square and cube root to transform variables

Using power transformations on numerical variables

Performing Box-Cox transformation on numerical variables

Performing Yeo-Johnson transformation on numerical variables

Performing Variable Discretization

Technical requirements

Dividing the variable into intervals of equal width

Sorting the variable values in intervals of equal frequency

Performing discretization followed by categorical encoding

Allocating the variable values in arbitrary intervals

Performing discretization with k-means clustering

Using decision trees for discretization

Working with Outliers

Technical requirements

Trimming outliers from the dataset

Performing winsorization

Capping the variable at arbitrary maximum and minimum values

Performing zero-coding – capping the variable at zero

Deriving Features from Dates and Time Variables

Technical requirements

Extracting date and time parts from a datetime variable

Deriving representations of the year and month

Creating representations of day and week

Extracting time parts from a time variable

Capturing the elapsed time between datetime variables

Working with time in different time zones

Performing Feature Scaling

Technical requirements

Standardizing the features

Performing mean normalization

Scaling to the maximum and minimum values

Implementing maximum absolute scaling

Scaling with the median and quantiles

Scaling to vector unit length

Applying Mathematical Computations to Features

Technical requirements

Combining multiple features with statistical operations

Combining pairs of features with mathematical functions

Performing polynomial expansion

Deriving new features with decision trees

Carrying out PCA

Creating Features with Transactional and Time Series Data

Technical requirements

Aggregating transactions with mathematical operations

Aggregating transactions in a time window

Determining the number of local maxima and minima

Deriving time elapsed between time-stamped events

Creating features from transactions with Featuretools

Extracting Features from Text Variables

Technical requirements

Counting characters, words, and vocabulary

Estimating text complexity by counting sentences

Creating features with bag-of-words and n-grams

Implementing term frequency-inverse document frequency

Cleaning and stemming text variables

Other Books You May Enjoy

Leave a review - let other readers know what you think

Identifying numerical and categorical variables

Numerical variables can be discrete or continuous. Discrete variables take values from a countable set, generally whole numbers such as 1, 2, and 3. Examples of discrete variables include the number of children, the number of pets, or the number of bank accounts. Continuous variables can take any value within a range. Examples of continuous variables include the price of a product, income, house price, or interest rate. Categorical variables take values from a group of categories, also called labels. Examples of categorical variables include gender, which takes values such as male and female, or country of birth, which takes values such as Argentina, Germany, and so on.

In this recipe, we will learn how to identify continuous, discrete, and categorical variables by inspecting their values and the data type that pandas assigns to them when the data is loaded.

Getting ready

Discrete variables are usually of the int type, continuous variables are usually of the float type, and categorical variables are usually of the object type when they're stored in pandas. However, discrete variables can also be cast as floats, while numerical variables can be cast as objects. Therefore, to correctly identify variable types, we need to look at the data type and inspect their values as well. Make sure you have the correct library versions installed and that you've downloaded a copy of the Titanic dataset, as described in the Technical requirements section.
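To see why the data type alone can mislead, here is a minimal sketch on a small made-up dataframe (the column names and values are hypothetical and are not taken from the Titanic dataset). It assumes only standard pandas and NumPy calls: a discrete count column that contains a missing value is stored as float, and numbers that were read in as text are not recognized as numeric until they are converted.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, for illustration only).
toy = pd.DataFrame({
    "n_children": [0, 2, np.nan, 1],           # discrete, but the NaN forces float64
    "price": ["10.5", "7.25", "3.0", "12.0"],  # numbers stored as strings
    "city": ["Rome", "Lima", "Oslo", "Rome"],  # genuinely categorical
})

# The dtypes alone suggest one continuous and two text variables.
print(toy.dtypes)

# pd.to_numeric recovers the numbers hidden in the text column.
price_as_number = pd.to_numeric(toy["price"])
print(price_as_number.dtype)
```

In this sketch, n_children looks continuous and price looks categorical if we trust the data types alone, which is why the recipe also inspects the values.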

How to do it...

First, let's load the libraries that are required for this recipe:

import pandas as pd
import matplotlib.pyplot as plt

Load the Titanic dataset and inspect the variable types:

data = pd.read_csv('titanic.csv')
data.dtypes

The variable types are as follows:

pclass         int64
survived       int64
name          object
sex           object
age          float64
sibsp          int64
parch          int64
ticket        object
fare         float64
cabin         object
embarked      object
boat          object
body         float64
home.dest     object
dtype: object

In many datasets, integer variables are cast as float. So, after inspecting the data type of the variable, even if you get float as output, go ahead and check the unique values to make sure that those variables are discrete and not continuous.

Inspect the distinct values of the sibsp discrete variable:

data['sibsp'].unique()

The possible values that sibsp can take are shown in the following output:

array([0, 1, 2, 3, 4, 5, 8], dtype=int64)

Now, let's inspect the first 20 distinct values of the continuous variable fare:

data['fare'].unique()[0:20]

The following output shows the first 20 distinct values of fare:

array([211.3375, 151.55  ,  26.55  ,  77.9583,   0.    ,  51.4792,
        49.5042, 227.525 ,  69.3   ,  78.85  ,  30.    ,  25.925 ,
       247.5208,  76.2917,  75.2417,  52.5542, 221.7792,  26.    ,
        91.0792, 135.6333])

Go ahead and inspect the values of the embarked and cabin variables by using the unique() method, as we did for sibsp and fare.

The embarked variable contains strings as values, which means it's categorical, whereas cabin contains a mix of letters and numbers, which means it can be classified as a mixed type of variable.
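The recipe notes that integer variables are often cast as float, so the unique values need checking. That check can also be done programmatically. The following is a rough sketch, not part of the original recipe: holds_only_whole_numbers is a hypothetical helper name, and the two series are made-up examples.

```python
import numpy as np
import pandas as pd

def holds_only_whole_numbers(series: pd.Series) -> bool:
    """Return True if every non-missing value in a numeric series is a whole number."""
    values = series.dropna()
    return bool((values == values.round()).all())

# Made-up examples: a count stored as float, and a genuinely continuous variable.
siblings = pd.Series([1.0, 0.0, np.nan, 3.0])
fares = pd.Series([7.25, 71.2833, 8.05])

print(holds_only_whole_numbers(siblings))  # True
print(holds_only_whole_numbers(fares))     # False
```

A float column that passes this check is a candidate discrete variable, although a continuous variable that happens to be recorded in whole units would pass as well, so inspecting the values remains worthwhile.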

How it works...

In this recipe, we identified the variable types of a publicly available dataset by inspecting the data type in which the variables are cast and the distinct values they take. First, we used pandas read_csv() to load the data from a CSV file into a dataframe. Next, we used the pandas dtypes attribute to display the data types in which the variables are cast, which can be float for continuous variables, int for integers, and object for strings. We observed that the continuous variable fare was cast as float, the discrete variable sibsp was cast as int, and the categorical variable embarked was cast as object. Finally, we identified the distinct values of a variable with the unique() method from pandas. We used unique() together with a slice, [0:20], to output the first 20 unique values of fare, since this variable has a large number of distinct values.
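The steps described above (data type, then distinct values) can be gathered into one table per dataframe. The sketch below is one possible way to do this and is not the book's code: summarize_variables is a hypothetical helper, and the toy dataframe only borrows three column names from the Titanic dataset with made-up values.

```python
import pandas as pd

def summarize_variables(df: pd.DataFrame, n_samples: int = 5) -> pd.DataFrame:
    """List each column's dtype, number of distinct values, and a few sample values."""
    samples = pd.Series(
        [df[col].dropna().drop_duplicates().head(n_samples).tolist() for col in df.columns],
        index=df.columns,
    )
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),  # how pandas stores the column
        "n_unique": df.nunique(),        # distinct non-missing values
        "sample": samples,               # first few distinct values, in order of appearance
    })

# Toy data with made-up values.
toy = pd.DataFrame({
    "sibsp": [0, 1, 1, 0, 2],
    "fare": [7.25, 71.28, 8.05, 53.1, 8.46],
    "embarked": ["S", "C", "S", "S", "Q"],
})

summary = summarize_variables(toy)
print(summary)
```

A numeric column with few distinct values is likely discrete, one with many distinct values is likely continuous, and a text column is likely categorical. The table is a starting point for inspection and does not replace it.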

There's more...

To understand whether a variable is continuous or discrete, we can also make a histogram:

Let's make a histogram for the sibsp variable by dividing the variable value range into 20 intervals:

data['sibsp'].hist(bins=20)

In the resulting plot (figure not reproduced here), note how the histogram of a discrete variable has a broken, discrete shape.

Now, let's make a histogram of the fare variable by sorting the values into 50 contiguous intervals:

data['fare'].hist(bins=50)

In the resulting plot (figure not reproduced here), the histogram of a continuous variable shows values throughout the variable value range.
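The difference in histogram shape can also be expressed as a number, without plotting. As a rough sketch (an assumption of this example, not a method from the recipe), we can bin the values with NumPy's histogram function and measure the fraction of bins that stay empty. The helper name fraction_of_empty_bins and the two series are hypothetical.

```python
import numpy as np
import pandas as pd

def fraction_of_empty_bins(series: pd.Series, bins: int) -> float:
    """Bin the non-missing values into equal-width intervals and return the share of empty bins."""
    counts, _ = np.histogram(series.dropna(), bins=bins)
    return float((counts == 0).mean())

# Made-up discrete variable with the values 0, 1, 2, 3, 4, 5, and 8.
discrete = pd.Series([0, 1, 1, 0, 2, 3, 0, 1, 8, 4, 5])

# Made-up continuous variable: 1,000 uniform draws with a fixed seed.
rng = np.random.default_rng(0)
continuous = pd.Series(rng.uniform(0, 100, size=1000))

print(fraction_of_empty_bins(discrete, bins=20))    # 0.65
print(fraction_of_empty_bins(continuous, bins=50))
```

A high fraction of empty bins corresponds to the broken shape seen for discrete variables, while a fraction near zero corresponds to values spread throughout the range. The result depends on the number of bins and on the sample size, so treat it as a quick indication only.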
