Implementing random sample imputation

Random sample imputation consists of extracting random observations from the pool of available values in the variable. Unlike the other imputation techniques we've discussed in this chapter, it preserves the variable's original distribution, and it is suitable for numerical and categorical variables alike. In this recipe, we will implement random sample imputation with pandas and Feature-engine.
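Before we work with the dataset, here is a minimal, self-contained sketch of the idea on a toy pandas Series; the values are made up purely for illustration:

import numpy as np
import pandas as pd

# toy Series with missing values (hypothetical data, for illustration only)
s = pd.Series([2.0, 4.0, np.nan, 8.0, np.nan, 4.0, 6.0])

# draw as many random values from the observed pool as there are NAs
n_missing = s.isnull().sum()
random_sample = s.dropna().sample(n_missing, random_state=0)

# align the sample's index with the rows that contain NAs, then fill them
random_sample.index = s[s.isnull()].index
s.loc[s.isnull()] = random_sample
print(s)

Because the replacements are drawn from the observed values themselves, the imputed Series keeps approximately the same distribution as the original variable.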

How to do it...

Let's begin by importing the required libraries and tools and preparing the dataset:

  1. Let's import pandas, the train_test_split function from scikit-learn, and RandomSampleImputer from Feature-engine:
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.missing_data_imputers import RandomSampleImputer
  2. Let's load the dataset:
data = pd.read_csv('creditApprovalUCI.csv')
  3. The random values that will be used to replace missing data should be extracted from the train set, so let's separate the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3,
    random_state=0)
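
Before imputing, it can help to see how much data is missing in each variable. The following quick check is not part of the recipe's numbered steps, just an optional inspection; the exact output depends on your copy of the dataset:

# optional: fraction of missing values per variable in the train set
print(X_train.isnull().mean().sort_values(ascending=False))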

First, we will run the commands line by line to understand their output. Then, we will execute them in a loop to impute several variables. In random sample imputation, we extract as many random values as there are missing values in the variable.

  4. Let's calculate the number of missing values in the A2 variable:
number_na = X_train['A2'].isnull().sum()
  5. If you print the number_na variable, you will obtain 11 as output, which is the number of missing values in A2. Thus, let's extract 11 values at random from A2 for the imputation:
random_sample_train = X_train['A2'].dropna().sample(
    number_na, random_state=0)
  6. We can only use one pandas Series to replace values in another pandas Series if their indexes are identical, so let's re-index the extracted random values so that they match the index of the missing values in the original dataframe (a short sketch after step 8 demonstrates this alignment behavior):
random_sample_train.index = X_train[X_train['A2'].isnull()].index
  7. Now, let's replace the missing values in the original dataset with randomly extracted values:
X_train.loc[X_train['A2'].isnull(), 'A2'] = random_sample_train
  8. Now, let's combine steps 4 to 7 in a loop to replace the missing data in multiple variables in both the train and test sets:
for var in ['A1', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8']:

    # extract a random sample
    random_sample_train = X_train[var].dropna().sample(
        X_train[var].isnull().sum(), random_state=0)

    random_sample_test = X_train[var].dropna().sample(
        X_test[var].isnull().sum(), random_state=0)

    # re-index the randomly extracted sample
    random_sample_train.index = X_train[
        X_train[var].isnull()].index
    random_sample_test.index = X_test[X_test[var].isnull()].index

    # replace the NA
    X_train.loc[X_train[var].isnull(), var] = random_sample_train
    X_test.loc[X_test[var].isnull(), var] = random_sample_test
Note that we always extract the replacement values from the train set, but we calculate the number of missing values and the index from the set we are imputing, be it train or test.
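
As mentioned in step 6, here is a small, hypothetical demonstration of why the re-indexing matters. pandas aligns Series assignments on the index, so without re-indexing the missing rows would simply remain NaN:

import numpy as np
import pandas as pd

# hypothetical toy data: the target's index does not match the sample's
target = pd.Series([np.nan, np.nan], index=[10, 42])
sample = pd.Series([1.0, 2.0], index=[0, 1])

target.loc[target.isnull()] = sample   # indexes don't align: still all NaN
print(target)

sample.index = target[target.isnull()].index   # re-index to match
target.loc[target.isnull()] = sample           # now the NAs are filled
print(target)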

To finish, let's impute missing values using Feature-engine. First, we need to separate the data into train and test, just like we did in step 3 of this recipe.

  9. Next, let's set up RandomSampleImputer() and fit it to the train set:
imputer = RandomSampleImputer()
imputer.fit(X_train)
By default, RandomSampleImputer() will replace the missing values in all variables in the dataset.

We can restrict imputation to a subset of variables by passing their names in a list: imputer = RandomSampleImputer(variables=['A2', 'A3']).
  10. Finally, let's replace the missing values:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
To obtain reproducibility between code runs, we can set random_state to a number when we initialize RandomSampleImputer(); the imputer will use this random_state at each run of the transform() method.
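
Putting the two previous notes together, a minimal sketch of a reproducible, variable-specific setup might look like this (the variables and random_state arguments are the ones described above; note that the import path has changed in more recent Feature-engine releases, so check your installed version):

from feature_engine.missing_data_imputers import RandomSampleImputer

# impute only A2 and A3, with a fixed seed for reproducible draws
imputer = RandomSampleImputer(variables=['A2', 'A3'], random_state=0)
imputer.fit(X_train)

X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)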

How it works...

In this recipe, we replaced missing values in the numerical and categorical variables of the Credit Approval Data Set with values extracted at random from the same variables using pandas and Feature-engine. First, we loaded the data and divided it into train and test sets using train_test_split(), as described in the Performing mean or median imputation recipe.

To perform random sample imputation with pandas, we calculated the number of missing values in the variable using pandas' isnull() followed by sum(). Next, we used pandas' dropna() to remove the missing observations from the variable in the train set, so that pandas' sample() extracted values only from observations with data; we drew as many observations as there were missing values in the variable to impute. Next, we re-indexed the pandas Series of randomly extracted values so that we could assign them to the missing observations in the original dataframe. Finally, we replaced the missing values with the randomly extracted values using pandas' loc, which takes as arguments the location of the rows with missing data and the name of the column to which the new values are to be assigned.

We also carried out random sample imputation with RandomSampleImputer() from Feature-engine. With the fit() method, the RandomSampleImputer() stores a copy of the train set. With transform(), the imputer extracts values at random from the stored dataset and replaces the missing information with them, thereby returning complete pandas dataframes.
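
As a rough mental model of that fit/transform behavior, here is a simplified sketch; this is not Feature-engine's actual implementation, just an illustration of the mechanics described above:

class MinimalRandomSampleImputer:
    """Simplified sketch of the fit/transform behavior described above."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit(self, X):
        # fit() stores a copy of the train set to sample from later
        self.X_ = X.copy()
        return self

    def transform(self, X):
        # transform() draws random values from the stored train set and
        # places them in the rows with missing data, column by column
        X = X.copy()
        for var in X.columns:
            n_missing = X[var].isnull().sum()
            if n_missing == 0:
                continue
            random_sample = self.X_[var].dropna().sample(
                n_missing, random_state=self.random_state)
            random_sample.index = X[X[var].isnull()].index
            X.loc[X[var].isnull(), var] = random_sample
        return X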
