Let's begin by importing the required libraries and tools and preparing the dataset:
- Let's import pandas, the train_test_split function from scikit-learn, and RandomSampleImputer from Feature-engine:
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.missing_data_imputers import RandomSampleImputer
- Let's load the dataset:
data = pd.read_csv('creditApprovalUCI.csv')
- The random values that will be used to replace missing data should be extracted from the train set, so let's separate the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3,
    random_state=0)
First, we will run the commands line by line to understand their output. Then, we will execute them in a loop to impute several variables. In random sample imputation, we extract as many random values from a variable as it has missing values.
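Before working on the credit approval data, the idea can be sketched on a toy pandas Series. The values below are made up for illustration and are not part of the dataset; the names s, n_missing, sample, and imputed are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy variable with 3 missing values (made-up data)
s = pd.Series([10.0, np.nan, 30.0, 40.0, np.nan, 60.0, np.nan, 80.0])

# Count the missing values
n_missing = int(s.isnull().sum())

# Draw that many observed values at random (without replacement)
sample = s.dropna().sample(n_missing, random_state=0)

# Give the sample the index labels of the missing rows,
# so that pandas aligns the values on assignment
sample.index = s[s.isnull()].index

# Fill the gaps in a copy; observed values are left untouched
imputed = s.copy()
imputed.loc[imputed.isnull()] = sample

print(imputed.isnull().sum())
```

Because the replacement values are drawn from the observed values of the same variable, the imputed variable keeps roughly the same distribution as the original one.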
- Let's calculate the number of missing values in the A2 variable:
number_na = X_train['A2'].isnull().sum()
- If you print the number_na variable, you will obtain 11 as output, which is the number of missing values in A2. Thus, let's extract 11 values at random from A2 for the imputation:
random_sample_train = X_train['A2'].dropna().sample(
    number_na, random_state=0)
- We can only use one pandas Series to replace values in another pandas Series if their indexes are identical, so let's re-index the extracted random values so that they match the index of the missing values in the original dataframe:
random_sample_train.index = X_train[X_train['A2'].isnull()].index
- Now, let's replace the missing values in the original dataset with randomly extracted values:
X_train.loc[X_train['A2'].isnull(), 'A2'] = random_sample_train
- Now, let's combine steps 4 to 7 in a loop to replace the missing data in several variables in the train and test sets:
for var in ['A1', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8']:
    # extract a random sample from the train set for each dataset
    random_sample_train = X_train[var].dropna().sample(
        X_train[var].isnull().sum(), random_state=0)
    random_sample_test = X_train[var].dropna().sample(
        X_test[var].isnull().sum(), random_state=0)
    # re-index the randomly extracted samples
    random_sample_train.index = X_train[
        X_train[var].isnull()].index
    random_sample_test.index = X_test[X_test[var].isnull()].index
    # replace the NA
    X_train.loc[X_train[var].isnull(), var] = random_sample_train
    X_test.loc[X_test[var].isnull(), var] = random_sample_test
Note how we always extract values from the train set, but we calculate the number of missing values and the index using the train or test sets, respectively.
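The loop above can also be wrapped in a small reusable function. The following is a minimal sketch under stated assumptions: random_sample_impute is a hypothetical helper, not part of pandas or Feature-engine, and the toy dataframes are made up so that the example runs without the CSV file. As in the steps above, values are sampled without replacement, so the train set needs at least as many observed values as there are missing values to fill.

```python
import numpy as np
import pandas as pd


def random_sample_impute(train, test, variables, seed=0):
    """Fill NaNs in train and test with values drawn from the train set.

    Hypothetical helper for illustration; returns imputed copies.
    """
    train, test = train.copy(), test.copy()
    for var in variables:
        # donor pool: always the observed values of the train set
        observed = train[var].dropna()
        for df in (train, test):
            mask = df[var].isnull()
            if not mask.any():
                continue  # nothing to impute in this dataset
            # draw as many values as there are missing values in df
            sample = observed.sample(int(mask.sum()), random_state=seed)
            # align the sample with the rows that hold missing values
            sample.index = df.index[mask]
            df.loc[mask, var] = sample
    return train, test


# Made-up toy data with missing values in both variables
train = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, np.nan, 58.0],
    "city": ["a", "b", None, "a", "c", "b"],
})
test = pd.DataFrame({
    "age": [np.nan, 47.0, 33.0],
    "city": ["c", None, "a"],
})

train_imp, test_imp = random_sample_impute(train, test, ["age", "city"])
print(train_imp.isnull().sum().sum(), test_imp.isnull().sum().sum())
```

Working on copies leaves the original dataframes intact, which makes it easy to compare the variables before and after the imputation.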
To finish, let's impute missing values using Feature-engine. First, we need to separate the data into train and test sets, just like we did in step 3 of this recipe.
- Next, let's set up RandomSampleImputer() and fit it to the train set:
imputer = RandomSampleImputer()
imputer.fit(X_train)
By default, RandomSampleImputer() will replace the missing values in all the variables in the dataset.
We can restrict the imputation to specific variables by passing their names in a list when we set up the imputer, for example, imputer = RandomSampleImputer(variables=['A2', 'A3']).
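- Finally, let's replace the missing values in the train and test sets with the transform() method:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)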
To obtain reproducibility between code runs, we can set the random_state to a number when we initialize the RandomSampleImputer(). It will use the random_state at each run of the transform() method.