Imputing categorical features
Now that we have an understanding of the data we are working with, let's take a look at our missing values:
- To do this, we can use the isnull method available to us in pandas for DataFrames. This method returns a boolean same-sized object indicating if the values are null.
- We will then sum these to see which columns have missing data:
X.isnull().sum()
>>>>
boolean                1
city                   1
ordinal_column         0
quantitative_column    1
dtype: int64
Here, we can see that three of our columns are missing values. Our course of action will be to impute these missing values.
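To make the check above reproducible, here is a minimal sketch. The DataFrame contents are hypothetical stand-ins chosen to give the same missing-value counts as the output shown; the names null_mask, missing_counts, and columns_with_missing are illustrative and do not come from the chapter.

```python
import pandas as pd

# Hypothetical toy data: the values are illustrative, picked only so that
# three of the four columns each contain one missing entry.
X = pd.DataFrame({
    "boolean": ["yes", "no", None, "no", "no", "yes"],
    "city": ["tokyo", None, "london", "seattle", "san francisco", "tokyo"],
    "ordinal_column": ["somewhat like", "like", "somewhat like",
                       "like", "somewhat like", "dislike"],
    "quantitative_column": [1, 11, -0.5, 10, None, 20],
})

# isnull() returns a same-shaped DataFrame of True/False flags.
null_mask = X.isnull()

# Summing counts True as 1, giving the number of missing values per column.
missing_counts = null_mask.sum()

# Keep only the names of the columns that actually have gaps.
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()

print(missing_counts)
print(columns_with_missing)
```

Filtering the counts this way gives a list of column names that can be passed to whatever imputation step comes next.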
If you recall, we implemented scikit-learn's Imputer class in a previous chapter to fill in numerical data. Imputer does have a categorical option, most_frequent; however, it only works on categorical data that has been encoded as integers.
We may not always want to transform our categorical data this way, as it can change how we interpret the categorical information,...
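As a rough illustration of filling a categorical column without first encoding it as integers, the sketch below uses plain pandas to substitute the most frequent category. This is one possible approach under stated assumptions, not necessarily the method the chapter goes on to use; the city values and the names most_common and filled are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column with one missing entry.
city = pd.Series(
    ["tokyo", None, "london", "seattle", "san francisco", "tokyo"],
    name="city",
)

# value_counts() skips nulls and sorts by frequency (descending),
# so the first index label is the most common category.
most_common = city.value_counts().index[0]

# Replace the missing entries with that category; the strings stay strings.
filled = city.fillna(most_common)

print(most_common)
print(filled.tolist())
```

Because the values are never converted to integers, the filled column keeps its original labels and can still be read directly.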