Feature Engineering Made Easy

By: Sinan Ozdemir, Divya Susarla

Overview of this book

Feature engineering is the most important step in creating powerful machine learning systems. This book will take you through the entire feature-engineering journey to make your machine learning much more systematic and effective. You will start with understanding your data; often the success of your ML models depends on how you leverage different feature types, such as continuous, categorical, and more. You will learn when to include a feature, when to omit it, and why, all by understanding error analysis and the acceptability of your models. You will learn to convert a problem statement into useful new features. You will learn to deliver features driven by business needs as well as mathematical insights. You'll also learn how to use machine learning on your machines, automatically learning amazing features for your data. By the end of the book, you will become proficient in Feature Selection, Feature Learning, and Feature Optimization.

Imputing categorical features


Now that we have an understanding of the data we are working with, let's take a look at our missing values:

  • To do this, we can use the isnull method that pandas provides for DataFrames. This method returns a same-sized object of Boolean values indicating whether each value is null.
  • We will then sum these to see which columns have missing data:
X.isnull().sum()
>>>>
boolean                1
city                   1
ordinal_column         0
quantitative_column    1
dtype: int64

Here, we can see that three of our columns are missing values. Our course of action will be to impute these missing values.
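The check above depends on a DataFrame X that is not defined in this section, so here is a minimal, self-contained sketch of the same idea. The DataFrame below is a hypothetical stand-in: its values are invented for illustration, and only the column names and the number of missing entries per column are chosen to match the output shown above. The names null_mask, missing_counts, and columns_to_impute are also assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X; the values are invented for illustration.
X = pd.DataFrame({
    "boolean": ["yes", "no", None, "no", "yes", "yes"],
    "city": ["tokyo", None, "london", "seattle", "tokyo", "tokyo"],
    "ordinal_column": ["like", "dislike", "like", "somewhat like", "like", "dislike"],
    "quantitative_column": [1.0, 11.0, -0.5, 10.0, np.nan, 20.0],
})

# isnull() returns a DataFrame of the same shape holding True where a value is missing.
null_mask = X.isnull()

# Summing a Boolean column counts its True entries, giving missing values per column.
missing_counts = null_mask.sum()

# Keep only the names of columns that have at least one missing value.
columns_to_impute = missing_counts[missing_counts > 0].index.tolist()

print(missing_counts)
print(columns_to_impute)
```

Filtering the counts this way gives a list of column names that can be handed to whatever imputation step comes next, rather than reading the columns off the printed output by eye.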

If you recall, we used scikit-learn's Imputer class in a previous chapter to fill in numerical data. Imputer does have a categorical option, most_frequent; however, it only works on categorical data that has been encoded as integers.

We may not always want to transform our categorical data this way, as it can change how we interpret the categorical information,...
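One way to impute a string-valued categorical column without first encoding it as integers is to fill the nulls with the column's most common category directly in pandas. The following is a minimal sketch under that assumption and is not necessarily the approach the book goes on to take. The city series and the names most_frequent and city_imputed are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column with one missing value.
city = pd.Series(["tokyo", None, "london", "seattle", "tokyo"], name="city")

# value_counts() skips nulls by default and sorts categories by descending frequency,
# so the first index entry is the most common category.
most_frequent = city.value_counts().index[0]

# fillna() returns a new Series with nulls replaced; the original is left unchanged.
city_imputed = city.fillna(most_frequent)

print(most_frequent)
print(city_imputed.tolist())
```

Because the categories stay as strings, the column reads the same before and after imputation. When two categories tie for most frequent, this sketch does not define which one is chosen, so a real implementation should decide how to break ties.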