Feature Engineering Made Easy

By: Sinan Ozdemir, Divya Susarla

Overview of this book

Feature engineering is the most important step in creating powerful machine learning systems. This book will take you through the entire feature-engineering journey to make your machine learning much more systematic and effective. You will start with understanding your data; often, the success of your ML models depends on how you leverage different feature types, such as continuous, categorical, and more. You will learn when to include a feature, when to omit it, and why, all by understanding error analysis and the acceptability of your models. You will learn to convert a problem statement into useful new features, and to deliver features driven by business needs as well as mathematical insights. You'll also learn how to apply machine learning to your data to learn useful features automatically. By the end of the book, you will be proficient in Feature Selection, Feature Learning, and Feature Optimization.

Title Page

Packt Upsell

Contributors

Preface

Free Chapter

Introduction to Feature Engineering

Motivating example – AI-powered communications

Why feature engineering matters

What is feature engineering?

Evaluation of machine learning algorithms and feature engineering procedures

Feature understanding – what’s in my dataset?

Feature improvement – cleaning datasets

Feature selection – say no to bad attributes

Feature construction – can we build it?

Feature transformation – enter math-man

Feature learning – using AI to better our AI

Summary

Feature Understanding – What's in My Dataset?

The structure, or lack thereof, of data

An example of unstructured data – server logs

Quantitative versus qualitative data

The four levels of data

Recap of the levels of data

Summary

Feature Improvement - Cleaning Datasets

Identifying missing values in data

Dealing with missing values in a dataset

Standardization and normalization

Summary

Feature Construction

Examining our dataset

Imputing categorical features

Encoding categorical variables

Extending numerical features

Text-specific feature construction

Summary

Feature Selection

Achieving better performance in feature engineering

Creating a baseline machine learning pipeline

The types of feature selection

Choosing the right feature selection method

Summary

Feature Transformations

Dimension reduction – feature transformations versus feature selection versus feature construction

Principal Component Analysis

Scikit-learn's PCA

How centering and scaling data affects PCA

A deeper look into the principal components

Linear Discriminant Analysis

LDA versus PCA – iris dataset

Summary

Feature Learning

Parametric assumptions of data

Restricted Boltzmann Machines

The BernoulliRBM

Extracting RBM components from MNIST

Using RBMs in a machine learning pipeline

Learning text features – word vectorizations

Summary

Case Studies

Case study 1 - facial recognition

Case study 2 - predicting topics of hotel reviews data

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Feature construction – can we build it?

While previous chapters focused heavily on removing features that were not helping our machine learning pipelines, this chapter will look at techniques for creating brand new features and placing them correctly within our dataset. These new features will ideally hold new information and generate new patterns that ML pipelines can exploit to increase performance.

These created features can come from many places. Oftentimes, we will create new features out of the existing features given to us, by applying transformations to those features and placing the resulting vectors alongside their original counterparts. We will also look at adding new features from separate, third-party systems. As an example, if we are working with data attempting to cluster people based on shopping behaviors, then we might benefit from adding in census data that is separate from the corporation and its purchasing data. However, this will present a few problems:

- If the census is aware of 1,700 John Does and the corporation only knows 13, how do we know which of the 1,700 people match up to the 13? This is called entity matching.
- The census data would be quite large, and entity matching would take a very long time.

These problems and more make this a fairly difficult procedure, but the combined data often creates a very dense and data-rich environment.
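To make the entity-matching problem concrete, here is a minimal sketch in plain Python. It is not the book's method. The record layouts, the field names (`customer_id`, `census_id`, `name`, `zip`), and the helper functions are all hypothetical, and matching on a normalized name alone is an assumption chosen for illustration. Building a lookup index once avoids comparing every corporate record against every census record, which is one simple way to keep matching time manageable.

```python
from collections import defaultdict


def normalize_name(name):
    """Lowercase a name and collapse repeated whitespace (hypothetical rule)."""
    return " ".join(name.lower().split())


def match_entities(company_rows, census_rows):
    """Return, for each company record, the census records sharing its name."""
    # Index the (large) census data once so each lookup is a dictionary access.
    index = defaultdict(list)
    for row in census_rows:
        index[normalize_name(row["name"])].append(row)

    matches = {}
    for row in company_rows:
        key = normalize_name(row["name"])
        matches[row["customer_id"]] = index.get(key, [])
    return matches


# Made-up records, for illustration only.
company = [
    {"customer_id": 1, "name": "John Doe"},
    {"customer_id": 2, "name": "Ada Smith"},
]
census = [
    {"census_id": "a", "name": "john  doe", "zip": "94110"},
    {"census_id": "b", "name": "John Doe", "zip": "10001"},
    {"census_id": "c", "name": "Ada Smith", "zip": "60601"},
]

candidates = match_entities(company, census)

# Customers with more than one candidate cannot be matched on name alone.
ambiguous = [cid for cid, rows in candidates.items() if len(rows) > 1]
```

In this toy data, customer 1 has two census candidates, so the name alone does not resolve the match; a real pipeline would need additional attributes (an address, for example) to break the tie.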

In this chapter, we will take some time to talk about the manual creation of features from highly unstructured data. Two big examples are text and images. These pieces of data by themselves are incomprehensible to machine learning and artificial intelligence pipelines, so it is up to us to manually create features that represent the images or pieces of text. As a simple example, imagine that we are building the basics of a self-driving car. To start, we want a model that can take in an image of what the car sees in front of it and decide whether or not it should stop. The raw image is not good enough, because a machine learning algorithm would have no idea what to do with it. We have to manually construct features out of it. Given this raw image, we can split it up in a few ways:

- We could consider the color intensity of each pixel and treat each pixel as an attribute:
  - For example, if the camera of the car produces images of 2,048 x 1,536 pixels, we would have 3,145,728 columns
- We could treat each row of pixels as an attribute, with the average color of the row as its value:
  - In this case, there would be only 1,536 columns
- We could project this image into a space where features represent objects within the image. This is the hardest of the three and would look something like this:

Stop sign	Cat	Sky	Road	Patches of grass	Submarine
1	0	1	1	4	0

Here, each feature is an object that may or may not be within the image, and the value represents the number of times that object appears in the image. If a model were given this information, it would have a fairly good reason to decide to stop!
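The three representations above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the book's code: the tiny 2 x 3 grayscale image, the function names, and the list of detected objects are all made up. The object labels are supplied by hand because producing them would require an object detector, which is outside the scope of this example.

```python
from collections import Counter

# A tiny made-up grayscale "image": 2 rows of 3 pixel intensities each.
image = [
    [0, 50, 100],
    [10, 60, 110],
]


def pixel_features(img):
    """One feature per pixel: flatten the image row by row."""
    return [value for row in img for value in row]


def row_mean_features(img):
    """One feature per row of pixels: the average intensity of that row."""
    return [sum(row) / len(row) for row in img]


def object_count_features(detected_labels, vocabulary):
    """One feature per known object: how many times it was detected."""
    counts = Counter(detected_labels)
    return [counts[obj] for obj in vocabulary]


# Column count for the per-pixel representation of a 2,048 x 1,536 image.
n_pixel_columns = 2048 * 1536

# Object vocabulary taken from the table above; the detections are
# hypothetical and written by hand to reproduce that table's row.
vocabulary = ["Stop sign", "Cat", "Sky", "Road", "Patches of grass", "Submarine"]
detected = ["Stop sign", "Sky", "Road"] + ["Patches of grass"] * 4

per_pixel = pixel_features(image)
per_row = row_mean_features(image)
per_object = object_count_features(detected, vocabulary)
```

The per-pixel vector grows with the image's area and the per-row vector only with its height, while the object-count vector has a fixed length set by the vocabulary, regardless of image size.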
