Book Image

Feature Engineering Made Easy

By : Sinan Ozdemir, Divya Susarla
Book Image

Feature Engineering Made Easy

By: Sinan Ozdemir, Divya Susarla

Overview of this book

Feature engineering is the most important step in creating powerful machine learning systems. This book will take you through the entire feature-engineering journey to make your machine learning much more systematic and effective. You will start with understanding your data—often the success of your ML models depends on how you leverage different feature types, such as continuous, categorical, and more, You will learn when to include a feature, when to omit it, and why, all by understanding error analysis and the acceptability of your models. You will learn to convert a problem statement into useful new features. You will learn to deliver features driven by business needs as well as mathematical insights. You'll also learn how to use machine learning on your machines, automatically learning amazing features for your data. By the end of the book, you will become proficient in Feature Selection, Feature Learning, and Feature Optimization.
Table of Contents (14 chapters)
Title Page
Copyright and Credits
Packt Upsell
Contributors
Preface

Feature construction – can we build it?


While in previous chapters we focused heavily on removing features that were not helping us with our machine learning pipelines, this chapter will look at techniques in creating brand new features and placing them correctly within our dataset. These new features will ideally hold new information and generate new patterns that ML pipelines will be able to exploit and use to increase performance.

These created features can come from many places. Oftentimes, we will create new features out of existing features given to us. We can create new features by applying transformations to existing features and placing the resulting vectors alongside their previous counterparts. We will also look at adding new features from separate party systems. As an example, if we are working with data attempting to cluster people based on shopping behaviors, then we might benefit from adding in census data that is separate from the corporation and their purchasing data. However, this will present a few problems:

  • If the census is aware of 1,700 Jon does and the corporation only knows 13, how do we know which of the 1,700 people match up to the 13? This is called entity matching
  • The census data would be quite large and entity matching would take a very long time

These problems and more make for a fairly difficult procedure but oftentimes create a very dense and data-rich environment.

In this chapter, we will take some time to talk about the manual creation of features through highly unstructured data. Two big examples are text and images. These pieces of data by themselves are incomprehensible to machine learning and artificial intelligence pipelines, so it is up to us to manually create features that represent the images/pieces of text. As a simple example, imagine that we are making the basics of a self-driving car and to start, we want to make a model that can take in an image of what the car is seeing in front of it and decide whether or not it should stop. The raw image is not good enough because a machine learning algorithm would have no idea what to do with it. We have to manually construct features out of it. Given this raw image, we can split it up in a few ways:

  • We could consider the color intensity of each pixel and consider each pixel an attribute:
    • For example, if the camera of the car produces images of 2,048 x 1,536 pixels, we would have 3,145,728 columns
  • We could consider each row of pixels as an attribute and the average color of each row being the value:
    • In this case, there would only be 1,536 rows
  • We could project this image into space where features represent objects within the image. This is the hardest of the three and would look something like this:

Stop sign

Cat

Sky

Road

Patches of grass

Submarine

1

0

1

1

4

0

 

Where each feature is an object that may or may not be within the image and the value represents the number of times that object appears in the image. If a model were given this information, it would be a fairly good idea to stop!