Feature Engineering Made Easy

By: Sinan Ozdemir, Divya Susarla

Overview of this book

Feature engineering is the most important step in creating powerful machine learning systems. This book will take you through the entire feature-engineering journey to make your machine learning much more systematic and effective. You will start with understanding your data: often the success of your ML models depends on how you leverage different feature types, such as continuous, categorical, and more. You will learn when to include a feature, when to omit it, and why, all by understanding error analysis and the acceptability of your models. You will learn to convert a problem statement into useful new features. You will learn to deliver features driven by business needs as well as mathematical insights. You'll also learn how to use machine learning on your machines, automatically learning amazing features for your data. By the end of the book, you will become proficient in Feature Selection, Feature Learning, and Feature Optimization.

What is feature engineering?

Finally, the title of the book.

Yes, folks, feature engineering will be the topic of this book. We will be focusing on the process of cleaning and organizing data for the purposes of machine learning pipelines. We will also go beyond these concepts and look at more complex transformations of data in the forms of mathematical formulas and neural understanding, but we are getting ahead of ourselves. Let’s start at a high level.

Note

Feature engineering is the process of transforming data into features that better represent the underlying problem, resulting in improved machine learning performance.

To break this definition down a bit further, let's look at precisely what feature engineering entails:

Process of transforming data: Note that we are not specifying raw data, unfiltered data, and so on. Feature engineering can be applied to data at any stage. Oftentimes, we will be applying feature engineering techniques to data that is already processed in the eyes of the data distributor. It is also important to mention that the data that we will be working with will usually be in a tabular format. The data will be organized into rows (observations) and columns (attributes). There will be times when we will start with data at its most raw form, such as in the examples of the server logs mentioned previously, but for the most part, we will deal with data already somewhat cleaned and organized.
Features: The word features will obviously be used a lot in this book. At its most basic level, a feature is an attribute of data that is meaningful to the machine learning process. Many times we will be diagnosing tabular data and identifying which columns are features and which are merely attributes.
Better represent the underlying problem: The data that we will be working with will always serve to represent a specific problem in a specific domain. It is important to ensure that while we are performing these techniques, we do not lose sight of the bigger picture. We want to transform data so that it better represents the bigger problem at hand.
Resulting in improved machine learning performance: Feature engineering exists as a single part of the process of data science. As we saw, it is an important and oftentimes undervalued part. The eventual goal of feature engineering is to obtain data that our learning algorithms will be able to extract patterns from and use in order to obtain better results. We will talk in depth about machine learning metrics and results later on in this book, but for now, know that we perform feature engineering not only to obtain cleaner data, but to eventually use that data in our machine learning pipelines.

We know what you’re thinking: why should I spend my time reading about a process that people say they do not enjoy doing? We believe that many people do not enjoy the process of feature engineering because they often do not have the benefit of understanding the results of the work that they do.

Most companies employ both data engineers and machine learning engineers. The data engineers are primarily concerned with the preparation and transformation of the data, while the machine learning engineers usually have a working knowledge of learning algorithms and how to mine patterns from already cleaned data.

Their jobs are often separate but intertwined and iterative. The data engineers will present a dataset to the machine learning engineers, who will claim they cannot get good results from it and ask the data engineers to try to transform the data further, and so on, and so forth. This process can be not only monotonous and repetitive; it can also hurt the bigger picture.

Without having knowledge of both feature and machine learning engineering, the entire process might not be as effective as it could be. That’s where this book comes in. We will be talking about feature engineering and how it relates directly to machine learning. It will be a results-driven approach where we will deem techniques as helpful if, and only if, they can lead to a boost in performance. It is worth now diving a bit into the basics of data, the structure of data, and machine learning, to ensure standardization of terminology.

Understanding the basics of data and machine learning

When we talk about data, we are generally dealing with tabular data, that is, data that is organized into rows and columns. Think of this as data that could be opened in a spreadsheet application such as Microsoft Excel. Each row of data, otherwise known as an observation, represents a single instance/example of a problem. If our data belongs to the domain of day-trading in the stock market, an observation might represent an hour’s worth of changes in the overall market and price.

As another example, when dealing with the domain of network security, an observation could represent a possible attack or a packet of data sent over a wireless system.

The following shows sample tabular data in the domain of cyber security and more specifically, network intrusion:

DateTime	Protocol	Urgent	Malicious
June 2nd, 2018	TCP	FALSE	TRUE
June 2nd, 2018	HTTP	TRUE	TRUE
June 2nd, 2018	HTTP	TRUE	FALSE
June 3rd, 2018	HTTP	FALSE	TRUE

We see that each row or observation consists of a network connection and that we have four attributes of the observation: DateTime, Protocol, Urgent, and Malicious. While we will not dive into these specific attributes, we will simply notice the structure of the data given to us in a tabular format.
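As a rough sketch of how this rows-and-columns structure looks in code, the sample table above could be held in a pandas DataFrame. The variable names `connections` and `first_observation` are our own, chosen only for illustration; the values are copied from the table.

```python
import pandas as pd

# Hypothetical variable name; the values are copied from the sample table above
connections = pd.DataFrame({
    'DateTime': ['June 2nd, 2018', 'June 2nd, 2018', 'June 2nd, 2018', 'June 3rd, 2018'],
    'Protocol': ['TCP', 'HTTP', 'HTTP', 'HTTP'],
    'Urgent': [False, True, True, False],
    'Malicious': [True, True, False, True],
})

# shape is (number of rows/observations, number of columns/attributes)
n_observations, n_attributes = connections.shape
print(n_observations, n_attributes)

# The column names are the attributes of each observation
print(list(connections.columns))

# A single row is a single observation (here, one network connection)
first_observation = connections.iloc[0]
print(first_observation)
```

Each row of the DataFrame is one observation and each column is one attribute, which is the structure assumed throughout the rest of this discussion.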

Because we will, for the most part, consider our data to be tabular, we can also look at specific instances where the matrix of data has only one column/attribute. For example, suppose we are building a piece of software that takes in a single image of a room and outputs whether or not there is a human in that room. The input data might be represented as a matrix with a single column, where that column is simply a URL to a photo of a room and nothing else.

For example, consider the following table, which has only a single column titled Photo URL. The values in the table are URLs of photos that are relevant to the data scientist (these are fake, purely for example, and do not lead anywhere):

Photo URL

http://photo-storage.io/room/1

http://photo-storage.io/room/2

http://photo-storage.io/room/3

http://photo-storage.io/room/4

The data that is input into the system might only be a single column, as in this case. If we want to create a system that can analyze images, the input might simply be a URL to the image in question. It would be up to us as data scientists to engineer features from the URL.

As data scientists, we must be ready to ingest and handle data that might be large, small, wide, narrow (in terms of attributes), or sparse in completion (there might be missing values), and be ready to utilize this data for the purposes of machine learning. Now’s a good time to talk more about that. Machine learning algorithms belong to a class of algorithms that are defined by their ability to extract and exploit patterns in data to accomplish a task based on historical training data. Vague, right? Machine learning can handle many types of tasks, and therefore we will leave the definition of machine learning as is and dive a bit deeper.

We generally separate machine learning into two main types, supervised and unsupervised learning. Each type of machine learning algorithm can benefit from feature engineering, and therefore it is important that we understand each type.

Supervised learning

Oftentimes, we hear about feature engineering in the specific context of supervised learning, otherwise known as predictive analytics. Supervised learning algorithms specifically deal with the task of predicting a value, usually one of the attributes of the data, using the other attributes of the data. Take, for example, the dataset representing the network intrusion:

DateTime	Protocol	Urgent	Malicious
June 2nd, 2018	TCP	FALSE	TRUE
June 2nd, 2018	HTTP	TRUE	TRUE
June 2nd, 2018	HTTP	TRUE	FALSE
June 3rd, 2018	HTTP	FALSE	TRUE

This is the same dataset as before, but let's dissect it further in the context of predictive analytics.

Notice that we have four attributes of this dataset: DateTime, Protocol, Urgent, and Malicious. Suppose now that the malicious attribute contains values that represent whether or not the observation was a malicious intrusion attempt. So in our very small dataset of four network connections, the first, second, and fourth connections were malicious attempts to intrude into a network.

Suppose further that given this dataset, our task is to be able to take in three of the attributes (datetime, protocol, and urgent) and be able to accurately predict the value of malicious. In layman’s terms, we want a system that can map the values of datetime, protocol, and urgent to the values in malicious. This is exactly how a supervised learning problem is set up:

import pandas as pd

# Features: the attributes we will use to make predictions
network_features = pd.DataFrame({
    'datetime': ['6/2/2018', '6/2/2018', '6/2/2018', '6/3/2018'],
    'protocol': ['tcp', 'http', 'http', 'http'],
    'urgent': [False, True, True, False],
})
# Response: the attribute we are trying to predict (malicious)
network_response = pd.Series([True, True, False, True])

print(network_features)
#    datetime protocol  urgent
# 0  6/2/2018      tcp   False
# 1  6/2/2018     http    True
# 2  6/2/2018     http    True
# 3  6/3/2018     http   False

print(network_response)
# 0     True
# 1     True
# 2    False
# 3     True
# dtype: bool

When we are working with supervised learning, we generally call the attribute of the dataset that we are attempting to predict the response (usually there is only one, but that is not necessary). The remaining attributes of the dataset are then called the features.
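In practice, the features and the response often arrive together in a single table and have to be separated before any learning takes place. The following is a minimal sketch of that split, assuming the network intrusion data sits in one DataFrame; the names `network`, `response_column`, `features`, and `response` are our own and are used only for illustration.

```python
import pandas as pd

# The full dataset: three features plus the response, all in one table
network = pd.DataFrame({
    'datetime': ['6/2/2018', '6/2/2018', '6/2/2018', '6/3/2018'],
    'protocol': ['tcp', 'http', 'http', 'http'],
    'urgent': [False, True, True, False],
    'malicious': [True, True, False, True],
})

# The attribute we are attempting to predict
response_column = 'malicious'

# Everything except the response is treated as a feature
features = network.drop(columns=[response_column])
response = network[response_column]

print(list(features.columns))
print(response.tolist())
```

Keeping the response out of the features matters: if the column we are predicting were left in the inputs, a model could simply copy it rather than learn anything from the other attributes.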

Supervised learning can also be considered the class of algorithms attempting to exploit the structure in data. By this, we mean that the machine learning algorithms try to extract patterns in usually very nice and neat data. As discussed earlier, we should not always expect data to come in tidy; this is where feature engineering comes in.

But if we are not predicting something, what good is machine learning, you may ask? I’m glad you did. Before machine learning can exploit the structure of data, sometimes we have to alter or even create structure. That’s where unsupervised learning becomes a valuable tool.

Unsupervised learning

Supervised learning is all about making predictions. We utilize features of the data and use them to make informative predictions about the response of the data. If we aren’t making predictions by exploiting structure, we are attempting to extract structure from our data. We generally do so by applying mathematical transformations to numerical matrix representations of data or iterative procedures to obtain new sets of features.

This concept can be a bit more difficult to grasp than supervised learning, and so I will present a motivating example to help elucidate how this all works.

Unsupervised learning example – marketing segments

Suppose we are given a large (one million rows) dataset where each row/observation is a single person with basic demographic information (age, gender, and so on) as well as the number of items purchased, which represents how many items this person has bought from a particular store:

Age	Gender	Number of items purchased
25	F	1
28	F	23
61	F	3
54	M	17
51	M	8
47	F	3
27	M	22
31	F	14

This is a sample of our marketing dataset where each row represents a single customer with three basic attributes about each person. Our goal will be to segment this dataset into types or clusters of people so that the company performing the analysis can understand the customer profiles much better.

Now, of course, we’ve only shown 8 out of one million rows, and the full dataset can be daunting. We can certainly perform basic descriptive statistics on this dataset and get averages, standard deviations, and so on of our numerical columns; however, what if we wished to segment these one million people into different types so that the marketing department can have a much better sense of the types of people who shop and create more appropriate advertisements for each segment?
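The basic descriptive statistics mentioned above take only a few lines to compute. Here is a small sketch using just the eight sample rows shown earlier; the DataFrame name `customers` and the column names are our own, picked for illustration.

```python
import pandas as pd

# The eight sample rows from the marketing table (hypothetical column names)
customers = pd.DataFrame({
    'age': [25, 28, 61, 54, 51, 47, 27, 31],
    'gender': ['F', 'F', 'F', 'M', 'M', 'F', 'M', 'F'],
    'items_purchased': [1, 23, 3, 17, 8, 3, 22, 14],
})

# Mean and standard deviation of the numerical columns only
numeric_summary = customers[['age', 'items_purchased']].agg(['mean', 'std'])
print(numeric_summary)
```

Summaries like these describe the dataset as a whole, but they say nothing about which customers resemble one another, which is the question segmentation tries to answer.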

Each type of customer would exhibit particular qualities that make that segment unique. For example, the marketing department may find that 20% of their customers fall into a category they like to call young and wealthy: customers who are generally younger and purchase several items.

This type of analysis and the creation of these types can fall under a specific type of unsupervised learning called clustering. We will discuss this machine learning algorithm in further detail later on in this book, but for now, clustering will create a new feature that separates out the people into distinct types or clusters:

Age	Gender	Number of items purchased	Cluster
25	F	1	6
28	F	23	1
61	F	3	3
54	M	17	2
51	M	8	3
47	F	3	8
27	M	22	5
31	F	14	1

This shows our customer dataset after a clustering algorithm has been applied. Note the new column at the end called cluster that represents the types of people that the algorithm has identified. The idea is that the people who belong to similar clusters behave similarly with regard to the data (have similar ages, genders, purchase behaviors). Perhaps cluster six might be renamed as young buyers.
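To make the idea concrete, here is a minimal sketch of how such a cluster column could be produced. The text does not say which clustering algorithm generated the table above, so this example assumes k-means from scikit-learn; the 0/1 encoding of gender, the scaling step, and the choice of three clusters are also our own assumptions. The labels it produces will therefore not match the cluster numbers in the table.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# The eight sample rows from the marketing table (hypothetical column names)
customers = pd.DataFrame({
    'age': [25, 28, 61, 54, 51, 47, 27, 31],
    'gender': ['F', 'F', 'F', 'M', 'M', 'F', 'M', 'F'],
    'items_purchased': [1, 23, 3, 17, 8, 3, 22, 14],
})

# k-means works on numbers, so encode gender as 0/1 (an assumed encoding)
numeric = customers.assign(gender=customers['gender'].map({'F': 0, 'M': 1}))

# Put all columns on a comparable scale so no single column dominates distances
scaled = StandardScaler().fit_transform(numeric)

# n_clusters=3 is an arbitrary choice for this tiny sample
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)

# The cluster label becomes a brand new feature of the dataset
customers['cluster'] = kmeans.fit_predict(scaled)
print(customers)
```

The cluster numbers themselves carry no meaning; they are simply identifiers, and it is up to the analyst to inspect each group and give it a descriptive name.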

This example of clustering shows us why sometimes we aren’t concerned with predicting anything, but instead wish to understand our data on a deeper level by adding new and interesting features, or even removing irrelevant features.

Note

Note that we are referring to every column as a feature because there is no response in unsupervised learning since there is no prediction occurring.

It’s all starting to make sense now, isn’t it? These features that we talk about repeatedly are what this book is primarily concerned with. Feature engineering involves the understanding and transforming of features in relation to both unsupervised and supervised learning.
