Feature Engineering Made Easy

By: Sinan Ozdemir, Divya Susarla

Overview of this book

Feature engineering is the most important step in creating powerful machine learning systems. This book will take you through the entire feature-engineering journey to make your machine learning much more systematic and effective. You will start with understanding your data: often the success of your ML models depends on how you leverage different feature types, such as continuous, categorical, and more. You will learn when to include a feature, when to omit it, and why, all by understanding error analysis and the acceptability of your models. You will learn to convert a problem statement into useful new features. You will learn to deliver features driven by business needs as well as mathematical insights. You'll also learn how to use machine learning on your machines, automatically learning amazing features for your data. By the end of the book, you will become proficient in Feature Selection, Feature Learning, and Feature Optimization.

This book will cover the topic of feature engineering. A huge part of the data science and machine learning pipeline, feature engineering includes the ability to identify, clean, construct, and discover new characteristics of data for the purpose of interpretation and predictive analysis.

In this book, we will be covering the entire process of feature engineering, from inspection to visualization, transformation, and beyond. We will be using both basic and advanced mathematical measures to transform our data into a form that's much more digestible by machines and machine learning pipelines.

By discovering and transforming, we, as data scientists, will be able to gain a whole new perspective on our data, enhancing not only our algorithms but also our insights.

Who this book is for

This book is for people who are looking to understand and utilize the practices of feature engineering for machine learning and data exploration.

The reader should be fairly well acquainted with machine learning and coding in Python to feel comfortable diving into new topics with a step-by-step explanation of the basics.

What this book covers

Chapter 1, Introduction to Feature Engineering, is an introduction to the basic terminology of feature engineering and a quick look at the types of problems we will be solving throughout this book.

Chapter 2, Feature Understanding – What's in My Dataset?, looks at the types of data we will encounter in the wild and how to deal with each one separately or together.

Chapter 3, Feature Improvement – Cleaning Datasets, explains various ways to fill in missing data and how different techniques lead to different structural changes in data that may lead to poorer machine learning performance.

Chapter 4, Feature Construction, is a look at how we can create new features based on what was already given to us in an effort to inflate the structure of data.

Chapter 5, Feature Selection, shows quantitative measures to decide which features are worthy of being kept in our data pipeline.

Chapter 6, Feature Transformations, uses advanced linear algebra and mathematical techniques to impose a rigid structure on data to enhance the performance of our pipelines.

Chapter 7, Feature Learning, covers the use of state-of-the-art machine learning and artificial intelligence learning algorithms to discover latent features of our data that few humans could fathom.

Chapter 8, Case Studies, is an array of case studies shown in order to solidify the ideas of feature engineering.

To get the most out of this book

Here is what we require for this book:

  1. This book uses Python to complete all of its code examples. A machine (Linux/Mac/Windows is OK) with access to a Unix-style terminal and Python 2.7 installed is required.
  2. Installing the Anaconda distribution is also recommended as it comes with most of the packages used in the examples.

Download the example code files

You can download the example code files for this book from your account at If you purchased this book elsewhere, you can visit and register to have the files emailed directly to you.

You can download the code files by following these steps:

  1. Log in or register at
  2. Select the SUPPORT tab.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR/7-Zip for Windows
  • Zipeg/iZip/UnRarX for Mac
  • 7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at We also have other code bundles from our rich catalog of books and videos available at Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here:

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Suppose further that given this dataset, our task is to be able to take in three of the attributes (datetime, protocol, and urgent) and to be able to accurately predict the value of malicious. In layman's terms, we want a system that can map the values of datetime, protocol, and urgent to the values in malicious."

A block of code is set as follows:

import pandas as pd

# Three attributes describing each network event
Network_features = pd.DataFrame({
    'datetime': ['6/2/2018', '6/2/2018', '6/2/2018', '6/3/2018'],
    'protocol': ['tcp', 'http', 'http', 'http'],
    'urgent': [False, True, True, False]})
# The response: whether each event was malicious
Network_response = pd.Series([True, True, False, True])

print(Network_features)
#    datetime protocol  urgent
# 0  6/2/2018      tcp   False
# 1  6/2/2018     http    True
# 2  6/2/2018     http    True
# 3  6/3/2018     http   False
print(Network_response)
# 0     True
# 1     True
# 2    False
# 3     True
# dtype: bool
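
The mapping described in the quoted example needs numeric inputs before most models can learn it. As a rough sketch only (this is not code from the book), the snippet below shows one way the three attributes from the block above might be turned into numbers with pandas. The names `features`, `response`, and `numeric`, and the choice of day-of-month as the date feature, are assumptions made purely for illustration.

```python
import pandas as pd

# Same toy data as the code block above (lowercase names are our own choice)
features = pd.DataFrame({
    'datetime': ['6/2/2018', '6/2/2018', '6/2/2018', '6/3/2018'],
    'protocol': ['tcp', 'http', 'http', 'http'],
    'urgent': [False, True, True, False],
})
response = pd.Series([True, True, False, True], name='malicious')

# Parse the date strings; keeping only the day of the month is an
# illustrative assumption, not a recommendation
parsed = pd.to_datetime(features['datetime'], format='%m/%d/%Y')

numeric = pd.DataFrame({
    'day': parsed.dt.day,
    'urgent': features['urgent'].astype(int),  # booleans become 0/1
})

# One-hot encode the categorical protocol column into 0/1 indicator columns
protocol_dummies = pd.get_dummies(features['protocol'], prefix='protocol').astype(int)
numeric = pd.concat([numeric, protocol_dummies], axis=1)

print(numeric)
```

Every column of `numeric` is now an integer, so the table could be handed to a typical learning algorithm together with `response`. Which encodings actually help is the kind of question the later chapters address.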

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

times_pregnant                  0.221898
plasma_glucose_concentration    0.466581
diastolic_blood_pressure        0.065068
triceps_thickness               0.074752
serum_insulin                   0.130548
bmi                             0.292695
pedigree_function               0.173844
age                             0.238356
onset_diabetes                  1.000000
Name: onset_diabetes, dtype: float64

Bold: Indicates a new term, an important word, or words that you see onscreen. 


Warnings or important notes appear like this.


Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit


Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit