Feature Engineering Made Easy

By: Sinan Ozdemir, Divya Susarla

Overview of this book

Feature engineering is the most important step in creating powerful machine learning systems. This book will take you through the entire feature-engineering journey to make your machine learning much more systematic and effective. You will start by understanding your data, since the success of your ML models often depends on how you leverage different feature types, such as continuous and categorical. You will learn when to include a feature, when to omit it, and why, all by understanding error analysis and the acceptability of your models. You will learn to convert a problem statement into useful new features, and to deliver features driven by business needs as well as mathematical insights. You'll also learn how to use machine learning itself to learn useful features from your data automatically. By the end of the book, you will be proficient in feature selection, feature learning, and feature optimization.

Creating a baseline machine learning pipeline


In previous chapters, we gave you a single machine learning model to use throughout the chapter. In this chapter, we will first find the best machine learning model for our needs and then enhance that model with feature selection. We will begin by importing four different machine learning models:

  • Logistic Regression
  • K-Nearest Neighbors
  • Decision Tree
  • Random Forest

The code for importing the learning models is given as follows:

# Import four machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
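
Because all four classes follow the same scikit-learn estimator interface, they can be compared with a single loop. The following is a minimal sketch of that idea, not the book's own baseline code: the synthetic dataset from make_classification, the models dictionary, and the baseline_scores name are all assumptions made for illustration, and plain cross-validation stands in for any tuning the chapter performs.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the chapter works with its own dataset)
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Hypothetical mapping of display name -> untuned estimator
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model
baseline_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in baseline_scores.items():
    print(f"{name}: {score:.3f}")
```

The scores from a run like this only indicate how each untuned model handles the raw features; they are a starting point for comparison, not a final ranking.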

Once we have imported these modules, we will run each model through our get_best_model_and_accuracy function to get a baseline on how each one handles the raw data. To do so, we first have to establish some variables. We will use the following code...