Python Feature Engineering Cookbook

By: Soledad Galli

Overview of this book

Feature engineering is invaluable for developing and enriching your machine learning models. In this cookbook, you will work with the best tools to streamline your feature engineering pipelines and techniques, and to simplify and improve the quality of your code. Using Python libraries such as pandas, scikit-learn, Featuretools, and Feature-engine, you'll learn how to work with both continuous and discrete datasets and be able to transform features from unstructured datasets. You will develop the skills necessary to select the best features as well as the most suitable extraction techniques. This book will cover Python recipes that will help you automate feature engineering to simplify complex processes. You'll also get to grips with different feature engineering strategies, such as the Box-Cox transform, power transform, and log transform, across machine learning, reinforcement learning, and natural language processing (NLP) domains. By the end of this book, you'll have discovered tips and practical solutions to all of your feature engineering problems.

Technical requirements

In this chapter, we will use the following Python libraries: pandas, NumPy, and scikit-learn. I recommend installing the free Anaconda Python distribution (https://www.anaconda.com/distribution/), which contains all of these packages.

For details on how to install the Python Anaconda distribution, visit the Technical requirements section in Chapter 1, Foreseeing Variable Problems When Building ML Models.

We will also use Feature-engine, an open source Python library that I created, which can be installed using pip:

pip install feature-engine

To learn more about Feature-engine, visit its online documentation and GitHub repository.
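
As a quick smoke test after installing Feature-engine, you can import the library and run one of its transformers on a toy dataframe. The following is a minimal sketch, assuming a recent release where the imputers live in the feature_engine.imputation module (the import path has changed between versions), and using made-up column names for illustration:

import numpy as np
import pandas as pd
from feature_engine.imputation import MeanMedianImputer  # import path assumes Feature-engine >= 1.0

# toy dataframe with missing values in two numerical columns
df = pd.DataFrame({'A2': [20.5, np.nan, 33.0], 'A14': [100.0, 250.0, np.nan]})

# impute the missing values with the median of each variable
imputer = MeanMedianImputer(imputation_method='median', variables=['A2', 'A14'])
df_t = imputer.fit_transform(df)
print(df_t)

If the import fails, check your installed Feature-engine version, as older releases organized the imputers under a different module.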

Check that you have installed the right versions of the numerical Python libraries, which are listed in the requirements.txt file in the accompanying GitHub repository: https://github.com/PacktPublishing/Python-Feature-Engineering-Cookbook.
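
To compare your installed versions against that file, you can print each library's version string. A minimal sketch follows; it assumes each package exposes a __version__ attribute, which pandas, NumPy, scikit-learn, and recent Feature-engine releases do:

import pandas as pd
import numpy as np
import sklearn
import feature_engine

# print the installed versions to compare against requirements.txt
print('pandas:', pd.__version__)
print('numpy:', np.__version__)
print('scikit-learn:', sklearn.__version__)
print('feature-engine:', feature_engine.__version__)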

We will also use the Credit Approval Data Set, which is available in the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/credit+approval).

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

To prepare the dataset, follow these steps:

  1. Visit http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/.
  2. Click on crx.data to download the data.
  3. Save crx.data to the folder where you will run the following commands (or see the sketch after this list to download the file programmatically).
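
If you prefer to fetch the file from a script rather than through the browser, the following is a minimal sketch using urllib from the Python standard library; it assumes the UCI URL from step 1 is still reachable:

import urllib.request

# download crx.data into the current working directory
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data'
urllib.request.urlretrieve(url, 'crx.data')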

After you've downloaded the dataset, open a Jupyter Notebook or a Python IDE and run the following commands.

  4. Import the required Python libraries:
import random
import pandas as pd
import numpy as np

  5. Load the data with the following command:
data = pd.read_csv('crx.data', header=None)

  6. Create a list with the variable names:
varnames = ['A' + str(s) for s in range(1, 17)]

  7. Add the variable names to the dataframe:
data.columns = varnames

  8. Replace the question marks (?) in the dataset with NumPy NaN values:
data = data.replace('?', np.nan)

  9. Recast the numerical variables as float data types:
data['A2'] = data['A2'].astype('float')
data['A14'] = data['A14'].astype('float')

  10. Recode the target variable as binary:
data['A16'] = data['A16'].map({'+': 1, '-': 0})

To demonstrate the recipes in this chapter, we will introduce missing data at random in four additional variables in this dataset.

  11. Add some missing values at random positions in four variables:
random.seed(9001)
values = list(set([random.randint(0, len(data) - 1) for p in range(0, 100)]))
for var in ['A3', 'A8', 'A9', 'A10']:
    data.loc[values, var] = np.nan

With random.randint(), we drew random integers between 0 and len(data) - 1, that is, up to the index of the last observation in the dataset, and used these integers as the dataframe indices at which we introduced the NumPy NaN values.

Setting the seed, as specified in step 11, should allow you to obtain the same results as those provided by the recipes in this chapter.
  12. Save your prepared data:
data.to_csv('creditApprovalUCI.csv', index=False)
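
Before carrying on, you may want to confirm that the missing values from step 11 were introduced and that the file saved in step 12 loads back correctly. A quick check, assuming the steps above ran as written:

# count the NaN values introduced in the four variables in step 11
print(data[['A3', 'A8', 'A9', 'A10']].isnull().sum())

# reload the file saved in step 12 and confirm its shape matches the dataframe in memory
check = pd.read_csv('creditApprovalUCI.csv')
print(check.shape, data.shape)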

Now, you are ready to carry on with the recipes in this chapter.
