Python Data Cleaning Cookbook

By: Michael Walker

Overview of this book

Getting clean data to reveal insights is essential, as directly jumping into data analysis without proper data cleaning may lead to incorrect results. This book shows you tools and techniques that you can apply to clean and handle data with Python. You'll begin by getting familiar with the shape of data by using practices that can be deployed routinely with most data sources. Then, the book teaches you how to manipulate data to get it into a useful form. You'll also learn how to filter and summarize data to gain insights and better understand what makes sense and what does not, along with discovering how to operate on data to address the issues you've identified. Moving on, you'll perform key tasks, such as handling missing values, validating errors, removing duplicate data, monitoring high volumes of data, and handling outliers and invalid dates. Next, you'll cover recipes on using supervised learning and Naive Bayes analysis to identify unexpected values and classification errors, and generate visualizations for exploratory data analysis (EDA) to visualize unexpected values. Finally, you'll build functions and classes that you can reuse without modification when you have new data. By the end of this Python book, you'll be equipped with all the key skills that you need to clean data and diagnose problems within it.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the color images

Conventions used

Sections

Get in touch

Reviews

Chapter 1: Anticipating Data Cleaning Issues when Importing Tabular Data into pandas

Technical requirements

Importing CSV files

Importing Excel files

Importing data from SQL databases

Importing SPSS, Stata, and SAS data

Importing R data

Persisting tabular data

Chapter 2: Anticipating Data Cleaning Issues when Importing HTML and JSON into pandas

Technical requirements

Importing simple JSON data

Importing more complicated JSON data from an API

Importing data from web pages

Persisting JSON data

Chapter 3: Taking the Measure of Your Data

Technical requirements

Getting a first look at your data

Selecting and organizing columns

Selecting rows

Generating frequencies for categorical variables

Generating summary statistics for continuous variables

Chapter 4: Identifying Missing Values and Outliers in Subsets of Data

Technical requirements

Finding missing values

Identifying outliers with one variable

Identifying outliers and unexpected values in bivariate relationships

Using subsetting to examine logical inconsistencies in variable relationships

Using linear regression to identify data points with significant influence

Using k-nearest neighbor to find outliers

Using Isolation Forest to find anomalies

Chapter 5: Using Visualizations for the Identification of Unexpected Values

Technical requirements

Using histograms to examine the distribution of continuous variables

Using boxplots to identify outliers for continuous variables

Using grouped boxplots to uncover unexpected values in a particular group

Examining both the distribution shape and outliers with violin plots

Using scatter plots to view bivariate relationships

Using line plots to examine trends in continuous variables

Generating a heat map based on a correlation matrix

Chapter 6: Cleaning and Exploring Data with Series Operations

Technical requirements

Getting values from a pandas series

Showing summary statistics for a pandas series

Changing series values

Changing series values conditionally

Evaluating and cleaning string series data

Working with dates

Identifying and cleaning missing data

Missing value imputation with K-nearest neighbor

Chapter 7: Fixing Messy Data when Aggregating

Technical requirements

Looping through data with itertuples (an anti-pattern)

Calculating summaries by group with NumPy arrays

Using groupby to organize data by groups

Using more complicated aggregation functions with groupby

Using user-defined functions and apply with groupby

Using groupby to change the unit of analysis of a DataFrame

Chapter 8: Addressing Data Issues When Combining DataFrames

Technical requirements

Combining DataFrames vertically

Doing one-to-one merges

Using multiple merge-by columns

Doing one-to-many merges

Doing many-to-many merges

Developing a merge routine

Chapter 9: Tidying and Reshaping Data

Technical requirements

Removing duplicated rows

Fixing many-to-many relationships

Using stack and melt to reshape data from wide to long format

Melting multiple groups of columns

Using unstack and pivot to reshape data from long to wide

Chapter 10: User-Defined Functions and Classes to Automate Data Cleaning

Technical requirements

Functions for getting a first look at our data

Functions for displaying summary statistics and frequencies

Functions for identifying outliers and unexpected values

Functions for aggregating or combining data

Classes that contain the logic for updating series values

Classes that handle non-tabular data structures

Other Books You May Enjoy

Leave a review - let other readers know what you think

Identifying outliers and unexpected values in bivariate relationships

A value can be unexpected even when it is not extreme, that is, even when it does not deviate significantly from the distribution mean. Some values of a variable are unexpected only when a second variable takes certain values. This is easy to illustrate when one variable is categorical and the other is continuous.

The following diagram illustrates the number of bird sightings per day over a period of several years, with a different distribution for each of two sites. One site has a mean of 33 sightings per day, and the other 52. (This is fictional data.) The overall mean (not shown) is 42. What should we make of a value of 58 for daily sightings? Is it an outlier? That clearly depends on which of the two sites was being observed. If there were 58 sightings in a day at site A, 58 would be an unusually high number. Not so for site B, where 58 sightings would not be very different from the mean for that site:

...
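The bird sightings example can be sketched in pandas as a rough illustration. This is not the book's own code. The data is simulated, the column names `site` and `sightings` are hypothetical, and the standard deviation of 6 and the normal shape of each distribution are assumptions, since the text gives only the site means of 33 and 52. The sketch compares how far a value of 58 sits from the overall mean with how far it sits from each site's own mean, measured in standard deviations:

```python
import numpy as np
import pandas as pd

# Simulated (fictional) daily sightings. The means of 33 and 52 come from
# the text; the standard deviation of 6 and the normal shape are assumptions.
rng = np.random.default_rng(0)
birds = pd.DataFrame({
    "site": ["A"] * 1000 + ["B"] * 1000,
    "sightings": np.concatenate([
        rng.normal(33, 6, 1000),  # site A
        rng.normal(52, 6, 1000),  # site B
    ]),
})

value = 58

# Distance of the value from the overall mean, in standard deviations
overall_z = (value - birds["sightings"].mean()) / birds["sightings"].std()

# Distance of the value from each site's own mean, in that site's
# standard deviations
site_stats = birds.groupby("site")["sightings"].agg(["mean", "std"])
site_z = (value - site_stats["mean"]) / site_stats["std"]

print(f"overall z-score: {overall_z:.2f}")
print(site_z.round(2))
```

Under these assumptions, 58 is only modestly above the overall mean and close to typical for site B, but it is several standard deviations above the mean for site A. The same value is unremarkable in one group and unexpected in the other, which is why it can help to examine a continuous variable within the categories of a second variable.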
