Python Data Cleaning Cookbook

By: Michael Walker

Overview of this book

Getting clean data to reveal insights is essential, as directly jumping into data analysis without proper data cleaning may lead to incorrect results. This book shows you tools and techniques that you can apply to clean and handle data with Python. You'll begin by getting familiar with the shape of data by using practices that can be deployed routinely with most data sources. Then, the book teaches you how to manipulate data to get it into a useful form. You'll also learn how to filter and summarize data to gain insights and better understand what makes sense and what does not, along with discovering how to operate on data to address the issues you've identified. Moving on, you'll perform key tasks, such as handling missing values, validating errors, removing duplicate data, monitoring high volumes of data, and handling outliers and invalid dates. Next, you'll cover recipes on using supervised learning and Naive Bayes analysis to identify unexpected values and classification errors, and generate visualizations for exploratory data analysis (EDA) to visualize unexpected values. Finally, you'll build functions and classes that you can reuse without modification when you have new data. By the end of this Python book, you'll be equipped with all the key skills that you need to clean data and diagnose problems within it.

Using stack and melt to reshape data from wide to long format

One type of untidiness that Wickham identified is variable values embedded in column names. Although this rarely happens with enterprise or relational data, it is fairly common with analytical or survey data. Variable names might have suffixes that indicate a time period, such as a month or year. Another case is that similar variables on a survey might have similar names, such as familymember1age, familymember2age, and so on, because that is convenient and consistent with the survey designers' understanding of the variable.
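As a minimal sketch of the pattern described above, the following uses `melt` to move the member number out of the column names and into its own column. The DataFrame, the `householdid` column, and the age values are hypothetical and invented for illustration; only the `familymember1age` and `familymember2age` naming comes from the text.

```python
import pandas as pd

# Hypothetical survey data: one row per household, with each family
# member's age stored in a separate column (made-up values).
wide_df = pd.DataFrame({
    "householdid": [1, 2],
    "familymember1age": [34, 51],
    "familymember2age": [36, 12],
})

# melt keeps householdid as the identifier and turns the remaining
# column names into values of a new "variable" column.
long_df = wide_df.melt(
    id_vars="householdid",
    var_name="variable",
    value_name="age",
)

# Recover the member number that was embedded in the column name.
long_df["familymember"] = (
    long_df["variable"]
    .str.extract(r"familymember(\d+)age", expand=False)
    .astype(int)
)

# Drop the original column-name text and order rows by household and member.
long_df = (
    long_df.drop(columns="variable")
    .sort_values(["householdid", "familymember"])
    .reset_index(drop=True)
)

print(long_df)
```

The result has one row per household member rather than one row per household. Calling `stack` on the same data after setting `householdid` as the index is another way to get a similar long layout, with the former column names moved into an index level instead of a regular column.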

One reason why this messiness happens relatively frequently with survey data is that there can be multiple units of analysis on one survey instrument. An example is the United States decennial census, which asks both household and person questions. Survey data is also sometimes made up of repeated measures or panel data, but nonetheless often has only one row per respondent. When this is the case...
