Python Data Cleaning Cookbook

By: Michael Walker

Overview of this book

Getting clean data to reveal insights is essential, as directly jumping into data analysis without proper data cleaning may lead to incorrect results. This book shows you tools and techniques that you can apply to clean and handle data with Python. You'll begin by getting familiar with the shape of data by using practices that can be deployed routinely with most data sources. Then, the book teaches you how to manipulate data to get it into a useful form. You'll also learn how to filter and summarize data to gain insights and better understand what makes sense and what does not, along with discovering how to operate on data to address the issues you've identified. Moving on, you'll perform key tasks, such as handling missing values, validating errors, removing duplicate data, monitoring high volumes of data, and handling outliers and invalid dates. Next, you'll cover recipes on using supervised learning and Naive Bayes analysis to identify unexpected values and classification errors, and generate visualizations for exploratory data analysis (EDA) to visualize unexpected values. Finally, you'll build functions and classes that you can reuse without modification when you have new data. By the end of this Python book, you'll be equipped with all the key skills that you need to clean data and diagnose problems within it.
Table of Contents (12 chapters)

Importing simple JSON data

JavaScript Object Notation (JSON) has turned out to be an incredibly useful standard for transferring data from one machine, process, or node to another. Often a client sends a data request to a server; the server then queries the data in local storage and converts it from something like a SQL Server table or tables into JSON, which the client can consume. This is sometimes complicated further by the first server (say, a web server) forwarding the request to a database server. JSON facilitates this, as does XML, by doing the following:

  • Being readable by humans
  • Being consumable by most client devices
  • Not being limited in structure

JSON is quite flexible, which means that it can accommodate just about anything. The structure can even change within a JSON file, so different keys might be present at different points. For example, the file might begin with some explanatory keys that have a very different structure than the remaining data keys. Or some keys might be present in some cases, but not others. We go over some approaches for dealing with that messiness (uh, I mean flexibility).
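The kind of key variation described above can be illustrated with a tiny made-up snippet (the field names here are hypothetical, chosen to resemble the news story data used later in this recipe):

```python
import json

# A made-up JSON snippet: the first item is an explanatory record,
# and the remaining items have different key sets from each other.
raw = '''
[
  {"note": "collected 2020-01-01", "version": 2},
  {"title": "Story A", "source": "Example News", "story_position": 1},
  {"title": "Story B", "source": "Example News"}
]
'''

items = json.loads(raw)
# Every item parses fine, but the keys differ from item to item.
keysets = [sorted(item.keys()) for item in items]
print(keysets)
```

All three items are valid JSON objects, which is exactly why we need to check the structure before loading such data into a tabular tool like pandas.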

Getting ready

We are going to work with data on news stories about political candidates in this recipe. This data is made available for public use. I have combined the JSON files from that source into one file and randomly selected 60,000 news stories from the combined data. This sample (allcandidatenewssample.json) is available in the GitHub repository of this book.

We will do a little work with list and dictionary comprehensions in this recipe. DataCamp has good guides to list comprehensions and dictionary comprehensions if you are feeling a little rusty.
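As a quick refresher, here is a minimal example of each comprehension form (the values are arbitrary, chosen only for illustration):

```python
# List comprehension: build a list of squared values.
squares = [n * n for n in range(5)]

# Dictionary comprehension: map each word to its length.
lengths = {word: len(word) for word in ["json", "pandas", "data"]}

print(squares)   # [0, 1, 4, 9, 16]
print(lengths)   # {'json': 4, 'pandas': 6, 'data': 4}
```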

How to do it…

We will import a JSON file into pandas after doing some data checking and cleaning:

  1. Import the json and pprint libraries.

    pprint improves the display of the lists and dictionaries that are returned when we load JSON data:

    >>> import pandas as pd
    >>> import numpy as np
    >>> import json
    >>> import pprint
    >>> from collections import Counter
  2. Load the JSON data and look for potential issues.

    Use the json load method to return data on news stories about political candidates. load returns a list of dictionaries. Use len to get the size of the list, which is the total number of news stories in this case. (Each list item is a dictionary with keys for the title, source, and so on, and their respective values.) Use pprint to display the first two dictionaries. Get the value from the source key for the first list item:

    >>> with open('data/allcandidatenewssample.json') as f:
    ...   candidatenews = json.load(f)
    >>> len(candidatenews)
    60000
    >>> pprint.pprint(candidatenews[0:2])
    [{'date': '2019-12-25 10:00:00',
      'domain': '',
      'panel_position': 1,
      'query': 'Michael Bloomberg',
      'source': 'NBC News',
      'story_position': 6,
      'time': '18 hours ago',
      'title': 'Bloomberg cuts ties with company using prison inmates to make '
               'campaign calls',
      'url': ''},
     {'date': '2019-11-09 08:00:00',
      'domain': '',
      'panel_position': 1,
      'query': 'Amy Klobuchar',
      'source': 'Town & Country Magazine',
      'story_position': 3,
      'time': '18 hours ago',
      'title': "Democratic Candidates React to Michael Bloomberg's Potential Run",
      'url': ''}]
    >>> pprint.pprint(candidatenews[0]['source'])
    'NBC News'
  3. Check for differences in the structure of the dictionaries.

    Use Counter to check for any dictionaries in the list with fewer than, or more than, the nine keys that are typical. Look at a few of the dictionaries with almost no data (those with just two keys) before removing them. Confirm that the remaining list of dictionaries has the expected length – 60000-2382=57618:

    >>> Counter([len(item) for item in candidatenews])
    Counter({9: 57202, 2: 2382, 10: 416})
    >>> pprint.pprint(next(item for item in candidatenews if len(item)<9))
    {'date': '2019-09-11 18:00:00', 'reason': 'Not collected'}
    >>> pprint.pprint(next(item for item in candidatenews if len(item)>9))
    {'category': 'Satire',
     'date': '2019-08-21 04:00:00',
     'domain': '',
     'panel_position': 1,
     'query': 'John Hickenlooper',
     'source': 'Politics | The Onion',
     'story_position': 8,
     'time': '4 days ago',
     'title': "'And Then There Were 23,' Says Wayne Messam Crossing Out "
              'Hickenlooper Photo \n'
              'In Elaborate Grid Of Rivals',
     'url': ''}
    >>> pprint.pprint([item for item in candidatenews if len(item)==2][0:10])
    [{'date': '2019-09-11 18:00:00', 'reason': 'Not collected'},
     {'date': '2019-07-24 00:00:00', 'reason': 'No Top stories'},
     {'date': '2019-01-03 00:00:00', 'reason': 'No Top stories'}]
    >>> candidatenews = [item for item in candidatenews if len(item)>2]
    >>> len(candidatenews)
    57618
  4. Generate counts from the JSON data.

    Get the dictionaries just for Politico (a website that covers political news) and display a couple of dictionaries:

    >>> politico = [item for item in candidatenews if item["source"] == "Politico"]
    >>> len(politico)
    2732
    >>> pprint.pprint(politico[0:2])
    [{'date': '2019-05-18 18:00:00',
      'domain': '',
      'panel_position': 1,
      'query': 'Marianne Williamson',
      'source': 'Politico',
      'story_position': 7,
      'time': '1 week ago',
      'title': 'Marianne Williamson reaches donor threshold for Dem debates',
      'url': ''},
     {'date': '2018-12-27 06:00:00',
      'domain': '',
      'panel_position': 1,
      'query': 'Julian Castro',
      'source': 'Politico',
      'story_position': 1,
      'time': '1 hour ago',
      'title': "O'Rourke and Castro on collision course in Texas",
      'url': ''}]
  5. Get the source data and confirm that it has the anticipated length.

    Show the first few items in the new sources list. Generate a count of news stories by source and display the 10 most popular sources. Notice that stories from The Hill can have TheHill (without a space) or The Hill as the value for source:

    >>> sources = [item.get('source') for item in candidatenews]
    >>> type(sources)
    <class 'list'>
    >>> len(sources)
    57618
    >>> sources[0:5]
    ['NBC News', 'Town & Country Magazine', 'TheHill', '', 'Fox News']
    >>> pprint.pprint(Counter(sources).most_common(10))
    [('Fox News', 3530),
     ('', 2750),
     ('Politico', 2732),
     ('TheHill', 2383),
     ('The New York Times', 1804),
     ('Washington Post', 1770),
     ('Washington Examiner', 1655),
     ('The Hill', 1342),
     ('New York Post', 1275),
     ('Vox', 941)]
  6. Fix any errors in the values in the dictionary.

    Fix the source values for The Hill. Notice that The Hill is now the most frequent source for news stories:

    >>> for newsdict in candidatenews:
    ...     newsdict.update((k, "The Hill") for k, v in newsdict.items()
    ...      if k == "source" and v == "TheHill")
    >>> sources = [item.get('source') for item in candidatenews]
    >>> pprint.pprint(Counter(sources).most_common(10))
    [('The Hill', 3725),
     ('Fox News', 3530),
     ('', 2750),
     ('Politico', 2732),
     ('The New York Times', 1804),
     ('Washington Post', 1770),
     ('Washington Examiner', 1655),
     ('New York Post', 1275),
     ('Vox', 941),
     ('Breitbart', 799)]
  7. Create a pandas DataFrame.

    Pass the JSON data to the pandas DataFrame method and inspect the data types:

    >>> candidatenewsdf = pd.DataFrame(candidatenews)
    >>> candidatenewsdf.dtypes
    title             object
    url               object
    source            object
    time              object
    date              object
    query             object
    story_position     int64
    panel_position    object
    domain            object
    category          object
    dtype: object
  8. Confirm that we are getting the expected values for source.

    Also, rename the date column and convert it to a datetime data type:

    >>> candidatenewsdf.rename(columns={'date':'storydate'}, inplace=True)
    >>> candidatenewsdf.storydate = candidatenewsdf.storydate.astype('datetime64[ns]')
    >>> candidatenewsdf.shape
    (57618, 10)
    >>> candidatenewsdf.source.value_counts(sort=True).head(10)
    The Hill               3725
    Fox News               3530
                           2750
    Politico               2732
    The New York Times     1804
    Washington Post        1770
    Washington Examiner    1655
    New York Post          1275
    Vox                     941
    Breitbart               799
    Name: source, dtype: int64

We now have a pandas DataFrame with only the news stories where there is meaningful data, and with the values for source fixed.

How it works…

The json.load method returns a list of dictionaries. This makes it possible to use a number of familiar tools when working with this data: list methods, slicing, list comprehensions, dictionary updates, and so on. There are times, such as when you just need to populate a list or count the number of items in a given category, when there is no need to use pandas.

In steps 2 to 6, we use list methods to do many of the same checks we have done with pandas in previous recipes. In step 3 we use Counter with a list comprehension (Counter([len(item) for item in candidatenews])) to get the number of keys in each dictionary. This tells us that there are 2,382 dictionaries with just 2 keys and 416 with 10. We use next to look for an example of dictionaries with fewer than 9 keys or more than 9 keys to get a sense of the structure of those items. We use slicing to show 10 dictionaries with 2 keys to see if there is any data in those dictionaries. We then select only those dictionaries with more than 2 keys.

In step 4 we create a subset of the list of dictionaries, one that just has source equal to Politico, and take a look at a couple of items. We then create a list with just the source data and use Counter to list the 10 most common sources in step 5.

Step 6 demonstrates how to replace key values conditionally in a list of dictionaries. In this case, we update the key value to The Hill whenever key (k) is source and value (v) is TheHill. The for k, v in newsdict.items() section is the unsung hero of this line. It loops through all key/value pairs for all dictionaries in candidatenews.

It is easy to create a pandas DataFrame by passing the list of dictionaries to the pandas DataFrame method. We do this in step 7. The main complication is that we need to convert the date column from a string to a date, since dates are just strings in JSON.

There's more…

In steps 5 and 6 we use item.get('source') instead of item['source']. This is handy when there might be missing keys in a dictionary. get returns None when the key is missing, but we can use an optional second argument to specify a value to return.

I renamed the date column to storydate in step 8. This is not necessary, but is a good idea. Not only does date not tell you anything about what the dates actually represent, it is also so generic a column name that it is bound to cause problems at some point.

The news stories data fits nicely into a tabular structure. It makes sense to represent each list item as one row, and the key/value pairs as columns and column values for that row. There are no significant complications, such as key values that are themselves lists of dictionaries. Imagine an authors key for each story with a list item for each author as the key value, and that list item is a dictionary of information about the author. This is not at all unusual when working with JSON data in Python. The next recipe shows how to work with data structured in this way.
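To make the nested case concrete, here is a small sketch using entirely made-up story records with a nested authors key, flattened with pandas' json_normalize (one way, among others, to handle such structures):

```python
import pandas as pd

# Hypothetical nested records: each story has an authors key whose
# value is a list of dictionaries describing the authors.
stories = [
    {"title": "Story A", "source": "Example News",
     "authors": [{"name": "A. Writer", "role": "reporter"},
                 {"name": "B. Editor", "role": "editor"}]},
    {"title": "Story B", "source": "Example News",
     "authors": [{"name": "C. Columnist", "role": "columnist"}]},
]

# json_normalize flattens the nested list: one row per author,
# with story-level fields repeated via the meta argument.
authorsdf = pd.json_normalize(stories, record_path="authors",
                              meta=["title", "source"])
print(authorsdf)
```

The result has one row per author (three rows here), with the story-level title and source repeated on each author's row.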