Book Image

Python Data Cleaning Cookbook

By : Michael Walker
Book Image

Python Data Cleaning Cookbook

By: Michael Walker

Overview of this book

Getting clean data to reveal insights is essential, as directly jumping into data analysis without proper data cleaning may lead to incorrect results. This book shows you tools and techniques that you can apply to clean and handle data with Python. You'll begin by getting familiar with the shape of data by using practices that can be deployed routinely with most data sources. Then, the book teaches you how to manipulate data to get it into a useful form. You'll also learn how to filter and summarize data to gain insights and better understand what makes sense and what does not, along with discovering how to operate on data to address the issues you've identified. Moving on, you'll perform key tasks, such as handling missing values, validating errors, removing duplicate data, monitoring high volumes of data, and handling outliers and invalid dates. Next, you'll cover recipes on using supervised learning and Naive Bayes analysis to identify unexpected values and classification errors, and generate visualizations for exploratory data analysis (EDA) to visualize unexpected values. Finally, you'll build functions and classes that you can reuse without modification when you have new data. By the end of this Python book, you'll be equipped with all the key skills that you need to clean data and diagnose problems within it.
Table of Contents (12 chapters)

Persisting JSON data

There are several reasons why we might want to serialize a JSON file:

  • We may have retrieved the data with an API, but need to keep a snapshot of the data.
  • The data in the JSON file is relatively static and informs our data cleaning and analysis over multiple phases of a project.
  • We might decide that the flexibility of a schema-less format such as JSON helps us solve many data cleaning and analysis problems.

It is worth highlighting this last reason to use JSON – that it can solve many data problems. Although tabular data structures clearly have many benefits, particularly for operational data, they are often not the best way to store data for analysis purposes. In preparing data for analysis, a substantial amount of time is spent either merging data from different tables or dealing with data redundancy when working with flat files. Not only are these processes time consuming, but every merge or reshaping leaves the door open to a data error of broad scope. This can also mean that we end up paying too much attention to the mechanics of manipulating data and too little to the conceptual issues at the core of our work.

We return to the Cleveland Museum of Art collections data in this recipe. There are at least three possible units of analysis for this data file – the collection item level, the creator level, and the citation level. JSON allows us to nest citations and creators within collections. (You can examine the structure of the JSON file in the Getting ready section of this recipe.) This data cannot be persisted in a tabular structure without flattening the file, which we did in an earlier recipe in this chapter. In this recipe, we will use two different methods to persist JSON data, each with its own advantages and disadvantages.

Getting ready

We will be working with data on the Cleveland Museum of Art's collection of works by African-American artists. The following is the structure of the JSON data returned by the API. It has been abbreviated to save space:

{"info": { "total": 778, "parameters": {"african_american_artists": "" }}, 
"data": [
"id": 165157, 
"accession_number": "2007.158", 
"title": "Fulton and Nostrand", 
"creation_date": "1958", 
"citations": [
   "citation": "Annual Exhibition: Sculpture, Paintings...", 
   "page_number": "Unpaginated, [8],[12]", 
   "url": null
   "citation": "\"Moscow to See Modern U.S. Art,\"<em> New York...",   
   "page_number": "P. 60",
   "url": null
"creators": [
     "description": "Jacob Lawrence (American, 1917-2000)", 
     "extent": null, 
     "qualifier": null, 
     "role": "artist", 
     "birth_year": "1917", 
     "death_year": "2000"

How to do it...

We will serialize the JSON data using two different methods:

  1. Load the pandas, json, pprint, requests, and msgpack libraries:
    >>> import pandas as pd
    >>> import json
    >>> import pprint
    >>> import requests
    >>> import msgpack
  2. Load the JSON data from an API. I have abbreviated the JSON output:
    >>> response = requests.get("")
    >>> camcollections = json.loads(response.text)
    >>> print(len(camcollections['data']))
    >>> pprint.pprint(camcollections['data'][0])
    {'accession_number': '2007.158',
     'catalogue_raisonne': None,
     'citations': [{'citation': 'Annual Exhibition: Sculpture...',
                    'page_number': 'Unpaginated, [8],[12]',
                    'url': None},
                   {'citation': '"Moscow to See Modern U.S....',
                    'page_number': 'P. 60',
                    'url': None}]
     'collection': 'American - Painting',
     'creation_date': '1958',
     'creators': [{'biography': 'Jacob Lawrence (born 1917)...',
                   'birth_year': '1917',
                   'description': 'Jacob Lawrence (American...',
                   'role': 'artist'}],
     'type': 'Painting'}
  3. Save and reload the JSON file using Python's json library.

    Persist the JSON data in human-readable form. Reload it from the saved file and confirm that it worked by retrieving the creators data from the first collections item:

    >>> with open("data/camcollections.json","w") as f:
    ...   json.dump(camcollections, f)
    >>> with open("data/camcollections.json","r") as f:
    ...   camcollections = json.load(f)
    >>> pprint.pprint(camcollections['data'][0]['creators'])
    [{'biography': 'Jacob Lawrence (born 1917) has been a prominent artist since...'
      'birth_year': '1917',
      'description': 'Jacob Lawrence (American, 1917-2000)',
      'role': 'artist'}]
  4. Save and reload the JSON file using msgpack:
    >>> with open("data/camcollections.msgpack", "wb") as outfile:
    ...     packed = msgpack.packb(camcollections)
    ...     outfile.write(packed)
    >>> with open("data/camcollections.msgpack", "rb") as data_file:
    ...     msgbytes =
    >>> camcollections = msgpack.unpackb(msgbytes)
    >>> pprint.pprint(camcollections['data'][0]['creators'])
    [{'biography': 'Jacob Lawrence (born 1917) has been a prominent...',
      'birth_year': '1917',
      'death_year': '2000',
      'description': 'Jacob Lawrence (American, 1917-2000)',
      'role': 'artist'}]

How it works…

We use the Cleveland Museum of Art's collections API to retrieve collections items. The african_american_artists flag in the query string indicates that we just want collections for those creators. json.loads returns a dictionary called info and a list of dictionaries called data. We check the length of the data list. This tells us that there are 778 items in collections. We then display the first item of collections to get a better look at the structure of the data. (I have abbreviated the JSON output.)

We save and then reload the data using Python's JSON library in step 3. The advantage of persisting the data in this way is that it keeps the data in human-readable form. Unfortunately, it has two disadvantages: saving takes longer than alternative serialization methods, and it uses more storage space.

In step 4, we use msgpack to persist our data. This is faster than Python's json library, and the saved file uses less space. Of course, the disadvantage is that the resulting JSON is binary rather than text-based.

There's more…

I use both methods for persisting JSON data in my work. When I am working with small amounts of data, and that data is relatively static, I prefer human-readable JSON. A great use case for this is the recipes in the previous chapter where we needed to create value labels.

I use msgpack when I am working with large amounts of data, where that data changes regularly. msgpack files are also great when you want to take regular snapshots of key tables in enterprise databases.

The Cleveland Museum of Art's collections data is similar in at least one important way to the data we work with every day. The unit of analysis frequently changes. Here we are looking at collections, citations, and creators. In our work, we might have to simultaneously look at students and courses, or households and deposits. An enterprise database system for the museum data would likely have separate collections, citations, and creators tables that we would eventually need to merge. The resulting merged file would have data redundancy issues that we would need to account for whenever we changed the unit of analysis.

When we alter our data cleaning process to work directly from JSON or parts of it, we end up eliminating a major source of errors. We do more data cleaning with JSON in the Classes that handle non-tabular data structures recipe in Chapter 10, User-Defined Functions and Classes to Automate Data Cleaning.