Importing more complicated JSON data from an API

In the previous recipe, we discussed one significant advantage (and challenge) of working with JSON data – its flexibility. A JSON file can have just about any structure its authors can imagine. This often means that JSON data lacks the tabular structure of the data sources we have discussed so far – the structure that pandas DataFrames have. Often, analysts and application developers use JSON precisely because it does not insist on a tabular structure. I know I do!

Retrieving data from multiple tables often requires us to do a one-to-many merge. Saving that data to one table or file means duplicating data on the "one" side of the one-to-many relationship. For example, student demographic data is merged with data on the courses studied, and the demographic data is repeated for each course. With JSON, duplication is not required to capture these items of data in one file. We can have data on the courses studied nested within the data for each student.
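
For example, course data can be nested within each student's record instead of being repeated across rows. The following is a minimal hypothetical sketch – the id, name, and course values are invented for illustration:

>>> student = {
...   "id": 1001,
...   "name": "A. Student",
...   "courses": [
...     {"course": "Biology 101", "grade": "A"},
...     {"course": "Statistics 201", "grade": "B+"}
...   ]
... }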

But doing analysis with JSON structured in this way will eventually require us to either: 1) manipulate the data in a very different way than we are used to doing; or 2) convert the JSON to a tabular form. We examine the first approach in the Classes that handle non-tabular data structures recipe in Chapter 10, User-Defined Functions and Classes to Automate Data Cleaning. This recipe takes the second approach. It uses a very handy tool for converting selected nodes of JSON to a tabular structure – json_normalize.

We first use an API to get JSON data because that is how JSON is frequently consumed. One advantage of retrieving the data with an API, rather than working from a file we have saved locally, is that it is easier to rerun our code when the source data is refreshed.

Getting ready

This recipe assumes you have the requests library already installed. If it is not, you can install it with pip. From the terminal (or PowerShell in Windows), enter pip install requests. The pprint module is part of the Python standard library, so there is nothing to install for it.

The following is the structure of the JSON file that is created when using the collections API of the Cleveland Museum of Art. There is a helpful info section at the beginning, but we are interested in the data section. This data does not fit nicely into a tabular data structure. There may be several citations objects and several creators objects for each collection object. I have abbreviated the JSON file to save space:

{"info": { "total": 778, "parameters": {"african_american_artists": "" }}, 
"data": [
{
"id": 165157, 
"accession_number": "2007.158", 
"title": "Fulton and Nostrand", 
"creation_date": "1958", 
"citations": [
  {
   "citation": "Annual Exhibition: Sculpture, Paintings...", 
   "page_number": "Unpaginated, [8],[12]", 
   "url": null
   }, 
  {
   "citation": "\"Moscow to See Modern U.S. Art,\"<em> New York...",   
   "page_number": "P. 60",
   "url": null
  }],
"creators": [
      {
     "description": "Jacob Lawrence (American, 1917-2000)", 
     "extent": null, 
     "qualifier": null, 
     "role": "artist", 
     "birth_year": "1917", 
     "death_year": "2000"
     }
  ]
 }
 ]
}

Note

The API used in this recipe is provided by the Cleveland Museum of Art. It is available for public use at https://openaccess-api.clevelandart.org/.

How to do it...

Create a DataFrame from the museum's collections data with one row for each citation, and the title and creation_date duplicated:

  1. Import the json, requests, and pprint libraries, along with pandas and numpy.

    We need the requests library to use an API to retrieve JSON data. pprint improves the display of lists and dictionaries:

    >>> import pandas as pd
    >>> import numpy as np
    >>> import json
    >>> import pprint
    >>> import requests
  2. Use an API to load the JSON data.

    Make a GET request to the collections API of the Cleveland Museum of Art. Use the query string to indicate that you just want collections from African-American artists. Display the first collection item. I have truncated the output for the first item to save space:

    >>> response = requests.get("https://openaccess-api.clevelandart.org/api/artworks/?african_american_artists")
    >>> camcollections = json.loads(response.text)
    >>> print(len(camcollections['data']))
    778
    >>> pprint.pprint(camcollections['data'][0])
    {'accession_number': '2007.158',
     'catalogue_raisonne': None,
     'citations': [{'citation': 'Annual Exhibition: Sculpture...',
                    'page_number': 'Unpaginated, [8],[12]',
                    'url': None},
                   {'citation': '"Moscow to See Modern U.S....',
                    'page_number': 'P. 60',
                    'url': None}],
     'collection': 'American - Painting',
     'creation_date': '1958',
     'creators': [{'biography': 'Jacob Lawrence (born 1917)...',
                   'birth_year': '1917',
                   'description': 'Jacob Lawrence (American...)',
                   'role': 'artist'}],
     'type': 'Painting'}
  3. Flatten the JSON data.

    Create a DataFrame from the JSON data using the json_normalize function. Indicate that the number of citations will determine the number of rows, and that accession_number, title, creation_date, collection, creators, and type will be repeated. Observe that the data has been flattened by displaying the first two observations, transposing them with .T to make them easier to view:

    >>> camcollectionsdf = pd.json_normalize(camcollections['data'],
    ...   'citations', ['accession_number','title','creation_date',
    ...   'collection','creators','type'])
    >>> camcollectionsdf.head(2).T
                              0                       1
    citation        Annual Exhibiti...  "Moscow to See Modern...
    page_number           Unpaginated,                     P. 60
    url                          None                       None
    accession_number         2007.158                   2007.158
    title            Fulton and No...           Fulton and No...
    creation_date                1958                       1958
    collection       American - Pa...           American - Pa...
    creators   [{'description': 'J...     [{'description': 'J...
    type                     Painting                   Painting
  4. Pull the birth_year value from creators:
    >>> creator = camcollectionsdf[:1].creators[0]
    >>> type(creator[0])
    <class 'dict'>
    >>> pprint.pprint(creator)
    [{'biography': 'Jacob Lawrence (born 1917) has been a prominent art...',
      'birth_year': '1917',
      'death_year': '2000',
      'description': 'Jacob Lawrence (American, 1917-2000)',
      'extent': None,
      'name_in_original_language': None,
      'qualifier': None,
      'role': 'artist'}]
    >>> camcollectionsdf['birthyear'] = camcollectionsdf.\
    ...   creators.apply(lambda x: x[0]['birth_year'])
    >>> camcollectionsdf.birthyear.value_counts().\
    ...   sort_index().head()
    1821    18
    1886     2
    1888     1
    1892    13
    1899    17
    Name: birthyear, dtype: int64

This gives us a pandas DataFrame with one row for each citation for each collection item, with the collection information (title, creation_date, and so on) duplicated.
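
Incidentally, in step 2 we parsed the response text ourselves with json.loads. The requests library can also do the parsing for us – its response objects have a json method that returns the same dictionary:

>>> camcollections = response.json()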

How it works…

We work with a much more interesting JSON file in this recipe than in the previous one. Each object in the JSON file is an item in the collection of the Cleveland Museum of Art. Nested within each collection item are one or more citations. The only way to capture this information in a tabular DataFrame is to flatten it. There are also one or more dictionaries for creators of the collection item (the artist or artists). That dictionary (or dictionaries) contains the birth_year value that we want.

We want one row for every citation for all collection items. To understand this, imagine that we are working with relational data and have a collections table and a citations table, and that we are doing a one-to-many merge from collections to citations. We do something similar with json_normalize by using citations as the second parameter. That tells json_normalize to create one row for each citation and use the key values in each citation dictionary – for citation, page_number, and url – as data values.

The third parameter in the call to json_normalize has the list of column names for the data that will be repeated with each citation. Notice that accession_number, title, creation_date, collection, creators, and type are repeated in observations one and two, while citation and page_number change. (url happens to be the same value for the first and second citations; otherwise, it would also change.)
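
If the positional parameters are hard to keep straight, the same call can be written using json_normalize's keyword names – record_path for the nested list that drives the rows, and meta for the columns to repeat:

>>> camcollectionsdf = pd.json_normalize(camcollections['data'],
...   record_path='citations',
...   meta=['accession_number','title','creation_date',
...         'collection','creators','type'])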

This still leaves us with the problem of the creators dictionaries (there can be more than one creator). When we ran json_normalize it grabbed the value for each key we indicated (in the third parameter) and stored it in the data for that column and row, whether that value was simple text or a list of dictionaries, as is the case for creators. We take a look at the first (and in this case, only) creators item for the first collections row in step 4, naming it creator. (Note that the creators list is duplicated across all citations for a collection item, just as the values for title, creation_date, and so on are.)

We want the birth year for the first creator for each collection item, which can be found at creator[0]['birth_year']. To create a birthyear series using this, we use apply and a lambda function:

>>> camcollectionsdf['birthyear'] = camcollectionsdf.\
...   creators.apply(lambda x: x[0]['birth_year'])

We take a closer look at lambda functions in Chapter 6, Cleaning and Exploring Data with Series Operations. Here, it is helpful to think of x as representing the creators value in each row, so x[0] gives us the first item in that list, creators[0]. We grab the value from the birth_year key.
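
If you suspect that some collection items might arrive with an empty creators list – an assumption worth guarding against, although every item in this pull has at least one creator – a slightly more defensive version of the lambda avoids an IndexError:

>>> camcollectionsdf['birthyear'] = camcollectionsdf.\
...   creators.apply(lambda x: x[0].get('birth_year') if x else None)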

There's more…

You may have noticed that we left out some of the JSON returned by the API in our call to json_normalize. The first parameter that we passed to json_normalize was camcollections['data']. Effectively, we ignore the info object at the beginning of the JSON data. The information we want does not start until the data object. This is not very different conceptually from the skiprows parameter in the second recipe of the previous chapter. There is sometimes metadata like this at the beginning of JSON files.
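
The info object is still available if we want it. For example, we can confirm that the total reported by the API matches the number of collection items we received – the total key appears in the JSON structure shown earlier:

>>> camcollections['info']['total']
778
>>> len(camcollections['data'])
778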

See also

The preceding recipe demonstrates some useful techniques for doing data integrity checks without pandas, including list operations and comprehensions. Those are all relevant for the data in this recipe as well.
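
For example, a quick list comprehension – a sketch that relies on the id key shown in the abbreviated JSON earlier – surfaces any collection items that arrive without creators:

>>> [item['id'] for item in camcollections['data']
...   if not item.get('creators')]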