Book Image

Python Data Cleaning Cookbook

By : Michael Walker
Book Image

Python Data Cleaning Cookbook

By: Michael Walker

Overview of this book

Getting clean data to reveal insights is essential, as directly jumping into data analysis without proper data cleaning may lead to incorrect results. This book shows you tools and techniques that you can apply to clean and handle data with Python. You'll begin by getting familiar with the shape of data by using practices that can be deployed routinely with most data sources. Then, the book teaches you how to manipulate data to get it into a useful form. You'll also learn how to filter and summarize data to gain insights and better understand what makes sense and what does not, along with discovering how to operate on data to address the issues you've identified. Moving on, you'll perform key tasks, such as handling missing values, validating errors, removing duplicate data, monitoring high volumes of data, and handling outliers and invalid dates. Next, you'll cover recipes on using supervised learning and Naive Bayes analysis to identify unexpected values and classification errors, and generate visualizations for exploratory data analysis (EDA) to visualize unexpected values. Finally, you'll build functions and classes that you can reuse without modification when you have new data. By the end of this Python book, you'll be equipped with all the key skills that you need to clean data and diagnose problems within it.
Table of Contents (12 chapters)

Importing data from web pages

We use Beautiful Soup in this recipe to scrape data from a web page and load that data into pandas. Web scraping is very useful when there is data at a website that is updated regularly, but there is no API. We can rerun our code to generate new data whenever the page is updated.

Unfortunately, the web scrapers we build can be broken when the structure of the targeted page changes. That is less likely to happen with APIs because they are designed for data exchange, and carefully curated with that end in mind. The priority for most web designers is the quality of the display of information, not the reliability and ease of data exchange. This causes data cleaning challenges unique to web scraping, including HTML elements that house the data being in surprising and changing locations, formatting tags that obfuscate the underlying data, and explanatory text that aid data interpretation being difficult to retrieve. In addition to these challenges, scraping presents data cleaning issues that are familiar, such as changing data types in columns, less than ideal headings, and missing values. We deal with data issues that occur most frequently in this recipe.

Getting ready

You will need Beautiful Soup installed to run the code in this recipe. You can install it with pip by entering pip install beautifulsoup4 in a terminal window or Windows PowerShell.

We will scrape data from a web page, find the following table in that page, and load it into a pandas DataFrame:

Figure 2.1 – COVID-19 data from six countries

Figure 2.1 – COVID-19 data from six countries

Note

I created this web page, http://www.alrb.org/datacleaning/covidcaseoutliers.html, based on COVID-19 data for public use from Our World in Data, available at https://ourworldindata.org/coronavirus-source-data.

How to do it…

We scrape the COVID data from the website and do some routine data checks:

  1. Import the pprint, requests, and BeautifulSoup libraries:
    >>> import pandas as pd
    >>> import numpy as np
    >>> import json
    >>> import pprint
    >>> import requests
    >>> from bs4 import BeautifulSoup
  2. Parse the web page and get the header row of the table.

    Use Beautiful Soup's find method to get the table we want and then use find_all to retrieve the elements nested within the th elements for that table. Create a list of column labels based on the text of the th rows:

    >>> webpage = requests.get("http://www.alrb.org/datacleaning/covidcaseoutliers.html")
    >>> bs = BeautifulSoup(webpage.text, 'html.parser')
    >>> theadrows = bs.find('table', {'id':'tblDeaths'}).thead.find_all('th')
    >>> type(theadrows)
    <class 'bs4.element.ResultSet'>
    >>> labelcols = [j.get_text() for j in theadrows]
    >>> labelcols[0] = "rowheadings"
    >>> labelcols
    ['rowheadings', 'Cases', 'Deaths', 'Cases per Million', 'Deaths per Million', 'population', 'population_density', 'median_age', 'gdp_per_capita', 'hospital_beds_per_100k']
  3. Get the data from the table cells.

    Find all of the table rows for the table we want. For each table row, find the th element and retrieve the text. We will use that text for our row labels. Also, for each row, find all the td elements (the table cells with the data) and save text from all of them in a list. This gives us datarows, which has all the numeric data in the table. (You can confirm that it matches the table from the web page.) We then insert the labelrows list (which has the row headings) at the beginning of each list in datarows:

    >>> rows = bs.find('table', {'id':'tblDeaths'}).tbody.find_all('tr')
    >>> datarows = []
    >>> labelrows = []
    >>> for row in rows:
    ...   rowlabels = row.find('th').get_text()
    ...   cells = row.find_all('td', {'class':'data'})
    ...   if (len(rowlabels)>3):
    ...     labelrows.append(rowlabels)
    ...   if (len(cells)>0):
    ...     cellvalues = [j.get_text() for j in cells]
    ...     datarows.append(cellvalues)
    ... 
    >>> pprint.pprint(datarows[0:2])
    [['9,394', '653', '214', '15', '43,851,043', '17', '29', '13,914', '1.9'],
     ['16,642', '668', '1848', '74', '9,006,400', '107', '44', '45,437', '7.4']]
    >>> pprint.pprint(labelrows[0:2])
    ['Algeria', 'Austria']
    >>> 
    >>> for i in range(len(datarows)):
    ...   datarows[i].insert(0, labelrows[i])
    ... 
    >>> pprint.pprint(datarows[0:1])
    [['Algeria','9,394','653','214','15','43,851,043','17','29','13,914','1.9']]
  4. Load the data into pandas.

    Pass the datarows list to the DataFrame method of pandas. Notice that all data is read into pandas with the object data type, and that some data has values that cannot be converted into numeric values in their current form (due to the commas):

    >>> totaldeaths = pd.DataFrame(datarows, columns=labelcols)
    >>> totaldeaths.head()
      rowheadings    Cases Deaths  ... median_age gdp_per_capita  \
    0     Algeria    9,394    653  ...         29         13,914   
    1     Austria   16,642    668  ...         44         45,437   
    2  Bangladesh   47,153    650  ...         28          3,524   
    3     Belgium   58,381   9467  ...         42         42,659   
    4      Brazil  514,849  29314  ...         34         14,103   
    >>> totaldeaths.dtypes
    rowheadings               object
    Cases                     object
    Deaths                    object
    Cases per Million         object
    Deaths per Million        object
    population                object
    population_density        object
    median_age                object
    gdp_per_capita            object
    hospital_beds_per_100k    object
    dtype: object
  5. Fix the column names and convert the data to numeric values.

    Remove spaces from column names. Remove all non-numeric data from the first columns with data, including the commas (str.replace("[^0-9]",""). Convert to numeric values, except for the rowheadings column:

    >>> totaldeaths.columns = totaldeaths.columns.str.replace(" ", "_").str.lower()
    >>> for col in totaldeaths.columns[1:-1]:
    ...   totaldeaths[col] = totaldeaths[col].\
    ...     str.replace("[^0-9]","").astype('int64')
    ... 
    >>> totaldeaths['hospital_beds_per_100k'] = totaldeaths['hospital_beds_per_100k'].astype('float')
    >>> totaldeaths.head()
      rowheadings   cases  deaths  ...  median_age  gdp_per_capita  \
    0     Algeria    9394     653  ...          29           13914   
    1     Austria   16642     668  ...          44           45437   
    2  Bangladesh   47153     650  ...          28            3524   
    3     Belgium   58381    9467  ...          42           42659   
    4      Brazil  514849   29314  ...          34           14103   
    >>> totaldeaths.dtypes
    rowheadings                object
    cases                       int64
    deaths                      int64
    cases_per_million           int64
    deaths_per_million          int64
    population                  int64
    population_density          int64
    median_age                  int64
    gdp_per_capita              int64
    hospital_beds_per_100k    float64
    dtype: object

We have now created a pandas DataFrame from an html table.

How it works…

Beautiful Soup is a very useful tool for finding specific HTML elements in a web page and retrieving text from them. You can get one HTML element with find and get one or more with find_all. The first argument for both find and find_all is the HTML element to get. The second argument takes a Python dictionary of attributes. You can retrieve text from all of the HTML elements you find with get_text.

Some amount of looping is usually necessary to process the elements and text, as with step 2 and step 3. These two statements in step 2 are fairly typical:

>>> theadrows = bs.find('table', {'id':'tblDeaths'}).thead.find_all('th')
>>> labelcols = [j.get_text() for j in theadrows]

The first statement finds all the th elements we want and creates a Beautiful Soup result set called theadrows from the elements it found. The second statement iterates over the theadrows Beautiful Soup result set using the get_text method to get the text from each element, and stores it in the labelcols list.

Step 3 is a little more involved, but makes use of the same Beautiful Soup methods. We find all of the table rows (tr) in the target table (rows = bs.find('table', {'id':'tblDeaths'}).tbody.find_all('tr')). We then iterate over each of those rows, finding the th element and getting the text in that element (rowlabels = row.find('th').get_text()). We also find all of the table cells (td) for each row (cells = row.find_all('td', {'class':'data'}) and get the text from all table cells (cellvalues = [j.get_text() for j in cells]). Note that this code is dependent on the class of the td elements being data. Finally, we insert the row labels we get from the th elements at the beginning of each list in datarows:

>>> for i in range(len(datarows)):
...   datarows[i].insert(0, labelrows[i])

In step 4, we use the DataFrame method to load the list we created in steps 2 and 3 into pandas. We then do some cleaning similar to what we have done in previous recipes in this chapter. We use string replace to remove spaces from column names and to remove all non-numeric data, including commas, from what are otherwise valid numeric values. We convert all columns, except for the rowheadings column, to numeric.

There's more…

Our scraping code is dependent on several aspects of the web page's structure not changing: the ID of the main table, the presence of th tags with column and row labels, and the td elements continuing to have their class equal to data. The good news is that if the structure of the web page does change, this will likely only affect the find and find_all calls. The rest of the code would not need to change.