Book Image

Time Series Analysis with Python Cookbook

By : Tarek A. Atwan
Book Image

Time Series Analysis with Python Cookbook

By: Tarek A. Atwan

Overview of this book

Time series data is everywhere, available at a high frequency and volume. It is complex and can contain noise, irregularities, and multiple patterns, making it crucial to be well-versed with the techniques covered in this book for data preparation, analysis, and forecasting. This book covers practical techniques for working with time series data, starting with ingesting time series data from various sources and formats, whether in private cloud storage, relational databases, non-relational databases, or specialized time series databases such as InfluxDB. Next, you’ll learn strategies for handling missing data, dealing with time zones and custom business days, and detecting anomalies using intuitive statistical methods, followed by more advanced unsupervised ML models. The book will also explore forecasting using classical statistical models such as Holt-Winters, SARIMA, and VAR. The recipes will present practical techniques for handling non-stationary data, using power transforms, ACF and PACF plots, and decomposing time series data with multiple seasonal patterns. Later, you’ll work with ML and DL models using TensorFlow and PyTorch. Finally, you’ll learn how to evaluate, compare, optimize models, and more using the recipes covered in the book.
Table of Contents (18 chapters)

Reading data from URLs

Files can be downloaded and stored locally on your machine, or stored on a remote server or cloud location. In the earlier two recipes, Reading from CSVs and other delimited files, and Reading data from an Excel file, both files were stored locally.

Many of the pandas reader functions can read data from remote locations by passing a URL path. For example, both read_csv() and read_excel() can take a URL to read a file that is accessible via the internet. In this recipe, you will read a CSV file using pandas.read_csv() and Excel files using pandas.read_excel() from remote locations, such as GitHub and AWS S3 (private and public buckets). You will also read data directly from an HTML page into a pandas DataFrame.

Getting ready

You will need to install the AWS SDK for Python (Boto3) for reading files from S3 buckets. Additionally, you will learn how to use the storage_options parameter available in many of the reader functions in pandas to read from S3 without the Boto3 library.

To use an S3 URL (for example, s3://bucket_name/path-to-file) in pandas, you will need to install the s3fs library. You will also need to install an HTML parser for when we use read_html(). For example, for the parsing engine (the HTML parser), you can install either lxml or html5lib; pandas will pick whichever is installed (it will first look for lxml, and if that fails, then for html5lib). If you plan to use html5lib you will need to install Beautiful Soup (beautifulsoup4).

To install using pip, you can use the following command:

>>> pip install boto3 s3fs lxml

To install using Conda, you can use:

>>> conda install boto3 s3fs lxml -y

How to do it…

This recipe will present you with different scenarios when reading data from online (remote) sources. Let's import pandas upfront since you will be using it throughout this recipe:

import pandas as pd

Reading data from GitHub

Sometimes, you may find useful public data on GitHub that you want to use and read directly (without downloading). One of the most common file formats on GitHub are CSV files. Let's start with the following steps:

  1. To read a CSV file from GitHub, you will need the URL to the raw content. If you copy the file's GitHub URL from the browser and use it as the file path, you will get a URL that looks like this: https://github.com/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./blob/main/datasets/Ch2/AirQualityUCI.csv. This URL is a pointer to the web page in GitHub and not the data itself; hence when using pd.read_csv(), it will throw an error:
    url = 'https://github.com/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./blob/main/datasets/Ch2/AirQualityUCI.csv'
    pd.read_csv(url)
    ParserError: Error tokenizing data. C error: Expected 1 fields in line 62, saw 2
  2. Instead, you will need the raw content, which will give you a URL that looks like this: https://raw.githubusercontent.com/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./main/datasets/Ch2/AirQualityUCI.csv:
Figure 2.4 – The GitHub page for the CSV file. Note the View raw button

Figure 2.4 – The GitHub page for the CSV file. Note the View raw button

  1. In Figure 2.4, notice that the values are not comma-separated (not a comma-delimited file); instead, the file uses semicolon (;) to separate the values.

The first column in the file is the Date column. You will need to parse (parse_date parameter) and convert it to DatetimeIndex (index_col parameter).

Pass the new URL to pandas.read_csv():

url = 'https://media.githubusercontent.com/media/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./main/datasets/Ch2/AirQualityUCI.csv'
date_parser = lambda x: pd.to_datetime(x, format="%d/%m/%Y")
df = pd.read_csv(url,
                 delimiter=';',
                 index_col='Date',
                 date_parser=date_parser)
df.iloc[:3,1:4]
>>
            CO(GT)  PT08.S1(CO)  NMHC(GT)
Date
2004-03-10     2.6      1360.00       150
2004-03-10     2.0      1292.25       112
2004-03-10     2.2      1402.00        88

We successfully ingested the data from the CSV file in GitHub into a DataFrame and printed the first three rows of select columns.

Reading data from a public S3 bucket

AWS supports virtual-hosted-style URLs such as https://bucket-name.s3.Region.amazonaws.com/keyname, path-style URLs such as https://s3.Region.amazonaws.com/bucket-name/keyname, and using S3://bucket/keyname. Here are examples of how these different URLs may look for our file:

In this example, you will be reading the AirQualityUCI.xlsx file, which has only one sheet. It contains the same data as AirQualityUCI.csv, which we read earlier from GitHub.

Note that in the URL, you do not need to specify the region as us-east-1. The us-east-1 region, which represents US East (North Virginia), is an exception and will not be the case for other regions:

url = 'https://tscookbook.s3.amazonaws.com/AirQualityUCI.xlsx'
df = pd.read_excel(url,
                   index_col='Date',
                   parse_dates=True)

Read the same file using the S3:// URL:

s3uri = 's3://tscookbook/AirQualityUCI.xlsx'
df = pd.read_excel(s3uri,
                   index_col='Date',
                   parse_dates=True)

You may get an error such as the following:

ImportError: Install s3fs to access S3

This indicates that either you do not have the s3fs library installed or possibly you are not using the right Python/Conda environment.

Reading data from a private S3 bucket

When reading files from a private S3 bucket, you will need to pass your credentials to authenticate. A convenient parameter in many of the I/O functions in pandas is storage_options, which allows you to send additional content with the request, such as a custom header or required credentials to a cloud service.

You will need to pass a dictionary (key-value pair) to provide the additional information along with the request, such as username, password, access keys, and secret keys to storage_options as in {"username": username, "password": password}.

Now, you will read the AirQualityUCI.csv file, located in a private S3 bucket:

  1. You will start by storing your AWS credentials in a config .cfg file outside your Python script. Then, use configparser to read the values and store them in Python variables. You do not want your credentials exposed or hardcoded in your code:
    # Example aws.cfg file
    [AWS]
    aws_access_key=your_access_key
    aws_secret_key=your_secret_key

You can load the aws.cfg file using config.read():

import configparser
config = configparser.ConfigParser()
config.read('aws.cfg')
AWS_ACCESS_KEY = config['AWS']['aws_access_key']
AWS_SECRET_KEY = config['AWS']['aws_secret_key']
  1. The AWS Access Key ID and Secret Access Key are now stored in AWS_ACCESS_KEY and AWS_SECRET_KEY. Use pandas.read_csv() to read the CSV file and update the storage_options parameter by passing your credentials, as shown in the following code:
    s3uri = "s3://tscookbook-private/AirQuality.csv"
    df = pd.read_csv(s3uri,
                     index_col='Date',
                     parse_dates=True,
                     storage_options= {
                             'key': AWS_ACCESS_KEY,
                             'secret': AWS_SECRET_KEY
                         })
    df.iloc[:3, 1:4]
    >>
               CO(GT)  PT08.S1(CO)  NMHC(GT)
    Date
    2004-10-03      2.6       1360.0     150.0
    2004-10-03      2.0       1292.0     112.0
    2004-10-03      2.2       1402.0      88.0
  2. Alternatively, you can use the AWS SDK for Python (Boto3) to achieve similar results. The boto3 Python library gives you more control and additional capabilities (beyond just reading and writing to S3). You will pass the same credentials stored earlier in AWS_ACCESS_KEY and AWS_SECRET_KEY and pass them to AWS, using boto3 to authenticate:
    import boto3
    bucket = "tscookbook-private"
    client = boto3.client("s3",
                      aws_access_key_id =AWS_ACCESS_KEY,
                      aws_secret_access_key = AWS_SECRET_KEY)

Now, the client object has access to many methods specific to the AWS S3 service for creating, deleting, and retrieving bucket information, and more. In addition, Boto3 offers two levels of APIs: client and resource. In the preceding example, you used the client API.

The client is a low-level service access interface that gives you more granular control, for example, boto3.client("s3"). The resource is a higher-level object-oriented interface (an abstraction layer), for example, boto3.resource("s3").

In Chapter 4, Persisting Time Series Data to Files, you will explore the resource API interface when writing to S3. For now, you will use the client interface.

  1. You will use the get_object method to retrieve the data. Just provide the bucket name and a key. The key here is the actual filename:
    data = client.get_object(Bucket=bucket, Key='AirQuality.csv')
    df = pd.read_csv(data['Body'],
                     index_col='Date',
                     parse_dates=True)
         
    df.iloc[:3, 1:4]
    >>
               CO(GT)  PT08.S1(CO)  NMHC(GT)
    Date
    2004-10-03    2,6       1360.0     150.0
    2004-10-03      2       1292.0     112.0
    2004-10-03    2,2       1402.0      88.0

When calling the client.get_object() method, a dictionary (key-value pair) is returned, as shown in the following example:

{'ResponseMetadata': {
'RequestId':'MM0CR3XX5QFBQTSG',
'HostId':'vq8iRCJfuA4eWPgHBGhdjir1x52Tdp80ADaSxWrL4Xzsr
VpebSZ6SnskPeYNKCOd/RZfIRT4xIM=',
'HTTPStatusCode':200,
'HTTPHeaders': {'x-amz-id-2': 'vq8iRCJfuA4eWPgHBGhdjir1x52
Tdp80ADaSxWrL4XzsrVpebSZ6SnskPeYNKCOd/RZfIRT4xIM=',
   'x-amz-request-id': 'MM0CR3XX5QFBQTSG',
   'date': 'Tue, 06 Jul 2021 01:08:36 GMT',
   'last-modified': 'Mon, 14 Jun 2021 01:13:05 GMT',
   'etag': '"2ce337accfeb2dbbc6b76833bc6f84b8"',
   'accept-ranges': 'bytes',
   'content-type': 'binary/octet-stream',
   'server': 'AmazonS3',
   'content-length': '1012427'},
   'RetryAttempts': 0},
   'AcceptRanges': 'bytes',
 'LastModified': datetime.datetime(2021, 6, 14, 1, 13, 5, tzinfo=tzutc()),
 'ContentLength': 1012427,
 'ETag': '"2ce337accfeb2dbbc6b76833bc6f84b8"',
 'ContentType': 'binary/octet-stream',
 'Metadata': {},
 'Body': <botocore.response.StreamingBody at 0x7fe9c16b55b0>}

The content you are interested in is in the response body under the Body key. You passed data['Body'] to the read_csv() function, which loads the response stream (StreamingBody) into a DataFrame.

Reading data from HTML

pandas offers an elegant way to read HTML tables and convert the content into a pandas DataFrame using the pandas.read_html() function:

  1. In the following recipe, we will extract HTML tables from Wikipedia for COVID-19 pandemic tracking cases by country and by territory (https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory):
    url = "https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory"
    results = pd.read_html(url)
    print(len(results))
    >> 71
  2. pandas.read_html() returned a list of DataFrames, one for each HTML table found in the URL. Keep in mind that the website's content is dynamic and gets updated regularly, and the results may vary. In our case, it returned 71 DataFrames. The DataFrame at index 15 contains summary on COVID-19 cases and deaths by region. Grab the DataFrame (at index 15) and assign it to the df variable, and print the returned columns:
    df = results[15]
    df.columns
    >> Index(['Region[28]', 'Total cases', 'Total deaths', 'Cases per million',
           'Deaths per million', 'Current weekly cases', 'Current weekly deaths',
           'Population millions', 'Vaccinated %[29]'],
          dtype='object')
  3. Display the first five rows for Total cases, Total deaths, and the Cases per million columns.
    df[['Total cases', 'Total deaths', 'Cases per million']].head()
    >>
         Total cases    Total deaths    Cases per million
    0    139300788      1083815         311412
    1    85476396       1035884         231765
    2    51507114       477420          220454
    3    56804073       1270477         132141
    4    21971862       417507          92789

How it works…

Most of the pandas reader functions accept a URL as a path. Examples include the following:

  • pandas.read_csv()
  • pandas.read_excel()
  • pandas.read_parquet()
  • pandas.read_table()
  • pandas.read_pickle()
  • pandas.read_orc()
  • pandas.read_stata()
  • pandas.read_sas()
  • pandas.read_json()

The URL needs to be one of the valid URL schemes that pandas supports, which includes http and https, ftp, s3, gs, or the file protocol.

The read_html() function is great for scraping websites that contain data in HTML tables. It inspects the HTML and searches for all the <table> elements within the HTML. In HTML, table rows are defined with the <tr> </tr> tags and headers with the <th></th> tags. The actual data (cell) is contained within the <td> </td> tags. The read_html() function looks for <table>, <tr>, <th>, and <td> tags and converts the content into a DataFrame, and assigns the columns and rows as they were defined in the HTML. If an HTML page contains more than one <table></table> tag, read_html will return them all and you will get a list of DataFrames.

The following code demonstrates how pandas.read_html()works:

import pandas as pd
html = """
 <table>
   <tr>
     <th>Ticker</th>
     <th>Price</th>
   </tr>
   <tr>
     <td>MSFT</td>
     <td>230</td>
   </tr>
   <tr>
     <td>APPL</td>
     <td>300</td>
   </tr>
     <tr>
     <td>MSTR</td>
     <td>120</td>
   </tr>
 </table>
 </body>
 </html>
 """
   
df = pd.read_html(html)
df[0]
>>
  Ticker  Price
0   MSFT    230
1   APPL    300
2   MSTR    120

In the preceding code, the read_html() function parsed the HTML code and converted the HTML table into a pandas DataFrame. The headers between the <th> and </th> tags represent the column names of the DataFrame, and the content between the <tr><td> and </td></tr> tags represent the row data of the DataFrame. Note that if you go ahead and delete the <table> and </table> table tags, you will get the ValueError: No tables found error.

There's more…

The read_html() function has an optional attr argument, which takes a dictionary of valid HTML <table> attributes, such as id or class. For example, you can use the attr parameter to narrow down the tables returned to those that match the class attribute sortable as in <table class="sortable">. The read_html function will inspect the entire HTML page to ensure you target the right set of attributes.

In the previous exercise, you used the read_html function on the COVID-19 Wikipedia page, and it returned 71 tables (DataFrames). The number of tables will probably increase as time goes by as Wikipedia gets updated. You can narrow down the result set and guarantee some consistency by using the attr option. First, start by inspecting the HTML code using your browser. You will see that several of the <table> elements have multiple classes listed, such as sortable. You can look for other unique identifiers.

<table class="wikitable sortable mw-datatable covid19-countrynames jquery-tablesorter" id="thetable" style="text-align:right;">

Note, if you get the error html5lib not found, please install it you will need to install both html5lib and beautifulSoup4.

To install using conda, use the following:

conda install html5lib beautifulSoup4

To install using pip, use the following:

pip install html5lib beautifulSoup4

Now, let's use the sortable class and request the data again:

url = "https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory"
df = pd.read_html(url, attrs={'class': 'sortable'})
len(df)
>>  7
df[3].columns
>>
Index(['Region[28]', 'Total cases', 'Total deaths', 'Cases per million',
       'Deaths per million', 'Current weekly cases', 'Current weekly deaths',
       'Population millions', 'Vaccinated %[29]'],
      dtype='object')

The list returned a smaller subset of tables (from 71 down to 7).

See also

For more information, please refer to the official pandas.read_html documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_html.html.