Time Series Analysis with Python Cookbook

By : Tarek A. Atwan

Time Series Analysis with Python Cookbook

By: Tarek A. Atwan

Overview of this book

Time series data is everywhere, available at a high frequency and volume. It is complex and can contain noise, irregularities, and multiple patterns, making it crucial to be well-versed with the techniques covered in this book for data preparation, analysis, and forecasting. This book covers practical techniques for working with time series data, starting with ingesting time series data from various sources and formats, whether in private cloud storage, relational databases, non-relational databases, or specialized time series databases such as InfluxDB. Next, you’ll learn strategies for handling missing data, dealing with time zones and custom business days, and detecting anomalies using intuitive statistical methods, followed by more advanced unsupervised ML models. The book will also explore forecasting using classical statistical models such as Holt-Winters, SARIMA, and VAR. The recipes will present practical techniques for handling non-stationary data, using power transforms, ACF and PACF plots, and decomposing time series data with multiple seasonal patterns. Later, you’ll work with ML and DL models using TensorFlow and PyTorch. Finally, you’ll learn how to evaluate, compare, optimize models, and more using the recipes covered in the book.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Chapter 1: Getting Started with Time Series Analysis

Technical requirements

Development environment setup

Installing Python libraries

Installing JupyterLab and JupyterLab extensions

Free Chapter

Chapter 2: Reading Time Series Data from Files

Technical requirements

Reading data from CSVs and other delimited files

Reading data from an Excel file

Reading data from URLs

Reading data from a SAS dataset

Chapter 3: Reading Time Series Data from Databases

Technical requirements

Reading data from a relational database

Reading data from Snowflake

Reading data from a document database (MongoDB)

Reading third-party financial data using APIs

Reading data from a time series database (InfluxDB)

Chapter 4: Persisting Time Series Data to Files

Technical requirements

Serializing time series data with pickle

Writing to CSV and other delimited files

Writing data to an Excel file

Storing data to S3

Chapter 5: Persisting Time Series Data to Databases

Technical requirements

Writing time series data to a relational database (PostgreSQL and MySQL)

Writing time series data to MongoDB

Writing time series data to InfluxDB

Writing time series data to Snowflake

Chapter 6: Working with Date and Time in Python

Technical requirements

Working with DatetimeIndex

Providing a format argument to DateTime

Working with Unix epoch timestamps

Working with time deltas

Converting DateTime with time zone information

Working with date offsets

Working with custom business days

Chapter 7: Handling Missing Data

Technical requirements

Understanding missing data

Performing data quality checks

Handling missing data with univariate imputation using pandas

Handling missing data with univariate imputation using scikit-learn

Handling missing data with multivariate imputation

Handling missing data with interpolation

Chapter 8: Outlier Detection Using Statistical Methods

Technical requirements

Understanding outliers

Resampling time series data

Detecting outliers using visualizations

Detecting outliers using the Tukey method

Detecting outliers using a z-score

Detecting outliers using a modified z-score

Chapter 9: Exploratory Data Analysis and Diagnosis

Technical requirements

Plotting time series data using pandas

Plotting time series data with interactive visualizations using hvPlot

Decomposing time series data

Detecting time series stationarity

Applying power transformations

Testing for autocorrelation in time series data

Chapter 10: Building Univariate Time Series Models Using Statistical Methods

Technical requirements

Plotting ACF and PACF

Forecasting univariate time series data with exponential smoothing

Forecasting univariate time series data with non-seasonal ARIMA

Forecasting univariate time series data with seasonal ARIMA

Chapter 11: Additional Statistical Modeling Techniques for Time Series

Technical requirements

Forecasting time series data using auto_arima

Forecasting time series data using Facebook Prophet

Forecasting multivariate time series data using VAR

Evaluating vector autoregressive (VAR) models

Forecasting volatility in financial time series data with GARCH

Chapter 12: Forecasting Using Supervised Machine Learning

Technical requirements

Understanding supervised machine learning

Preparing time series data for supervised learning

One-step forecasting using linear regression models with scikit-learn

Multi-step forecasting using linear regression models with scikit-learn

Forecasting using non-linear models with sktime

Optimizing a forecasting model with hyperparameter tuning

Forecasting with exogenous variables and ensemble learning

Chapter 13: Deep Learning for Time Series Forecasting

Technical requirements

Understanding artificial neural networks

Forecasting with an RNN using Keras

Forecasting with LSTM using Keras

Forecasting with a GRU using Keras

Forecasting with an RNN using PyTorch

Forecasting with LSTM using PyTorch

Forecasting with a GRU using PyTorch

Chapter 14: Outlier Detection Using Unsupervised Machine Learning

Technical requirements

Detecting outliers using KNN

Detecting outliers using LOF

Detecting outliers using iForest

Detecting outliers using One-Class Support Vector Machine (OCSVM)

Detecting outliers using COPOD

Detecting outliers with PyCaret

Chapter 15: Advanced Techniques for Complex Time Series

Technical requirements

Understanding state-space models

Decomposing time series with multiple seasonal patterns using MSTL

Forecasting with multiple seasonal patterns using the Unobserved Components Model (UCM)

Forecasting time series with multiple seasonal patterns using Prophet

Forecasting time series with multiple seasonal patterns using NeuralProphet

Index

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts

Customer Reviews

5 star

4 star

3 star

2 star

1 star

Reading data from URLs

Files can be downloaded and stored locally on your machine, or stored on a remote server or cloud location. In the earlier two recipes, Reading from CSVs and other delimited files, and Reading data from an Excel file, both files were stored locally.

Many of the pandas reader functions can read data from remote locations by passing a URL path. For example, both read_csv() and read_excel() can take a URL to read a file that is accessible via the internet. In this recipe, you will read a CSV file using pandas.read_csv() and Excel files using pandas.read_excel() from remote locations, such as GitHub and AWS S3 (private and public buckets). You will also read data directly from an HTML page into a pandas DataFrame.

Getting ready

You will need to install the AWS SDK for Python (Boto3) for reading files from S3 buckets. Additionally, you will learn how to use the storage_options parameter available in many of the reader functions in pandas to read from S3 without the Boto3 library.

To use an S3 URL (for example, s3://bucket_name/path-to-file) in pandas, you will need to install the s3fs library. You will also need to install an HTML parser for when we use read_html(). For example, for the parsing engine (the HTML parser), you can install either lxml or html5lib; pandas will pick whichever is installed (it will first look for lxml, and if that fails, then for html5lib). If you plan to use html5lib you will need to install Beautiful Soup (beautifulsoup4).

To install using pip, you can use the following command:

>>> pip install boto3 s3fs lxml

To install using Conda, you can use:

>>> conda install boto3 s3fs lxml -y

How to do it…

This recipe will present you with different scenarios when reading data from online (remote) sources. Let's import pandas upfront since you will be using it throughout this recipe:

import pandas as pd

Reading data from GitHub

Sometimes, you may find useful public data on GitHub that you want to use and read directly (without downloading). One of the most common file formats on GitHub are CSV files. Let's start with the following steps:

To read a CSV file from GitHub, you will need the URL to the raw content. If you copy the file's GitHub URL from the browser and use it as the file path, you will get a URL that looks like this: https://github.com/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./blob/main/datasets/Ch2/AirQualityUCI.csv. This URL is a pointer to the web page in GitHub and not the data itself; hence when using pd.read_csv(), it will throw an error:
```
url = 'https://github.com/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./blob/main/datasets/Ch2/AirQualityUCI.csv'
pd.read_csv(url)
ParserError: Error tokenizing data. C error: Expected 1 fields in line 62, saw 2
```
Instead, you will need the raw content, which will give you a URL that looks like this: https://raw.githubusercontent.com/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./main/datasets/Ch2/AirQualityUCI.csv:

Figure 2.4 – The GitHub page for the CSV file. Note the View raw button

In Figure 2.4, notice that the values are not comma-separated (not a comma-delimited file); instead, the file uses semicolon (;) to separate the values.

The first column in the file is the Date column. You will need to parse (parse_date parameter) and convert it to DatetimeIndex (index_col parameter).

Pass the new URL to pandas.read_csv():

url = 'https://media.githubusercontent.com/media/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./main/datasets/Ch2/AirQualityUCI.csv'
date_parser = lambda x: pd.to_datetime(x, format="%d/%m/%Y")
df = pd.read_csv(url,
                 delimiter=';',
                 index_col='Date',
                 date_parser=date_parser)
df.iloc[:3,1:4]
>>
            CO(GT)  PT08.S1(CO)  NMHC(GT)
Date
2004-03-10     2.6      1360.00       150
2004-03-10     2.0      1292.25       112
2004-03-10     2.2      1402.00        88

We successfully ingested the data from the CSV file in GitHub into a DataFrame and printed the first three rows of select columns.

Reading data from a public S3 bucket

AWS supports virtual-hosted-style URLs such as https://bucket-name.s3.Region.amazonaws.com/keyname, path-style URLs such as https://s3.Region.amazonaws.com/bucket-name/keyname, and using S3://bucket/keyname. Here are examples of how these different URLs may look for our file:

A virtual hosted-style URL or an object URL: https://tscookbook.s3.us-east-1.amazonaws.com/AirQualityUCI.xlsx
A path-style URL: https://s3.us-east-1.amazonaws.com/tscookbook/AirQualityUCI.xlsx
An S3 protocol: s3://tscookbook/AirQualityUCI.csv

In this example, you will be reading the AirQualityUCI.xlsx file, which has only one sheet. It contains the same data as AirQualityUCI.csv, which we read earlier from GitHub.

Note that in the URL, you do not need to specify the region as us-east-1. The us-east-1 region, which represents US East (North Virginia), is an exception and will not be the case for other regions:

url = 'https://tscookbook.s3.amazonaws.com/AirQualityUCI.xlsx'
df = pd.read_excel(url,
                   index_col='Date',
                   parse_dates=True)

Read the same file using the S3:// URL:

s3uri = 's3://tscookbook/AirQualityUCI.xlsx'
df = pd.read_excel(s3uri,
                   index_col='Date',
                   parse_dates=True)

You may get an error such as the following:

ImportError: Install s3fs to access S3

This indicates that either you do not have the s3fs library installed or possibly you are not using the right Python/Conda environment.

Reading data from a private S3 bucket

When reading files from a private S3 bucket, you will need to pass your credentials to authenticate. A convenient parameter in many of the I/O functions in pandas is storage_options, which allows you to send additional content with the request, such as a custom header or required credentials to a cloud service.

You will need to pass a dictionary (key-value pair) to provide the additional information along with the request, such as username, password, access keys, and secret keys to storage_options as in {"username": username, "password": password}.

Now, you will read the AirQualityUCI.csv file, located in a private S3 bucket:

You will start by storing your AWS credentials in a config .cfg file outside your Python script. Then, use configparser to read the values and store them in Python variables. You do not want your credentials exposed or hardcoded in your code:
```
# Example aws.cfg file
[AWS]
aws_access_key=your_access_key
aws_secret_key=your_secret_key
```

You can load the aws.cfg file using config.read():

import configparser
config = configparser.ConfigParser()
config.read('aws.cfg')
AWS_ACCESS_KEY = config['AWS']['aws_access_key']
AWS_SECRET_KEY = config['AWS']['aws_secret_key']

The AWS Access Key ID and Secret Access Key are now stored in AWS_ACCESS_KEY and AWS_SECRET_KEY. Use pandas.read_csv() to read the CSV file and update the storage_options parameter by passing your credentials, as shown in the following code:

s3uri = "s3://tscookbook-private/AirQuality.csv"
df = pd.read_csv(s3uri,
                 index_col='Date',
                 parse_dates=True,
                 storage_options= {
                         'key': AWS_ACCESS_KEY,
                         'secret': AWS_SECRET_KEY
                     })
df.iloc[:3, 1:4]
>>
           CO(GT)  PT08.S1(CO)  NMHC(GT)
Date
2004-10-03      2.6       1360.0     150.0
2004-10-03      2.0       1292.0     112.0
2004-10-03      2.2       1402.0      88.0

Alternatively, you can use the AWS SDK for Python (Boto3) to achieve similar results. The boto3 Python library gives you more control and additional capabilities (beyond just reading and writing to S3). You will pass the same credentials stored earlier in AWS_ACCESS_KEY and AWS_SECRET_KEY and pass them to AWS, using boto3 to authenticate:
```
import boto3
bucket = "tscookbook-private"
client = boto3.client("s3",
                  aws_access_key_id =AWS_ACCESS_KEY,
                  aws_secret_access_key = AWS_SECRET_KEY)
```

Now, the client object has access to many methods specific to the AWS S3 service for creating, deleting, and retrieving bucket information, and more. In addition, Boto3 offers two levels of APIs: client and resource. In the preceding example, you used the client API.

The client is a low-level service access interface that gives you more granular control, for example, boto3.client("s3"). The resource is a higher-level object-oriented interface (an abstraction layer), for example, boto3.resource("s3").

In Chapter 4, Persisting Time Series Data to Files, you will explore the resource API interface when writing to S3. For now, you will use the client interface.

You will use the get_object method to retrieve the data. Just provide the bucket name and a key. The key here is the actual filename:

data = client.get_object(Bucket=bucket, Key='AirQuality.csv')
df = pd.read_csv(data['Body'],
                 index_col='Date',
                 parse_dates=True)
     
df.iloc[:3, 1:4]
>>
           CO(GT)  PT08.S1(CO)  NMHC(GT)
Date
2004-10-03    2,6       1360.0     150.0
2004-10-03      2       1292.0     112.0
2004-10-03    2,2       1402.0      88.0

When calling the client.get_object() method, a dictionary (key-value pair) is returned, as shown in the following example:

{'ResponseMetadata': {
'RequestId':'MM0CR3XX5QFBQTSG',
'HostId':'vq8iRCJfuA4eWPgHBGhdjir1x52Tdp80ADaSxWrL4Xzsr
VpebSZ6SnskPeYNKCOd/RZfIRT4xIM=',
'HTTPStatusCode':200,
'HTTPHeaders': {'x-amz-id-2': 'vq8iRCJfuA4eWPgHBGhdjir1x52
Tdp80ADaSxWrL4XzsrVpebSZ6SnskPeYNKCOd/RZfIRT4xIM=',
   'x-amz-request-id': 'MM0CR3XX5QFBQTSG',
   'date': 'Tue, 06 Jul 2021 01:08:36 GMT',
   'last-modified': 'Mon, 14 Jun 2021 01:13:05 GMT',
   'etag': '"2ce337accfeb2dbbc6b76833bc6f84b8"',
   'accept-ranges': 'bytes',
   'content-type': 'binary/octet-stream',
   'server': 'AmazonS3',
   'content-length': '1012427'},
   'RetryAttempts': 0},
   'AcceptRanges': 'bytes',
 'LastModified': datetime.datetime(2021, 6, 14, 1, 13, 5, tzinfo=tzutc()),
 'ContentLength': 1012427,
 'ETag': '"2ce337accfeb2dbbc6b76833bc6f84b8"',
 'ContentType': 'binary/octet-stream',
 'Metadata': {},
 'Body': <botocore.response.StreamingBody at 0x7fe9c16b55b0>}

The content you are interested in is in the response body under the Body key. You passed data['Body'] to the read_csv() function, which loads the response stream (StreamingBody) into a DataFrame.

Reading data from HTML

pandas offers an elegant way to read HTML tables and convert the content into a pandas DataFrame using the pandas.read_html() function:

In the following recipe, we will extract HTML tables from Wikipedia for COVID-19 pandemic tracking cases by country and by territory (https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory):
```
url = "https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory"
results = pd.read_html(url)
print(len(results))
>> 71
```
pandas.read_html() returned a list of DataFrames, one for each HTML table found in the URL. Keep in mind that the website's content is dynamic and gets updated regularly, and the results may vary. In our case, it returned 71 DataFrames. The DataFrame at index 15 contains summary on COVID-19 cases and deaths by region. Grab the DataFrame (at index 15) and assign it to the df variable, and print the returned columns:
```
df = results[15]
df.columns
>> Index(['Region[28]', 'Total cases', 'Total deaths', 'Cases per million',
       'Deaths per million', 'Current weekly cases', 'Current weekly deaths',
       'Population millions', 'Vaccinated %[29]'],
      dtype='object')
```

Display the first five rows for Total cases, Total deaths, and the Cases per million columns.

df[['Total cases', 'Total deaths', 'Cases per million']].head()
>>
     Total cases    Total deaths    Cases per million
0    139300788      1083815         311412
1    85476396       1035884         231765
2    51507114       477420          220454
3    56804073       1270477         132141
4    21971862       417507          92789

How it works…

Most of the pandas reader functions accept a URL as a path. Examples include the following:

pandas.read_csv()
pandas.read_excel()
pandas.read_parquet()
pandas.read_table()
pandas.read_pickle()
pandas.read_orc()
pandas.read_stata()
pandas.read_sas()
pandas.read_json()

The URL needs to be one of the valid URL schemes that pandas supports, which includes http and https, ftp, s3, gs, or the file protocol.

The read_html() function is great for scraping websites that contain data in HTML tables. It inspects the HTML and searches for all the <table> elements within the HTML. In HTML, table rows are defined with the <tr> </tr> tags and headers with the <th></th> tags. The actual data (cell) is contained within the <td> </td> tags. The read_html() function looks for <table>, <tr>, <th>, and <td> tags and converts the content into a DataFrame, and assigns the columns and rows as they were defined in the HTML. If an HTML page contains more than one <table></table> tag, read_html will return them all and you will get a list of DataFrames.

The following code demonstrates how pandas.read_html()works:

import pandas as pd
html = """
 <table>
   <tr>
     <th>Ticker</th>
     <th>Price</th>
   </tr>
   <tr>
     <td>MSFT</td>
     <td>230</td>
   </tr>
   <tr>
     <td>APPL</td>
     <td>300</td>
   </tr>
     <tr>
     <td>MSTR</td>
     <td>120</td>
   </tr>
 </table>
 </body>
 </html>
 """
   
df = pd.read_html(html)
df[0]
>>
  Ticker  Price
0   MSFT    230
1   APPL    300
2   MSTR    120

In the preceding code, the read_html() function parsed the HTML code and converted the HTML table into a pandas DataFrame. The headers between the <th> and </th> tags represent the column names of the DataFrame, and the content between the <tr><td> and </td></tr> tags represent the row data of the DataFrame. Note that if you go ahead and delete the <table> and </table> table tags, you will get the ValueError: No tables found error.

There's more…

The read_html() function has an optional attr argument, which takes a dictionary of valid HTML <table> attributes, such as id or class. For example, you can use the attr parameter to narrow down the tables returned to those that match the class attribute sortable as in <table class="sortable">. The read_html function will inspect the entire HTML page to ensure you target the right set of attributes.

In the previous exercise, you used the read_html function on the COVID-19 Wikipedia page, and it returned 71 tables (DataFrames). The number of tables will probably increase as time goes by as Wikipedia gets updated. You can narrow down the result set and guarantee some consistency by using the attr option. First, start by inspecting the HTML code using your browser. You will see that several of the <table> elements have multiple classes listed, such as sortable. You can look for other unique identifiers.

<table class="wikitable sortable mw-datatable covid19-countrynames jquery-tablesorter" id="thetable" style="text-align:right;">

Note, if you get the error html5lib not found, please install it you will need to install both html5lib and beautifulSoup4.

To install using conda, use the following:

conda install html5lib beautifulSoup4

To install using pip, use the following:

pip install html5lib beautifulSoup4

Now, let's use the sortable class and request the data again:

url = "https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory"
df = pd.read_html(url, attrs={'class': 'sortable'})
len(df)
>>  7
df[3].columns
>>
Index(['Region[28]', 'Total cases', 'Total deaths', 'Cases per million',
       'Deaths per million', 'Current weekly cases', 'Current weekly deaths',
       'Population millions', 'Vaccinated %[29]'],
      dtype='object')

The list returned a smaller subset of tables (from 71 down to 7).

Time Series Analysis with Python Cookbook

By : Tarek A. Atwan

Time Series Analysis with Python Cookbook

By: Tarek A. Atwan

Overview of this book

Related Content you might be interested in

Current Title:

Time Series Analysis with Python Cookbook

Practical Time Series Analysis

Forecasting Time Series Data with Facebook Prophet

Codeless Time Series Analysis with KNIME

Reading data from URLs

Getting ready

How to do it…

Reading data from GitHub

Reading data from a public S3 bucket

Reading data from a private S3 bucket

Reading data from HTML

How it works…

There's more…

See also