
Reading data from a SAS dataset

In this recipe, you will read a SAS data file, more specifically, a file with the SAS7BDAT extension. SAS is commercial statistical software that provides data mining, business intelligence, and advanced analytics capabilities. Many large organizations in various industries rely on SAS, so the need to read from a SAS dataset is very common.

Getting ready

In this recipe, you will be using pandas to read a .sas7bdat file. These files can be extremely large, and you will be introduced to different ways to read such files more efficiently.

To get ready, you can download the SAS sample dataset from http://support.sas.com/kb/61/960.html. You will be reading the DCSKINPRODUCT.sas7bdat file.

The SAS data file is also provided in the GitHub repository for this book.

How to do it…

You will use the pandas.read_sas() function, which can be used to read both SAS XPORT (.xpt) and SAS7BDAT file formats. However, there is no SAS writer function in pandas:

  1. Start by importing pandas and creating the path variable to the file. This file is not large (14.7 MB) compared to a typical SAS file, which can be 100+ GB:
    import pandas as pd
    path = '../../datasets/Ch2/DCSKINPRODUCT.sas7bdat'
  2. One of the advantages of using pandas is that it provides data structures for in-memory analysis, which is where its performance advantage comes from when analyzing data. On the other hand, this is also a constraint when loading large datasets: generally, the amount of data you can load is limited by the amount of memory available, which becomes a problem if the dataset exceeds it.

One way to tackle this issue is by using the chunksize parameter. The chunksize parameter is available in many reader and writer functions, including read_sas. The DCSKINPRODUCT.sas7bdat file contains 152,130 records, so you will use the chunksize parameter to read 10,000 records at a time:

df = pd.read_sas(path, chunksize=10000)
type(df)
>> pandas.io.sas.sas7bdat.SAS7BDATReader
  1. The returned object is not a DataFrame but a SAS7BDATReader object. You can think of it as an iterator that you can loop through; at each iteration or chunk, you get a DataFrame of 10,000 rows. You can retrieve the first chunk by calling Python's built-in next() function on the reader, that is, next(df), and each subsequent call returns the next batch or chunk (the next 10,000 rows); a short sketch of this follows after this step. You can also loop through the chunks, for example, to do some computations. This can be helpful when the dataset is too large to fit in memory, allowing you to iterate through manageable chunks to do some heavy aggregations. The following code demonstrates this concept:
    results = []
    for chunk in df:
        results.append(chunk)
    len(results)
    >> 16
    df = pd.concat(results)
    df.shape
    >> (152130, 5)

There were 16 chunks (DataFrames) in total; each of the first 15 chunks contained 10,000 records, and the last chunk contained the remaining 2,130. Using the concat function, you can combine all 16 DataFrames into one large DataFrame of 152,130 records.
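If you only need one chunk at a time rather than looping over all of them, the following minimal sketch shows how to advance the reader with next(). It assumes a fresh reader, since the reader above was exhausted by the loop:

reader = pd.read_sas(path, chunksize=10000)
first_chunk = next(reader)    # DataFrame with the first 10,000 rows
second_chunk = next(reader)   # DataFrame with the next 10,000 rows
first_chunk.shape
>> (10000, 5)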

  2. Reread the data in chunks, and this time group by DATE and aggregate using sum and count, as shown in the following:
    df = pd.read_sas(path, chunksize=10000)
    results = []
    for chunk in df:
        results.append(
            chunk.groupby('DATE')['Revenue']
                 .agg(['sum', 'count']))
  3. The results object is now a list of DataFrames. Let's examine the result set:
    results[0].loc['2013-02-10']
    >>
    sum      923903.0
    count        91.0
    Name: 2013-02-10 00:00:00, dtype: float64
    results[1].loc['2013-02-10']
    >>
    sum      8186392.0
    count         91.0
    Name: 2013-02-10 00:00:00, dtype: float64
    results[2].loc['2013-02-10']
    >>
    sum      5881396.0
    count         91.0
    Name: 2013-02-10 00:00:00, dtype: float64
  4. From the preceding output, you can observe that there is another issue to solve: the observations for 2013-02-10 were split across chunks. This is a common issue with chunking, since the data is split without regard to its order or grouping.

You can resolve this by combining the results in a meaningful way. For example, you can use the reduce function in Python. The reduce function allows you to perform a rolling computation (also known as folding or reducing) based on some function you provide. The following code demonstrates how this can be implemented:

from functools import reduce
final = reduce(lambda x1, x2: x1.add(x2, fill_value=0), results)
type(final)
>> pandas.core.frame.DataFrame
final.loc['2013-02-10']
>>
sum      43104420.0
count        1383.0
Name: 2013-02-10 00:00:00, dtype: float64
final.shape
>> (110, 2)

From the preceding output, you can see that the 16 DataFrames were reduced to a single DataFrame with one row per date (index). We leveraged the pandas.DataFrame.add() method to add the values, using zero (0) as a fill value when data is missing on either side.
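To make the index alignment and fill_value behavior of add() more concrete, here is a minimal, self-contained sketch using two small hypothetical partial aggregates whose indices only partially overlap:

import pandas as pd
a = pd.DataFrame({'sum': [100.0, 200.0], 'count': [2.0, 3.0]},
                 index=['2013-02-10', '2013-02-17'])
b = pd.DataFrame({'sum': [50.0], 'count': [1.0]},
                 index=['2013-02-10'])
# Indices are aligned; the date missing from b is treated as 0 instead of producing NaN
a.add(b, fill_value=0)
>>
              sum  count
2013-02-10  150.0    3.0
2013-02-17  200.0    3.0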

How it works…

Using the chunksize parameter in the read_sas() function does not return a DataFrame but rather an iterator (a SAS7BDATReader object). The chunksize parameter is available in most reader functions in pandas, such as read_csv, read_hdf, and read_sql, to name a few. Similarly, using the chunksize parameter with those functions will also return an iterator.

If chunksize is not specified, the returned object is a DataFrame of the entire dataset, since the parameter's default value is None in all the reader functions.
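For example, the same chunking pattern with read_csv looks like the following minimal sketch (the file name here is only a placeholder):

import pandas as pd
# With chunksize, read_csv returns a TextFileReader (an iterator), not a DataFrame
reader = pd.read_csv('some_large_file.csv', chunksize=10000)
# Process one manageable chunk at a time, for example, counting the total number of rows
total_rows = sum(len(chunk) for chunk in reader)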

Chunking is great when the operation or workflow is simple and not sequential. An operation such as groupby can be complex and tricky when chunking, which is why we added two extra steps:

  • Stored the resulting DataFrames in a list.
  • Used Python's reduce() function, which takes two arguments, a function and an iterable. It applies the function cumulatively, from left to right, to reduce the sequence down to a single result set. We also leveraged the DataFrame's add() method, which aligns DataFrame indices to perform element-wise addition.
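To see the left-to-right folding behavior of reduce() in isolation, here is a tiny sketch with plain numbers:

from functools import reduce
# Applied cumulatively from left to right: ((1 + 2) + 3) + 4
reduce(lambda x1, x2: x1 + x2, [1, 2, 3, 4])
>> 10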

There's more…

There are better options than pandas when working with large files, especially if you have memory constraints and cannot fit the entire dataset into memory. Chunking is a great option, but it still has overhead and relies on memory. The pandas library is a single-core framework and does not offer parallel computing capabilities. Instead, there are specialized libraries and frameworks for parallel processing designed to work with big data. Such frameworks do not rely on loading everything into memory and instead can utilize multiple CPU cores, spill to disk, or scale out to multiple worker nodes (think multiple machines). For example, Dask chunks your data, creates a computation graph, and parallelizes the smaller tasks (chunks) behind the scenes, thus speeding up overall processing and reducing memory overhead.
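For illustration, here is a minimal sketch of the same read-and-aggregate pattern written directly in Dask (assuming Dask is installed; it reuses the file and column from the Modin example later in this section):

import dask.dataframe as dd
# Dask splits the CSV into partitions and builds a lazy task graph
ddf = dd.read_csv('../../datasets/Ch2/large_file.csv')
# Nothing runs until compute() is called; the work is then parallelized across partitions
result = ddf.groupby('label_source').count().compute()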

These frameworks are great, but they require you to spend time learning the framework and rewriting your code to leverage their capabilities, so there is a steep learning curve initially. Luckily, this is where the Modin project comes into play. The Modin library acts as a wrapper or, more specifically, an abstraction on top of Dask or Ray that exposes an API similar to pandas. Modin makes optimizing your pandas code much more straightforward without learning another framework, and all it takes is a single line of code.

Before installing any library, it is highly advised that you create a separate virtual environment, for example, using conda. The concept and purpose behind creating virtual environments were discussed in detail in Chapter 1, Getting Started with Time Series Analysis, with multiple examples.
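As a quick reminder, a typical flow with conda looks like the following (the environment name and Python version are just examples):

>> conda create -n modin_env python=3.9
>> conda activate modin_env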

To install Modin using Conda (with a Dask backend), run the following:

>> conda install -c conda-forge modin-dask

To install with Pip, use the following:

>> pip install modin[dask]

You will measure the time and memory usage using pandas and again using Modin. To measure memory usage, you will need to install the memory_profiler library.

>> pip install memory_profiler

The memory_profiler library provides IPython and Jupyter magics such as %memit and %mprun, similar to known magics such as %timeit and %time.

Start by loading the required libraries:

import memory_profiler 
import pandas as pd
%load_ext memory_profiler
path = '../../datasets/Ch2/large_file.csv'

You will start by using pandas to read the file large_file.csv:

%%time
%memit pd.read_csv(path).groupby('label_source').count()

The preceding code should output something similar to the following:

peak memory: 161.35 MiB, increment: 67.34 MiB
CPU times: user 364 ms, sys: 95.2 ms, total: 459 ms
Wall time: 1.03 s

Now, you will load Modin and specify Dask as the engine:

from modin.config import Engine
Engine.put("dask")  # Modin will use Dask
import modin.pandas as pd
from distributed import Client
client = Client()

Notice in the preceding code that Modin provides its own pandas implementation (modin.pandas), imported here under the same alias. This way, you can leverage your existing pandas code without modification. You will now rerun the same code:

%%time
%memit pd.read_csv(path).groupby('label_source').count()

The preceding code should produce an output similar to the following:

peak memory: 137.12 MiB, increment: 9.34 MiB
CPU times: user 899 ms, sys: 214 ms, total: 1.11 s
Wall time: 1.91 s

Observe how the peak memory was reduced from 161.35 MiB with pandas to 137.12 MiB using Modin (Dask). Most importantly, notice how the memory increment went down from 67.34 MiB to 9.34 MiB with Modin. Overall, with Modin, you got lower memory usage; the wall time was actually slightly higher here because the dataset is small and the parallelization overhead dominates. However, Modin (Dask) will show more significant advantages with more extensive operations on larger datasets.

See also