Book Image

Machine Learning for Time-Series with Python

By : Ben Auffarth
Book Image

Machine Learning for Time-Series with Python

By: Ben Auffarth

Overview of this book

The Python time-series ecosystem is huge and often quite hard to get a good grasp on, especially for time-series since there are so many new libraries and new models. This book aims to deepen your understanding of time series by providing a comprehensive overview of popular Python time-series packages and help you build better predictive systems. Machine Learning for Time-Series with Python starts by re-introducing the basics of time series and then builds your understanding of traditional autoregressive models as well as modern non-parametric models. By observing practical examples and the theory behind them, you will become confident with loading time-series datasets from any source, deep learning models like recurrent neural networks and causal convolutional network models, and gradient boosting with feature engineering. This book will also guide you in matching the right model to the right problem by explaining the theory behind several useful models. You’ll also have a look at real-world case studies covering weather, traffic, biking, and stock market data. By the end of this book, you should feel at home with effectively analyzing and applying machine learning methods to time-series.
Table of Contents (15 chapters)
13
Other Books You May Enjoy
14
Index

What is time-series analysis?

The term time-series analysis (TSA) refers to the statistical approach to time-series or the analysis of trend and seasonality. It is often an ad hoc exploration and analysis that usually involves visualizing distributions, trends, cyclic patterns, and relationships between features, and between features and the target(s).

More generally, we can say TSA is roughly exploratory data analysis (EDA) that's specific to time-series data. This comparison can be misleading however since TSA can include both descriptive and exploratory elements.

Let's see quickly the differences between descriptive and exploratory analysis:

  • Descriptive analysis summarizes characteristics of a dataset
  • Exploratory analysis analyzes for patterns, trends, or relationships between variables

Therefore, TSA is the initial investigation of a dataset with the goal of discovering patterns, especially trend and seasonality, and obtaining initial insights, testing hypotheses, and extracting meaningful summary statistics.

Definition: Time-Series Analysis (TSA) is the process of extracting a summary and other statistical information from time-series, most importantly, the analysis of trend and seasonality.

Since an important part of TSA is gathering statistics and representing your dataset graphically through visualization, we'll do a lot of plots in this chapter. Many statistics and plots described in this chapter are specific to TSA, so even if you are familiar with EDA, you'll find something new.

A part of TSA is collecting and reviewing data, examining the distribution of variables (and variable types), and checking for errors, outliers, and missing values. Some errors, variable types, and anomalies can be corrected, therefore EDA is often performed hand in hand with preprocessing and feature engineering, where columns and fields are selected and transformed. The whole process from data loading to machine learning is highly iterative and may involve multiple instances of TSA at different points.

Here are a few crucial steps for working with time-series:

  • Importing the dataset
  • Data cleaning
  • Understanding variables
  • Uncovering relationships between variables
  • Identifying trend and seasonality
  • Preprocessing (including feature engineering)
  • Training a machine learning model

Importing the data can be considered prior to TSA, and data cleaning, feature engineering, and training a machine learning model are not strictly part of TSA.

Importing the data includes parsing, for example extracting dates. The three steps that are central to TSA are understanding variables, uncovering relationships between variables, and identifying trend and seasonality. There's a lot more to say about each of them, and in this chapter, we'll talk about them in more detail in their dedicated sections.

The steps belonging to TSA and leading to preprocessing (feature engineering) and machine learning are highly iterative, and can be visually appreciated in the following time-series machine learning flywheel:

Flywheel%20-%20page%201.png

Figure 2.1: The time-series machine learning flywheel

This flywheel emphasizes the iterative nature of the work. For example, data cleaning comes often after loading the data, but will come up again after we've made another discovery about our variables. I've highlighted TSA in dark, while steps that are not strictly part of TSA are grayed out.

Let's go through something practical! We'll start by loading a dataset. Right after importing the data, we'd ask questions like what's the size of the dataset (the number of observations)? How many features or columns do we have? What are the column types?

We'll typically look at histograms or distribution plots. For assessing relationships between features and target variables, we'd calculate correlations and visualize them as a correlation heatmap, where the correlation strength between variables is mapped to colors.

We'd look for missing values – in a spreadsheet, these would be empty cells – and we'd clean up and correct these irregularities, where possible.

We are going to be analyzing relationships between variables, and in TSA, one of its peculiarities is that we need to investigate the relationship of time with each variable.

Generally, a useful way of distinguishing different types of techniques could be between univariate and multivariate analysis, and between graphical and non-graphical techniques. Univariate analysis means we are looking at a single variable. This means we could be inspecting values to get the means and the variance, or – for the graphical side – plotting the distribution. We summarize these techniques in the Understanding the variables section.

On the other hand, multivariate analysis means we are calculating correlations between variables, or – for the graphical side – drawing a scatter plot, for example. We'll delve into these techniques in the Uncovering relationships between variables section.

Before we continue, let's go through a bit of the basics of time-series with Python. This will cover the basic operations with time-series data as an introduction. After this, we'll go through Python commands with an actual dataset.