Book Image

Machine Learning for Time-Series with Python

By : Ben Auffarth
Book Image

Machine Learning for Time-Series with Python

By: Ben Auffarth

Overview of this book

The Python time-series ecosystem is huge and often quite hard to get a good grasp on, especially for time-series since there are so many new libraries and new models. This book aims to deepen your understanding of time series by providing a comprehensive overview of popular Python time-series packages and help you build better predictive systems. Machine Learning for Time-Series with Python starts by re-introducing the basics of time series and then builds your understanding of traditional autoregressive models as well as modern non-parametric models. By observing practical examples and the theory behind them, you will become confident with loading time-series datasets from any source, deep learning models like recurrent neural networks and causal convolutional network models, and gradient boosting with feature engineering. This book will also guide you in matching the right model to the right problem by explaining the theory behind several useful models. You’ll also have a look at real-world case studies covering weather, traffic, biking, and stock market data. By the end of this book, you should feel at home with effectively analyzing and applying machine learning methods to time-series.
Table of Contents (15 chapters)
13
Other Books You May Enjoy
14
Index

Working with time-series in Python

Python has a lot of libraries and packages for time-series, such as datetime, time, calendar, dateutil, and pytz, which can be highly confusing for beginners. At the same time, there are many different data types like date, time, datetime, tzinfo, timedelta, relativedelta, and more.

When it comes to using them, the devil is in the details. Just to name one example: many of these types are insensitive to the timezone. You should feel reassured, however, knowing that to get started, familiarity with a small subset of these libraries and data types is enough.

Requirements

In this chapter, we'll use several libraries, which we can quickly install from the terminal (or similarly from Anaconda Navigator):

pip install -U dython scipy numpy pandas seaborn scikit-learn

We'll execute the commands from the Python (or IPython) terminal, but equally we could execute them from a Jupyter notebook (or a different environment).

It's a good start if we at least know datetime and pandas, two very prominent libraries, which we'll cover in the following two sections. We'll create basic objects and do simple manipulations on them.

Datetime

The date and datetime data types are not primitive types in Python the way that numbers (float and int), string, list, dictionary, tuple, or file are. To work with date and datetime objects, we have to import datetime, a library that is part of the Python Standard Library, and the libraries that come by default with CPython and other main Python distributions.

datetime comes with objects such as date, datetime, time, and timedelta, among others. The difference between datetime and date objects is that the datetime object includes time information in addition to a date.

To get a date, we can do this:

from datetime import date

To get today's date:

today = date.today()

To get some other date:

other_date = date(2021, 3, 24)

If we want a datetime object (a timestamp) instead, we can do this as well:

from datetime import datetime
now = datetime.now()

This will get the current timestamp. We can create a datetime for a specific date and time as well:

some_date = datetime(2021, 5, 18, 15, 39, 0)
some_date.isoformat()

We can get a string output in isoformat:

'2021-05-18T15:39:00'

isoformat, short for the ISO 8601 format, is an international standard for representing dates and times.

We can also work with time differences using timedelta:

from datetime import timedelta 
year = timedelta(days=365)

These timedelta objects can be added to other objects for calculations. We can do calculations with a timedelta object, for example:

year * 10

This should give us the following output:

datetime.timedelta(days=3650) 

The datetime library can parse string inputs to date and datetime types and output these objects as string:

from datetime import date
some_date = date.fromisoformat('2021-03-24')

Or:

some_date = datetime.date(2021, 3, 24)

We can format the output with string format options, for example like this:

some_date.strftime('%A %d. %B %Y')

This would give us:

'Wednesday 24. March 2021'

Similarly, we can read in a date or datetime object from a string, and we can use the same format options:

from datetime import datetime
dt = datetime.strptime('24/03/21 15:48', '%d/%m/%y %H:%M')

You can find a complete list of formatting options that you can use both for parsing strings and printing datetime objects here: https://strftime.org/.

A few important ones are listed in this table:

Format string

Meaning

%Y

Year as 4 digits

%y

Year as 2 digits

%m

Month as a number

%d

Day

%H

Hour as 2 digits

%M

Minute as 2 digits

Figure 2.2: Format strings for dates

It's useful to remember these strings with formatting options. For example, the format string for a US date separated by slashes would look like this:

'%d/%m/%Y'

pandas

We introduced the pandas library in the previous chapter. pandas is one of the most important libraries in the Python ecosystem for data science, used for data manipulation and analysis. Initially released in 2008, it has been a major driver of Python's success.

pandas comes with significant time-series functionality such as date range generation, frequency conversion, moving window statistics, date shifting, and lagging.

Let's go through some of these basics. We can create a time-series as follows:

import pandas as pd
pd.date_range(start='2021-03-24', end='2021-09-01')

This gives us a DateTimeIndex like this:

DatetimeIndex(['2021-03-24', '2021-03-25', '2021-03-26', '2021-03-27',
               '2021-03-28', '2021-03-29', '2021-03-30', '2021-03-31',
               '2021-04-01', '2021-04-02',
               ...
               '2021-08-23', '2021-08-24', '2021-08-25', '2021-08-26',
               '2021-08-27', '2021-08-28', '2021-08-29', '2021-08-30',
               '2021-08-31', '2021-09-01'],
              dtype='datetime64[ns]', length=162, freq='D') 

We can also create a time-series as follows:

pd.Series(pd.date_range("2021", freq="D", periods=3))

This would give us a time-series like this:

0   2021-01-01
1   2021-01-02
2   2021-01-03
dtype: datetime64[ns]

As you can see, this type is called a DatetimeIndex. This means we can use this data type for indexing a dataset.

One of the most important functionalities is parsing to date or datetime objects from either string or separate columns:

import pandas as pd
df = pd.DataFrame({'year': [2021, 2022],
    'month': [3, 4],
    'day': [24, 25]}
)
ts1 = pd.to_datetime(df)
ts2 = pd.to_datetime('20210324', format='%Y%m%d')

We've created two time-series.

You can take a rolling window for calculations like this:

s = pd.Series([1, 2, 3, 4, 5])
s.rolling(3).sum()

Can you guess the result of this? If not, why don't you put this into your Python interpreter?

A time-series would usually be an index with a time object and one or more columns with numeric or other types, such as this:

import numpy as np 
rng = pd.date_range('2021-03-24', '2021-09-01', freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)

We can have a look at our time-series:

2021-03-24   -2.332713
2021-03-25    0.177074
2021-03-26   -2.136295
2021-03-27    2.992240
2021-03-28   -0.457537
                 ...
2021-08-28   -0.705022
2021-08-29    1.089697
2021-08-30    0.384947
2021-08-31    1.003391
2021-09-01   -1.021058
Freq: D, Length: 162, dtype: float64

We can index these time-series datasets like any other pandas Series or DataFrame. ts[:2].index would give us:

DatetimeIndex(['2021-03-24', '2021-03-25'], dtype='datetime64[ns]', freq='D')

Interestingly, we can index directly with strings or datetime objects. For example, ts['2021-03-28':'2021-03-30'] gives us:

2021-03-28   -0.457537
2021-03-29   -1.089423
2021-03-30   -0.708091
Freq: D, dtype: float64

You can shift or lag the values in a time-series back and forward in time using the shift method. This changes the alignment of the data:

ts.shift(1)[:5]

We can also change the resolution of time-series objects, for example like this:

ts.asfreq('M')

Please note the difference between datetime and pd.DateTimeIndex. Even though they encode the same kind of information, they are different types and they might not always play well with each other. Therefore, I'd recommend to always explicitly convert types when doing comparisons.

In the next section, let's go through a basic example of importing a time-series dataset, getting summary statistics, and plotting some variables.