In general, time series serve two purposes. First, they help us learn about the underlying process that generated the data. Second, we would like to be able to forecast future values of the same or related series from existing data. When we measure temperature, precipitation, or wind, we would like to learn more about complex phenomena such as the weather or the climate of a region and how various factors interact. At the same time, we might be interested in weather forecasting.
Python supports date and time handling in the `datetime` and `time` modules from the standard library:
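For example, a brief sketch using only the standard library:

```python
from datetime import datetime, timedelta

# Construct a point in time and do arithmetic with it
moment = datetime(2000, 1, 1, 12, 0)
later = moment + timedelta(days=1, hours=2)
print(later)  # 2000-01-02 14:00:00

# Parse and format date strings with explicit format specifiers
parsed = datetime.strptime("2000-01-02", "%Y-%m-%d")
print(parsed.strftime("%d.%m.%Y"))  # 02.01.2000
```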
Real-world data usually comes in all kinds of shapes, and it would be great if we did not need to remember the exact date format specifiers for parsing. Thankfully, Pandas abstracts away a lot of the friction when dealing with strings representing dates or times. One of these helper functions is `to_datetime`:
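A sketch of how this might look; `to_datetime` parses many common formats without any explicit format specifier:

```python
import pandas as pd

# Both strings denote the same point in time, in different notations
print(pd.to_datetime("2015-04-03"))   # 2015-04-03 00:00:00
print(pd.to_datetime("Apr 3, 2015"))  # 2015-04-03 00:00:00
```

Note that the return value is a Pandas `Timestamp`, not a plain string or `datetime`.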
This means they can be used interchangeably with standard `datetime` objects in many cases:
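For instance, a `Timestamp` compares equal to the corresponding `datetime`:

```python
import pandas as pd
from datetime import datetime

ts = pd.to_datetime("2000-01-01")
# pd.Timestamp is a subclass of datetime.datetime ...
print(isinstance(ts, datetime))    # True
# ... so comparisons with plain datetime objects just work
print(ts == datetime(2000, 1, 1))  # True
```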
There are a few things to note here: we create a list of timestamp objects and pass it to the series constructor as the index. This list of timestamps gets converted into a `DatetimeIndex` on the fly. If we had passed only the date strings, we would not get a `DatetimeIndex`, just an ordinary `Index`:
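The difference can be sketched as follows:

```python
import pandas as pd

timestamps = [pd.Timestamp("2000-01-01"),
              pd.Timestamp("2000-01-02"),
              pd.Timestamp("2000-01-03")]

# A list of timestamps becomes a DatetimeIndex on the fly
ts = pd.Series([1, 2, 3], index=timestamps)
print(type(ts.index))  # DatetimeIndex

# Plain date strings only yield an ordinary Index of strings
ts2 = pd.Series([1, 2, 3], index=["2000-01-01", "2000-01-02", "2000-01-03"])
print(type(ts2.index))
```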
Pandas offers another great utility function for this task: `date_range`.
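A minimal sketch, generating five consecutive calendar days:

```python
import pandas as pd

# Five consecutive calendar days, starting January 1, 2000
index = pd.date_range(start="2000-01-01", periods=5, freq="D")
print(index)
```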
The `freq` parameter allows us to specify a multitude of options. Pandas has been used successfully in finance and economics, not least because it makes it really simple to work with business dates as well. As an example, to get an index with the first three business days of the millennium, the `B` offset alias can be used:
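A sketch of this:

```python
import pandas as pd

# January 1, 2000 was a Saturday, so the first business day is Monday the 3rd
index = pd.date_range(start="2000-01-01", periods=3, freq="B")
print(index)  # 2000-01-03, 2000-01-04, 2000-01-05
```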
The following table shows the available offset aliases; they can also be looked up in the Pandas time series documentation at http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases:
| Alias | Description |
|---|---|
| `B` | Business day frequency |
| `C` | Custom business day frequency |
| `D` | Calendar day frequency |
| `W` | Weekly frequency |
| `M` | Month end frequency |
| `BM` | Business month end frequency |
| `CBM` | Custom business month end frequency |
| `MS` | Month start frequency |
| `BMS` | Business month start frequency |
| `CBMS` | Custom business month start frequency |
| `Q` | Quarter end frequency |
| `BQ` | Business quarter end frequency |
| `QS` | Quarter start frequency |
| `BQS` | Business quarter start frequency |
| `A` | Year end frequency |
| `BA` | Business year end frequency |
| `AS` | Year start frequency |
| `BAS` | Business year start frequency |
| `BH` | Business hour frequency |
| `H` | Hourly frequency |
| `T` | Minutely frequency |
| `S` | Secondly frequency |
| `L` | Milliseconds |
| `U` | Microseconds |
| `N` | Nanoseconds |
Moreover, the offset aliases can be used in combination. Here, we are generating a `datetime` index with five elements, each one day, one hour, one minute, and one second apart:
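A sketch of such a combined alias; note that recent Pandas versions prefer the lowercase forms `h`, `min`, and `s` over the older `H`, `T`, and `S`:

```python
import pandas as pd

index = pd.date_range(start="2000-01-01", periods=5, freq="1D1h1min1s")
# Consecutive elements are one day, one hour, one minute and one second apart
print(index[1] - index[0])
```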
A custom definition of what a business hour means is also possible:
We can use this custom business hour to build indexes as well:
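A sketch of both steps, assuming (for illustration) a business day that runs from 08:00 to 15:30:

```python
import pandas as pd

# Redefine what a business hour means: 08:00 to 15:30
business_hour = pd.offsets.BusinessHour(start="08:00", end="15:30")

# 2000-01-03 was a Monday; hourly steps stay within the custom business hours
index = pd.date_range(start="2000-01-03 08:00", periods=5, freq=business_hour)
print(index)
```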
Finally, we can merge various indexes of different frequencies. The possibilities are endless. We only show one example, where we combine two indexes – each over a decade – one pointing to every first business day of a year and one to the last day of February:
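A sketch using offset objects rather than string aliases (the anchored aliases were renamed between Pandas versions, so the objects are more portable): one index points at the first business day of each year, the other at the last day of February:

```python
import pandas as pd

# First business day of each year over a decade
first_bdays = pd.date_range("2000-01-01", periods=10,
                            freq=pd.offsets.BYearBegin())

# Last day of February each year (a year end anchored at month 2)
feb_ends = pd.date_range("2000-01-01", periods=10,
                         freq=pd.offsets.YearEnd(month=2))

combined = first_bdays.union(feb_ends)
print(len(combined))  # 20
```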
A possible output of this plot is shown in the following figure:
Just as with usual series objects, you can select parts and slice the index:
We can use date strings as keys, even though our series has a `DatetimeIndex`:
Access is similar to lookup in dictionaries or lists, but more powerful. We can, for example, slice with strings or even mixed objects:
To see all entries from March until May, inclusive:
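The selection and slicing described above can be sketched on a year of daily data; note that, unlike positional slicing, date-string slices include both endpoints:

```python
import pandas as pd

s = pd.Series(range(366),
              index=pd.date_range("2000-01-01", periods=366, freq="D"))

# A month given as a string selects all of its entries
march = s.loc["2000-03"]
print(len(march))  # 31

# String slices are inclusive on both ends: March through May
spring = s.loc["2000-03":"2000-05"]
print(len(spring))  # 92
```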
Time series can be shifted forward or backward in time. The index stays in place, the values move:
Resampling describes the process of frequency conversion over time series data. It is a helpful technique in various circumstances, as it fosters understanding by grouping and aggregating data. For example, from daily temperature data we can create a new time series that shows the average temperature per week or month. Conversely, real-world data may not be recorded at uniform intervals, so we need to map observations onto uniform intervals or fill in missing values for certain points in time. These are two of the main use cases of resampling: binning and aggregation, and filling in missing data. Downsampling and upsampling occur in other fields as well, such as digital signal processing. There, the process of downsampling is often called decimation and reduces the sample rate. The inverse process is called interpolation, where the sample rate is increased. We will look at both directions from a data analysis angle.
Downsampling reduces the number of samples in the data. During this reduction, we are able to apply aggregations over data points. Let's imagine a busy airport with thousands of people passing through every hour. The airport administration has installed a visitor counter in the main area, to get an impression of exactly how busy their airport is.
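A sketch with synthetic, seeded per-minute visitor counts for a single day; resampling to ten-minute bins sums them up:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
minutes = pd.date_range("2000-01-01", periods=24 * 60, freq="min")
visitors = pd.Series(rng.integers(0, 100, len(minutes)), index=minutes)

# Downsample: aggregate per-minute counts into ten-minute bins
per_ten_minutes = visitors.resample("10min").sum()
print(len(per_ten_minutes))  # 144 bins in one day
```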
Or we can reduce the sampling interval even more by resampling to an hourly interval:
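With the same synthetic per-minute counts, this could look like:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
minutes = pd.date_range("2000-01-01", periods=24 * 60, freq="min")
visitors = pd.Series(rng.integers(0, 100, len(minutes)), index=minutes)

# Downsample all the way to hourly totals
per_hour = visitors.resample("h").sum()
print(len(per_hour))  # 24
```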
We can ask for other things as well. For example, what was the maximum number of people that passed through our airport within one hour:
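Continuing the synthetic airport data, a sketch of that question:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
minutes = pd.date_range("2000-01-01", periods=24 * 60, freq="min")
visitors = pd.Series(rng.integers(0, 100, len(minutes)), index=minutes)

# Total visitors per hour, then pick the busiest hour
per_hour = visitors.resample("h").sum()
print(per_hour.max())
```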
If you specify the aggregation function by its string name, Pandas uses a highly optimized implementation.
While this metric might not be that valuable for our airport, we can compute it nonetheless:
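For instance, the hourly median of the synthetic per-minute counts:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
minutes = pd.date_range("2000-01-01", periods=24 * 60, freq="min")
visitors = pd.Series(rng.integers(0, 100, len(minutes)), index=minutes)

# The median per-minute count within each hour
per_hour_median = visitors.resample("h").median()
print(len(per_hour_median))  # 24
```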
In upsampling, the frequency of the time series is increased. As a result, we have more sample points than data points. One of the main questions is how to account for the entries in the series where we have no measurement.
Let's start with hourly data for a single day:
If we upsample to data points taken every 15 minutes, our time series will be extended with `NaN` values:
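A sketch of this, starting from 24 hourly measurements:

```python
import pandas as pd

hourly = pd.Series(range(24),
                   index=pd.date_range("2000-01-01", periods=24, freq="h"))

# Upsample to 15-minute intervals; the new points carry no measurement
upsampled = hourly.resample("15min").asfreq()
print(len(upsampled))          # 93 points from 00:00 to 23:00
print(upsampled.isna().sum())  # 69 of them are NaN
```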
With the `limit` parameter, it is possible to control the number of missing values to be filled:
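For example, forward-filling at most two of the three new points between measurements leaves one `NaN` per gap:

```python
import pandas as pd

hourly = pd.Series(range(24),
                   index=pd.date_range("2000-01-01", periods=24, freq="h"))

# Fill at most two consecutive missing values per gap
filled = hourly.resample("15min").ffill(limit=2)
print(filled.isna().sum())  # 23: one unfilled point per hourly gap
```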
If you want to adjust the labels during resampling, you can use the `loffset` keyword argument (removed in later Pandas versions in favor of shifting the resulting index directly):
There is another way to fill in missing values. We could employ an algorithm to construct new data points that would somehow fit the existing points, for some definition of somehow. This process is called interpolation.
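A sketch of linear interpolation on the upsampled hourly data:

```python
import pandas as pd

hourly = pd.Series(range(24),
                   index=pd.date_range("2000-01-01", periods=24, freq="h"))

# Construct the missing points on a straight line between measurements
interpolated = hourly.resample("15min").interpolate(method="linear")
print(interpolated.iloc[:5])  # 0.0, 0.25, 0.5, 0.75, 1.0
```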
While, by default, Pandas objects are time zone unaware, many real-world applications will make use of time zones. As with working with time in general, time zones are no trivial matter: do you know which countries have daylight saving time, and do you know when the time zone is switched in those countries? Thankfully, Pandas builds on the time zone capabilities of two popular and proven utility libraries for time and date handling: `pytz` and `dateutil`:
To supply time zone information, you can use the `tz` keyword argument:
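For instance, to create a time zone aware timestamp:

```python
import pandas as pd

# The timestamp now carries a UTC offset for the given zone
t = pd.Timestamp("2000-01-01 12:00", tz="Europe/Berlin")
print(t)
print(t.tzinfo is not None)  # True
```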
This works for date ranges as well:
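For example:

```python
import pandas as pd

# Every element of the index carries the same time zone
index = pd.date_range("2000-01-01", periods=3, freq="D", tz="UTC")
print(index.tz)
```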
Time zone objects can also be constructed beforehand:
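A sketch using `pytz`, which Pandas has historically depended on:

```python
import pandas as pd
import pytz

# Construct the time zone object first, then pass it to Pandas
berlin = pytz.timezone("Europe/Berlin")
t = pd.Timestamp("2000-01-01 12:00", tz=berlin)
print(t.tzinfo is not None)  # True
```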
To move a time zone aware object to another time zone, you can use the `tz_convert` method:
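For instance, converting noon UTC to US Eastern time:

```python
import pandas as pd

t = pd.Timestamp("2000-01-01 12:00", tz="UTC")
eastern = t.tz_convert("US/Eastern")
# Same instant, different wall-clock time (UTC-5 in January)
print(eastern.hour)  # 7
```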
Along with the powerful timestamp object, which acts as a building block for the `DatetimeIndex`, there is another useful data structure, which was introduced in Pandas 0.15 – the `Timedelta`. The `Timedelta` can serve as a basis for indices as well, in this case a `TimedeltaIndex`.
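A brief sketch of both:

```python
import pandas as pd

# A duration rather than a point in time
delta = pd.Timedelta("1 days 2 hours")
print(delta)

# Timedeltas can back an index of their own
index = pd.timedelta_range(start="0 hours", periods=4, freq="6h")
print(type(index))  # TimedeltaIndex
```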
Pandas comes with great support for plotting, and this holds true for time series data as well.
As a first example, let's take some monthly data and plot it:
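A sketch with synthetic, seeded monthly data (the plot call itself requires matplotlib to be installed):

```python
import numpy as np
import pandas as pd

# Ten years of synthetic monthly observations, as a random walk
index = pd.date_range("2000-01-01", periods=120, freq="MS")
s = pd.Series(np.random.default_rng(0).standard_normal(120).cumsum(),
              index=index)

# With matplotlib installed:
# s.plot()
```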
The following figure shows an example time series plot:
We can overlay aggregate plots resampled over 2 and 5 years:
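The aggregation underlying such an overlay can be sketched as follows; the offset objects sidestep the anchored string aliases that were renamed between Pandas versions:

```python
import numpy as np
import pandas as pd

index = pd.date_range("2000-01-01", periods=120, freq="MS")
s = pd.Series(np.random.default_rng(0).standard_normal(120).cumsum(),
              index=index)

# Mean over two- and five-year bins
two_year = s.resample(2 * pd.offsets.YearBegin()).mean()
five_year = s.resample(5 * pd.offsets.YearBegin()).mean()
print(len(two_year), len(five_year))  # 5 2
# Overlaying (with matplotlib): s.plot(); two_year.plot(); five_year.plot()
```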
The following figure shows the resampled 2-year plot:
The following figure shows the resampled 5-year plot:
We can pass the kind of chart to the `plot` method as well. The return value of the `plot` method is an `AxesSubplot`, which allows us to customize many aspects of the plot. Here, we are setting the label values on the x axis to the year values from our time series:
Let's imagine we have four time series that we would like to plot simultaneously. We generate a matrix of 1000 × 4 random values and treat each column as a separate time series:
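A sketch with seeded random walks over a shared daily index:

```python
import numpy as np
import pandas as pd

# Four random walks, one per column, sharing a single DatetimeIndex
index = pd.date_range("2000-01-01", periods=1000, freq="D")
df = pd.DataFrame(np.random.default_rng(0).standard_normal((1000, 4)),
                  index=index, columns=["A", "B", "C", "D"]).cumsum()
print(df.shape)  # (1000, 4)
# With matplotlib installed: df.plot(subplots=True)
```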