Python: Data Analytics and Visualization

By: Martin Czygan, Phuong Vo.T.H, Ashish Kumar, Kirthi Raman

Overview of this book

You will start the course with an introduction to the principles of data analysis and supported libraries, along with NumPy basics for statistics and data processing. Next, you will get an overview of the Pandas package and use its powerful features to solve data-processing problems. Moving on, you will get a brief overview of the Matplotlib API. Next, you will learn to manipulate time and data structures, and to load and store data in a file or database using Python packages. You will learn, through examples, how to apply powerful Python packages to turn raw data into clean and useful data. You will also get a brief overview of machine learning algorithms, that is, applying data analysis results to make decisions or build helpful products such as recommendations and predictions using scikit-learn.

After this, you will move on to a data analytics specialization: predictive analytics. Social media and IoT have resulted in an avalanche of data. You will get started with predictive analytics using Python. You will see how to create predictive models from data. You will get balanced information on statistical and mathematical concepts, and implement them in Python using libraries such as Pandas, scikit-learn, and NumPy. You will learn more about the best predictive modeling algorithms, such as Linear Regression, Decision Tree, and Logistic Regression. Finally, you will master best practices in predictive modeling.

After this, you will get all the practical guidance you need to help you on the journey to effective data visualization. Starting with a chapter on data frameworks, which explains the transformation of data into information and eventually knowledge, this path subsequently covers the complete visualization process using the most popular Python libraries with working examples.

This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products:

• Getting Started with Python Data Analysis, Phuong Vo.T.H & Martin Czygan
• Learning Predictive Analytics with Python, Ashish Kumar
• Mastering Python Data Visualization, Kirthi Raman

Chapter 5. Time Series

Time series typically consist of a sequence of data points coming from measurements taken over time. This kind of data is very common and occurs in a multitude of fields.

A business executive is interested in stock prices, prices of goods and services, or monthly sales figures. A meteorologist takes temperature measurements several times a day and also keeps records of precipitation, humidity, wind direction, and wind force. A neurologist can use electroencephalography to measure the electrical activity of the brain along the scalp. A sociologist can use campaign contribution data to learn about political parties and their supporters, and use these insights to support an argument. The list of examples of time series data could be extended almost endlessly.

Time series primer

In general, time series serve two purposes. First, they help us to learn about the underlying process that generated the data. Second, we would like to be able to forecast future values of the same or related series using existing data. When we measure temperature, precipitation, or wind, we would like to learn about more complex things, such as the weather or the climate of a region, and how various factors interact. At the same time, we might be interested in weather forecasting.

In this chapter we will explore the time series capabilities of Pandas. Apart from its powerful core data structures – the Series and the DataFrame – Pandas comes with helper functions for dealing with time-related data. With its extensive built-in optimizations, Pandas is capable of handling large time series with millions of data points with ease.

We will gradually approach time series, starting with the basic building blocks of date and time objects.

Working with date and time objects

Python supports date and time handling in the datetime and time modules from the standard library:

>>> import datetime
>>> datetime.datetime(2000, 1, 1)
datetime.datetime(2000, 1, 1, 0, 0)

Sometimes, dates are given or expected as strings, so a conversion from or to strings is necessary, which is realized by two functions: strptime and strftime, respectively:

>>> datetime.datetime.strptime("2000/1/1", "%Y/%m/%d")
datetime.datetime(2000, 1, 1, 0, 0)
>>> datetime.datetime(2000, 1, 1, 0, 0).strftime("%Y%m%d")
'20000101'
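
As a small sketch of the round trip just described, the same kind of format codes drive both directions. The names raw, fmt_in, and iso_text are illustrative and do not appear in the original examples:

```python
import datetime

# Hypothetical input string and format; %Y, %m and %d are standard format codes.
raw = "2000/1/1"
fmt_in = "%Y/%m/%d"

# strptime: string -> datetime
parsed = datetime.datetime.strptime(raw, fmt_in)

# strftime: datetime -> string, here with dashes and zero padding
iso_text = parsed.strftime("%Y-%m-%d")

# Parsing the formatted text again yields the same datetime object.
round_trip = datetime.datetime.strptime(iso_text, "%Y-%m-%d")
print(parsed, iso_text, round_trip == parsed)
```

The format string has to describe the text exactly; a mismatch raises a ValueError rather than guessing.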

Real-world data usually comes in all kinds of shapes, and it would be great if we did not need to remember the exact date format specifiers for parsing. Thankfully, Pandas abstracts away a lot of the friction when dealing with strings representing dates or times. One of these helper functions is to_datetime:

>>> import pandas as pd
>>> import numpy as np
>>> pd.to_datetime("4th of July")
Timestamp('2015-07-04 00:00:00')
>>> pd.to_datetime("13.01.2000")
Timestamp('2000-01-13 00:00:00')
>>> pd.to_datetime("7/8/2000")
Timestamp('2000-07-08 00:00:00')

The last can refer to August 7th or July 8th, depending on the region. To disambiguate this case, to_datetime can be passed a keyword argument dayfirst:

>>> pd.to_datetime("7/8/2000", dayfirst=True)
Timestamp('2000-08-07 00:00:00')
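
When the layout of the input is known in advance, an explicit format string is another way to remove the ambiguity. The following sketch compares the three variants; the variable names are illustrative:

```python
import pandas as pd

ambiguous = "7/8/2000"

# Default interpretation: month first.
month_first = pd.to_datetime(ambiguous)

# dayfirst=True flips the interpretation.
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# An explicit format string leaves no room for guessing.
explicit = pd.to_datetime(ambiguous, format="%d/%m/%Y")

print(month_first.date(), day_first.date(), explicit.date())
```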

Timestamp objects can be seen as Pandas' version of datetime objects and indeed, the Timestamp class is a subclass of datetime:

>>> issubclass(pd.Timestamp, datetime.datetime)
True

This means that they can be used interchangeably in many cases:

>>> ts = pd.to_datetime(946684800000000000)
>>> ts.year, ts.month, ts.day, ts.weekday()
(2000, 1, 1, 5)
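
A brief sketch of what this interchangeability looks like in practice, using only the standard library alongside Pandas (the variable names are illustrative):

```python
import datetime
import pandas as pd

stamp = pd.Timestamp("2000-01-01")
plain = datetime.datetime(2000, 1, 1)

# A Timestamp compares equal to the matching datetime ...
same_moment = stamp == plain

# ... accepts standard-library timedeltas in arithmetic ...
next_day = stamp + datetime.timedelta(days=1)

# ... and can be converted to a plain datetime when another library insists on one.
as_datetime = stamp.to_pydatetime()

print(same_moment, next_day, type(as_datetime).__name__)
```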

Timestamp objects are an important part of the time series capabilities of Pandas, since timestamps are the building blocks of DatetimeIndex objects:

>>> index = [pd.Timestamp("2000-01-01"),
             pd.Timestamp("2000-01-02"),
             pd.Timestamp("2000-01-03")]
>>> ts = pd.Series(np.random.randn(len(index)), index=index)
>>> ts
2000-01-01    0.731897
2000-01-02    0.761540
2000-01-03   -1.316866
dtype: float64
>>> ts.index
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03'], dtype='datetime64[ns]', freq=None, tz=None)

There are a few things to note here: We create a list of timestamp objects and pass it to the series constructor as index. This list of timestamps gets converted into a DatetimeIndex on the fly. If we had passed only the date strings, we would not get a DatetimeIndex, just an index:

>>> ts = pd.Series(np.random.randn(len(index)), index=[
              "2000-01-01", "2000-01-02", "2000-01-03"])
>>> ts.index
Index([u'2000-01-01', u'2000-01-02', u'2000-01-03'], dtype='object')

However, the to_datetime function is flexible enough to be of help, if all we have is a list of date strings:

>>> index = pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03"])
>>> ts = pd.Series(np.random.randn(len(index)), index=index)
>>> ts.index
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03'], dtype='datetime64[ns]', freq=None, tz=None)

Another thing to note is that while we have a DatetimeIndex, the freq and tz attributes are both None. We will learn about the utility of both attributes later in this chapter.
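
If a series has already been built with date strings as its index, the index can be converted after the fact. The following is a minimal sketch with made-up values; ts_str and ts_dt are illustrative names:

```python
import numpy as np
import pandas as pd

# A series whose index holds date strings (a plain Index, not a DatetimeIndex).
values = np.arange(3)
ts_str = pd.Series(values, index=["2000-01-01", "2000-01-02", "2000-01-03"])

# Replace the string index with a parsed DatetimeIndex on a copy.
ts_dt = ts_str.copy()
ts_dt.index = pd.to_datetime(ts_dt.index)

print(type(ts_str.index).__name__, "->", type(ts_dt.index).__name__)
```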

With to_datetime we are able to convert a variety of strings and even lists of strings into timestamp or DatetimeIndex objects. Sometimes we are not explicitly given all the information about a series and we have to generate sequences of time stamps of fixed intervals ourselves.

Pandas offers another great utility function for this task: date_range.

The date_range function helps to generate a fixed frequency datetime index between start and end dates. It is also possible to specify either the start or end date and the number of timestamps to generate.

The frequency can be specified by the freq parameter, which supports a number of offsets. You can use typical time intervals like hours, minutes, and seconds:

>>> pd.date_range(start="2000-01-01", periods=3, freq='H')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:00:00', '2000-01-01 02:00:00'], dtype='datetime64[ns]', freq='H', tz=None)
>>> pd.date_range(start="2000-01-01", periods=3, freq='T')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:01:00', '2000-01-01 00:02:00'], dtype='datetime64[ns]', freq='T', tz=None)
>>> pd.date_range(start="2000-01-01", periods=3, freq='S')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:00:01', '2000-01-01 00:00:02'], dtype='datetime64[ns]', freq='S', tz=None)
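
The examples above all combine a start date with a number of periods. The other two combinations mentioned earlier, start plus end and end plus periods, can be sketched as follows (the variable names are illustrative):

```python
import pandas as pd

# Start and end given: every calendar day in between, both ends included.
by_bounds = pd.date_range(start="2000-01-01", end="2000-01-05", freq="D")

# Only the end and a count given: the range is built backwards from the end.
by_end = pd.date_range(end="2000-01-05", periods=3, freq="D")

print(list(by_bounds.strftime("%Y-%m-%d")))
print(list(by_end.strftime("%Y-%m-%d")))
```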

The freq parameter allows us to specify a multitude of options. Pandas has been used successfully in finance and economics, not least because it is really simple to work with business dates as well. As an example, to get an index with the first three business days of the millennium, the B offset alias can be used:

>>> pd.date_range(start="2000-01-01", periods=3, freq='B')
DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05'], dtype='datetime64[ns]', freq='B', tz=None)

The following table shows the available offset aliases, which can also be looked up in the Pandas documentation on time series under http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases:

Alias	Description
B	Business day frequency
C	Custom business day frequency
D	Calendar day frequency
W	Weekly frequency
M	Month end frequency
BM	Business month end frequency
CBM	Custom business month end frequency
MS	Month start frequency
BMS	Business month start frequency
CBMS	Custom business month start frequency
Q	Quarter end frequency
BQ	Business quarter end frequency
QS	Quarter start frequency
BQS	Business quarter start frequency
A	Year end frequency
BA	Business year end frequency
AS	Year start frequency
BAS	Business year start frequency
BH	Business hour frequency
H	Hourly frequency
T	Minutely frequency
S	Secondly frequency
L	Milliseconds
U	Microseconds
N	Nanoseconds

Moreover, the offset aliases can be used in combination as well. Here, we are generating a datetime index with five elements, each one day, one hour, one minute and ten seconds apart:

>>> pd.date_range(start="2000-01-01", periods=5, freq='1D1h1min10s')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-02 01:01:10', '2000-01-03 02:02:20', '2000-01-04 03:03:30', '2000-01-05 04:04:40'], dtype='datetime64[ns]', freq='90070S', tz=None)
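
The combined alias in this example amounts to a fixed step of 90,070 seconds, which is what the freq field in the output reports. As a sketch, the same step can be written as an explicit Timedelta and passed to date_range directly; step and idx are illustrative names:

```python
import pandas as pd

# The combined step from the example, written as an explicit Timedelta.
step = pd.Timedelta(days=1, hours=1, minutes=1, seconds=10)

# date_range also accepts a Timedelta as its freq argument.
idx = pd.date_range(start="2000-01-01", periods=5, freq=step)

# 86400 + 3600 + 60 + 10 seconds
print(step.total_seconds())
print(idx[-1])
```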

If we want to index data every 12 hours of our business time, which by default starts at 9 AM and ends at 5 PM, we simply prefix the BH alias with the number 12:

>>> pd.date_range(start="2000-01-01", periods=5, freq='12BH')
DatetimeIndex(['2000-01-03 09:00:00', '2000-01-04 13:00:00', '2000-01-06 09:00:00', '2000-01-07 13:00:00', '2000-01-11 09:00:00'], dtype='datetime64[ns]', freq='12BH', tz=None)

A custom definition of what a business hour means is also possible. The start and end values used here, 7 AM and 7 PM, are inferred from the output of the next example:

>>> from pandas.tseries.offsets import BusinessHour
>>> bh = BusinessHour(start='07:00', end='19:00')

We can use this custom business hour to build indexes as well:

>>> pd.date_range(start="2000-01-01", periods=5, freq=12 * bh)
DatetimeIndex(['2000-01-03 07:00:00', '2000-01-03 19:00:00', '2000-01-04 07:00:00', '2000-01-04 19:00:00', '2000-01-05 07:00:00', '2000-01-05 19:00:00', '2000-01-06 07:00:00'], dtype='datetime64[ns]', freq='12BH', tz=None)

Some frequencies allow us to specify an anchoring suffix, which allows us to express intervals, such as every Friday or every second Tuesday of the month:

>>> pd.date_range(start="2000-01-01", periods=5, freq='W-FRI')
DatetimeIndex(['2000-01-07', '2000-01-14', '2000-01-21', '2000-01-28', '2000-02-04'], dtype='datetime64[ns]', freq='W-FRI', tz=None)
>>> pd.date_range(start="2000-01-01", periods=5, freq='WOM-2TUE')
DatetimeIndex(['2000-01-11', '2000-02-08', '2000-03-14', '2000-04-11', '2000-05-09'], dtype='datetime64[ns]', freq='WOM-2TUE', tz=None)
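
One way to convince ourselves that the anchored aliases do what we expect is to inspect the weekday and day-of-month attributes of the resulting index. This is a small sketch; the variable names are illustrative:

```python
import pandas as pd

fridays = pd.date_range(start="2000-01-01", periods=5, freq="W-FRI")
second_tuesdays = pd.date_range(start="2000-01-01", periods=5, freq="WOM-2TUE")

# dayofweek counts from Monday=0, so Tuesday is 1 and Friday is 4.
all_fridays = bool((fridays.dayofweek == 4).all())
all_tuesdays = bool((second_tuesdays.dayofweek == 1).all())

# The second occurrence of a weekday always falls on day 8 to 14 of the month.
in_second_week = bool(((second_tuesdays.day >= 8) & (second_tuesdays.day <= 14)).all())

print(all_fridays, all_tuesdays, in_second_week)
```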

Finally, we can merge various indexes of different frequencies. The possibilities are endless. We only show one example, where we combine two indexes – each over a decade – one pointing to every first business day of a year and one to the last day of February:

>>> s = pd.date_range(start="2000-01-01", periods=10, freq='BAS-JAN')
>>> t = pd.date_range(start="2000-01-01", periods=10, freq='A-FEB')
>>> s.union(t)
DatetimeIndex(['2000-01-03', '2000-02-29', '2001-01-01', '2001-02-28', '2002-01-01', '2002-02-28', '2003-01-01', '2003-02-28', '2004-01-01', '2004-02-29', '2005-01-03', '2005-02-28', '2006-01-02', '2006-02-28', '2007-01-01', '2007-02-28', '2008-01-01', '2008-02-29', '2009-01-01', '2009-02-28'], dtype='datetime64[ns]', freq=None, tz=None)

We see that 2000, 2005, and 2006 did not start on a weekday and that 2000, 2004, and 2008 were leap years.

We have seen two powerful functions so far, to_datetime and date_range. Now we want to dive into time series by first showing how you can create and plot time series data with only a few lines. In the rest of this section, we will show various ways to access and slice time series data.

It is easy to get started with time series data in Pandas. A random walk can be created and plotted in a few lines:

>>> index = pd.date_range(start='2000-01-01', periods=200, freq='B')
>>> ts = pd.Series(np.random.randn(len(index)), index=index)
>>> walk = ts.cumsum()
>>> walk.plot()

A possible output of this plot is shown in the following figure:

Just as with usual series objects, you can select parts and slice the index:

>>> ts.head()
2000-01-03    1.464142
2000-01-04    0.103077
2000-01-05    0.762656
2000-01-06    1.157041
2000-01-07   -0.427284
Freq: B, dtype: float64
>>> ts[0]
1.4641415817112928
>>> ts[1:3]
2000-01-04    0.103077
2000-01-05    0.762656

We can use date strings as keys, even though our series has a DatetimeIndex:

>>> ts['2000-01-03']
1.4641415817112928

Even though the DatetimeIndex is made of timestamp objects, we can use datetime objects as keys as well:

>>> ts[datetime.datetime(2000, 1, 3)]
1.4641415817112928
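
The lookups so far rely on plain square brackets, which mix positional and label-based access. As a hedged aside, more recent Pandas releases steer toward the explicit accessors iloc (position) and loc (label), and may warn about integer positions inside square brackets. The following sketch uses deterministic, made-up values instead of random ones:

```python
import datetime
import numpy as np
import pandas as pd

index = pd.date_range(start="2000-01-01", periods=5, freq="B")
ts = pd.Series(np.arange(5) * 1.5, index=index)  # illustrative values

# Three ways to reach the first entry, 2000-01-03.
first_by_position = ts.iloc[0]
first_by_string = ts.loc["2000-01-03"]
first_by_datetime = ts.loc[datetime.datetime(2000, 1, 3)]

# Position-based slices exclude the stop position, like Python lists.
middle = ts.iloc[1:3]

print(first_by_position, first_by_string, first_by_datetime)
print(middle)
```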

Access is similar to lookup in dictionaries or lists, but more powerful. We can, for example, slice with strings or even mixed objects:

>>> ts['2000-01-03':'2000-01-05']
2000-01-03    1.464142
2000-01-04    0.103077
2000-01-05    0.762656
Freq: B, dtype: float64
>>> ts['2000-01-03':datetime.datetime(2000, 1, 5)]
2000-01-03    1.464142
2000-01-04    0.103077
2000-01-05    0.762656
Freq: B, dtype: float64
>>> ts['2000-01-03':datetime.date(2000, 1, 5)]
2000-01-03    1.464142
2000-01-04    0.103077
2000-01-05    0.762656
Freq: B, dtype: float64

It is even possible to use partial strings to select groups of entries. If we are only interested in February, we could simply write:

>>> ts['2000-02']
2000-02-01    0.277544
2000-02-02   -0.844352
2000-02-03   -1.900688
2000-02-04   -0.120010
2000-02-07   -0.465916
2000-02-08   -0.575722
2000-02-09    0.426153
2000-02-10    0.720124
2000-02-11    0.213050
2000-02-14   -0.604096
2000-02-15   -1.275345
2000-02-16   -0.708486
2000-02-17   -0.262574
2000-02-18    1.898234
2000-02-21    0.772746
2000-02-22    1.142317
2000-02-23   -1.461767
2000-02-24   -2.746059
2000-02-25   -0.608201
2000-02-28    0.513832
2000-02-29   -0.132000

To see all entries from March until May, inclusive:

>>> ts['2000-03':'2000-05']
2000-03-01    0.528070
2000-03-02    0.200661
                    ...
2000-05-30    1.206963
2000-05-31    0.230351
Freq: B, dtype: float64
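
Because the values in the listings above are random, the selections are easier to verify on a series with predictable content. This sketch repeats both partial-string selections with the explicit loc accessor; the counting values are made up for illustration:

```python
import numpy as np
import pandas as pd

index = pd.date_range(start="2000-01-01", periods=200, freq="B")
ts = pd.Series(np.arange(len(index)), index=index)  # illustrative values

# A year-month string selects every entry of that month.
february = ts.loc["2000-02"]

# Partial strings also work as slice bounds; both ends are included.
spring = ts.loc["2000-03":"2000-05"]

print(len(february), february.index[0].date(), february.index[-1].date())
print(spring.index[0].date(), spring.index[-1].date())
```

February 2000 contains 21 business days, which matches the 21 rows in the listing above.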

Time series can be shifted forward or backward in time. The index stays in place, the values move:

>>> small_ts = ts['2000-02-01':'2000-02-05']
>>> small_ts
2000-02-01    0.277544
2000-02-02   -0.844352
2000-02-03   -1.900688
2000-02-04   -0.120010
Freq: B, dtype: float64
>>> small_ts.shift(2)
2000-02-01         NaN
2000-02-02         NaN
2000-02-03    0.277544
2000-02-04   -0.844352
Freq: B, dtype: float64

To shift backwards in time, we simply use negative values:

>>> small_ts.shift(-2)
2000-02-01   -1.900688
2000-02-02   -0.120010
2000-02-03         NaN
2000-02-04         NaN
Freq: B, dtype: float64
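
A common use of shifting is to compare each value with its predecessor. The following sketch computes step-to-step changes on a small series with made-up values; levels and change are illustrative names:

```python
import pandas as pd

index = pd.date_range(start="2000-02-01", periods=4, freq="B")
levels = pd.Series([1, 3, 6, 10], index=index)  # illustrative values

# Subtracting the series shifted by one step gives the step-to-step change.
# The first entry has no predecessor and becomes NaN.
change = levels - levels.shift(1)

print(change)
```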

Resampling time series

Resampling describes the process of frequency conversion over time series data. It is a helpful technique in various circumstances, as it fosters understanding by grouping together and aggregating data. For example, it is possible to create a new time series from daily temperature data that shows the average temperature per week or month. On the other hand, real-world data may not be taken at uniform intervals, and it may be necessary to map observations onto uniform intervals or to fill in missing values for certain points in time. These are the two main uses of resampling: binning and aggregation (downsampling), and filling in missing data (upsampling).

Downsampling and upsampling occur in other fields as well, such as digital signal processing. There, downsampling is often called decimation and reduces the sample rate. The inverse process, in which the sample rate is increased, is called interpolation. We will look at both directions from a data analysis angle.

Downsampling time series data

Downsampling reduces the number of samples in the data. During this reduction, we are able to apply aggregations over data points. Let's imagine a busy airport with thousands of people passing through every hour. The airport administration has installed a visitor counter in the main area, to get an impression of exactly how busy their airport is.

They are receiving data from the counter device every minute. Here are the hypothetical measurements for a day, beginning at 08:00, ending 600 minutes later at 18:00:

>>> rng = pd.date_range('4/29/2015 8:00', periods=600, freq='T')
>>> ts = pd.Series(np.random.randint(0, 100, len(rng)), index=rng)
>>> ts.head()
2015-04-29 08:00:00     9
2015-04-29 08:01:00    60
2015-04-29 08:02:00    65
2015-04-29 08:03:00    25
2015-04-29 08:04:00    19

To get a better picture of the day, we can downsample this time series to larger intervals, for example, 10 minutes. We can choose an aggregation function as well. The default aggregation is to take all the values and calculate the mean:

>>> ts.resample('10min').head()
2015-04-29 08:00:00    49.1
2015-04-29 08:10:00    56.0
2015-04-29 08:20:00    42.0
2015-04-29 08:30:00    51.9
2015-04-29 08:40:00    59.0
Freq: 10T, dtype: float64

In our airport example, we are also interested in the sum of the values, that is, the combined number of visitors for a given time frame. We can choose the aggregation function by passing a function or a function name to the how parameter:

>>> ts.resample('10min', how='sum').head()
2015-04-29 08:00:00    442
2015-04-29 08:10:00    409
2015-04-29 08:20:00    532
2015-04-29 08:30:00    433
2015-04-29 08:40:00    470
Freq: 10T, dtype: int64
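
The calls above reflect the Pandas version used in this chapter. As a hedged note for readers on more recent releases: there, to our knowledge, resample returns an intermediate Resampler object, and the aggregation is chosen by calling a method on it rather than through the how parameter. A sketch under that assumption, with counting values so the results are easy to check by hand:

```python
import numpy as np
import pandas as pd

# One reading per minute for ten hours; the values simply count up.
rng = pd.date_range("2015-04-29 08:00", periods=600, freq="min")
ts = pd.Series(np.arange(600), index=rng)

# Assumption: a pandas version in which resample() returns a Resampler
# and the aggregation is selected by a method call.
mean_10min = ts.resample("10min").mean()
sum_10min = ts.resample("10min").sum()

print(mean_10min.head(3))
print(sum_10min.head(3))
```

The first bin covers the values 0 to 9, so its mean is 4.5 and its sum is 45.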

Or we can reduce the sampling interval even more by resampling to an hourly interval:

>>> ts.resample('1h', how='sum').head()
2015-04-29 08:00:00    2745
2015-04-29 09:00:00    2897
2015-04-29 10:00:00    3088
2015-04-29 11:00:00    2616
2015-04-29 12:00:00    2691
Freq: H, dtype: int64

We can ask for other things as well. For example, what was the highest per-minute count within each hour?

>>> ts.resample('1h', how='max').head()
2015-04-29 08:00:00    97
2015-04-29 09:00:00    98
2015-04-29 10:00:00    99
2015-04-29 11:00:00    98
2015-04-29 12:00:00    99
Freq: H, dtype: int64

Or we can define a custom function if we are interested in more unusual metrics. For example, we could be interested in selecting a random sample for each hour:

>>> import random
>>> ts.resample('1h', how=lambda m: random.choice(m)).head()
2015-04-29 08:00:00    28
2015-04-29 09:00:00    14
2015-04-29 10:00:00    68
2015-04-29 11:00:00    31
2015-04-29 12:00:00     5

If you specify a function by string, Pandas uses highly optimized versions.

The built-in functions that can be used as argument to how are: sum, mean, std, sem, max, min, median, first, last, ohlc. The ohlc metric is popular in finance. It stands for open-high-low-close. An OHLC chart is a typical way to illustrate movements in the price of a financial instrument over time.

While in our airport this metric might not be that valuable, we can compute it nonetheless:

>>> ts.resample('1h', how='ohlc').head()
                     open  high  low  close
2015-04-29 08:00:00     9    97    0     14
2015-04-29 09:00:00    68    98    3     12
2015-04-29 10:00:00    71    99    1      1
2015-04-29 11:00:00    59    98    0      4
2015-04-29 12:00:00    56    99    3     55
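
For readers on a Pandas release where resample returns a Resampler object (an assumption about the installed version, not something stated in this chapter), the maximum, OHLC, and custom aggregations can be sketched as below. The custom function here computes the spread between the largest and smallest value, swapped in for the random choice used earlier so that the result is reproducible:

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2015-04-29 08:00", periods=600, freq="min")
ts = pd.Series(np.arange(600) % 60, index=rng)  # 0..59 within every hour

hourly = ts.resample("1h")

# Built-in aggregations are available as methods.
hourly_max = hourly.max()
hourly_ohlc = hourly.ohlc()

# A custom function receives the values of one bin at a time; here the
# spread between the largest and smallest value in each hour.
hourly_spread = hourly.apply(lambda m: m.max() - m.min())

print(hourly_ohlc.head(2))
print(hourly_spread.head(2))
```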

Upsampling time series data

In upsampling, the frequency of the time series is increased. As a result, we have more sample points than data points. One of the main questions is how to account for the entries in the series where we have no measurement.

Let's start with hourly data for a single day:

>>> rng = pd.date_range('4/29/2015 8:00', periods=10, freq='H')
>>> ts = pd.Series(np.random.randint(0, 100, len(rng)), index=rng)
>>> ts.head()
2015-04-29 08:00:00    30
2015-04-29 09:00:00    27
2015-04-29 10:00:00    54
2015-04-29 11:00:00     9
2015-04-29 12:00:00    48
Freq: H, dtype: int64

If we upsample to data points taken every 15 minutes, our time series will be extended with NaN values:

>>> ts.resample('15min').head()
2015-04-29 08:00:00    30
2015-04-29 08:15:00   NaN
2015-04-29 08:30:00   NaN
2015-04-29 08:45:00   NaN
2015-04-29 09:00:00    27

There are various ways to deal with missing values, which can be controlled by the fill_method keyword argument to resample. Values can be filled either forward or backward:

>>> ts.resample('15min', fill_method='ffill').head()
2015-04-29 08:00:00    30
2015-04-29 08:15:00    30
2015-04-29 08:30:00    30
2015-04-29 08:45:00    30
2015-04-29 09:00:00    27
Freq: 15T, dtype: int64
>>> ts.resample('15min', fill_method='bfill').head()
2015-04-29 08:00:00    30
2015-04-29 08:15:00    27
2015-04-29 08:30:00    27
2015-04-29 08:45:00    27
2015-04-29 09:00:00    27

With the limit parameter, it is possible to control the number of missing values to be filled:

>>> ts.resample('15min', fill_method='ffill', limit=2).head()
2015-04-29 08:00:00    30
2015-04-29 08:15:00    30
2015-04-29 08:30:00    30
2015-04-29 08:45:00   NaN
2015-04-29 09:00:00    27
Freq: 15T, dtype: float64
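
On more recent Pandas releases, the fill_method keyword is, as far as we are aware, no longer accepted; the fill strategy is chosen by a method on the object returned by resample instead. A sketch under that assumption, using three hourly values so that every result can be checked by eye:

```python
import pandas as pd

rng = pd.date_range("2015-04-29 08:00", periods=3, freq="h")
ts = pd.Series([30, 27, 54], index=rng)

upsampled = ts.resample("15min")

# asfreq() inserts the new time points and leaves them empty (NaN).
with_gaps = upsampled.asfreq()

# ffill()/bfill() copy the previous/next known value into the gaps;
# limit caps how many consecutive gaps are filled.
forward = upsampled.ffill()
backward = upsampled.bfill()
forward_limited = upsampled.ffill(limit=2)

print(with_gaps.head())
print(forward_limited.head())
```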

If you want to adjust the labels during resampling, you can use the loffset keyword argument:

>>> ts.resample('15min', fill_method='ffill', limit=2, loffset='5min').head()
2015-04-29 08:05:00    30
2015-04-29 08:20:00    30
2015-04-29 08:35:00    30
2015-04-29 08:50:00   NaN
2015-04-29 09:05:00    27
Freq: 15T, dtype: float64
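
The loffset keyword may not be available in every Pandas version. Where it is missing, the same label adjustment can be made after resampling by adding a Timedelta to the index. This sketch assumes a release in which the fill is selected with the ffill method; filled and shifted are illustrative names:

```python
import pandas as pd

rng = pd.date_range("2015-04-29 08:00", periods=3, freq="h")
ts = pd.Series([30, 27, 54], index=rng)

filled = ts.resample("15min").ffill(limit=2)

# Move every label five minutes later by adding a Timedelta to the index.
shifted = filled.copy()
shifted.index = shifted.index + pd.Timedelta(minutes=5)

print(shifted.head())
```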

There is another way to fill in missing values. We could employ an algorithm to construct new data points that would somehow fit the existing points, for some definition of somehow. This process is called interpolation.

We can ask Pandas to interpolate a time series for us:

>>> tsx = ts.resample('15min')
>>> tsx.interpolate().head()
2015-04-29 08:00:00    30.00
2015-04-29 08:15:00    29.25
2015-04-29 08:30:00    28.50
2015-04-29 08:45:00    27.75
2015-04-29 09:00:00    27.00
Freq: 15T, dtype: float64

We saw the default interpolate method – a linear interpolation – in action. Pandas assumes a linear relationship between two existing points.

Pandas supports over a dozen interpolation functions, some of which require the scipy library to be installed. We will not cover interpolation methods in this chapter, but we encourage you to explore the various methods yourself. The right interpolation method will depend on the requirements of your application.
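
To make the linear case concrete: between 30 at 08:00 and 27 at 09:00 there are four 15-minute steps, so each step changes the value by (27 - 30) / 4 = -0.75. The sketch below reproduces this on a version of Pandas where resample returns a Resampler object (an assumption); asfreq inserts the empty points before interpolate fills them:

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2015-04-29 08:00", periods=3, freq="h")
ts = pd.Series([30.0, 27.0, 54.0], index=rng)

# Insert the 15-minute points as NaN, then fill them on a straight line
# between the neighbouring measurements.
linear = ts.resample("15min").asfreq().interpolate(method="linear")

print(linear.head())
```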

Time zone handling

While, by default, Pandas objects are time zone unaware, many real-world applications will make use of time zones. As with working with time in general, time zones are no trivial matter: do you know which countries have daylight saving time and do you know when the time zone is switched in those countries? Thankfully, Pandas builds on the time zone capabilities of two popular and proven utility libraries for time and date handling: pytz and dateutil:

>>> t = pd.Timestamp('2000-01-01')
>>> t.tz is None
True

To supply time zone information, you can use the tz keyword argument:

>>> t = pd.Timestamp('2000-01-01', tz='Europe/Berlin')
>>> t.tz
<DstTzInfo 'Europe/Berlin' CET+1:00:00 STD>

This works for ranges as well:

>>> rng = pd.date_range('1/1/2000 00:00', periods=10, freq='D', tz='Europe/London')
>>> rng
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08', '2000-01-09', '2000-01-10'], dtype='datetime64[ns]', freq='D', tz='Europe/London')

Time zone objects can also be constructed beforehand:

>>> import pytz
>>> tz = pytz.timezone('Europe/London')
>>> rng = pd.date_range('1/1/2000 00:00', periods=10, freq='D', tz=tz)
>>> rng
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08', '2000-01-09', '2000-01-10'], dtype='datetime64[ns]', freq='D', tz='Europe/London')

Sometimes, you will already have a time zone unaware time series object that you would like to make time zone aware. The tz_localize function helps to switch between time zone aware and time zone unaware objects:

>>> rng = pd.date_range('1/1/2000 00:00', periods=10, freq='D')
>>> ts = pd.Series(np.random.randn(len(rng)), rng)
>>> ts.index.tz is None
True
>>> ts_utc = ts.tz_localize('UTC')
>>> ts_utc.index.tz
<UTC>

To move a time zone aware object to other time zones, you can use the tz_convert method:

>>> ts_utc.tz_convert('Europe/Berlin').index.tz
<DstTzInfo 'Europe/Berlin' LMT+0:53:00 STD>
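
Converting between time zones changes the wall-clock reading but not the instant in time. The following sketch, with two made-up UTC timestamps, shows the winter and summer offsets for Berlin:

```python
import pandas as pd

# Two instants given in UTC: one in winter, one in summer.
utc_index = pd.to_datetime(["2000-01-01 00:00", "2000-07-01 00:00"]).tz_localize("UTC")
ts_utc = pd.Series([1, 2], index=utc_index)

ts_berlin = ts_utc.tz_convert("Europe/Berlin")

# Same instants, different wall-clock readings: +1 hour in winter, +2 in summer.
print(ts_berlin.index)
print(ts_berlin.index[0] == ts_utc.index[0])
```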

Finally, to detach any time zone information from an object, it is possible to pass None to either tz_convert or tz_localize:

>>> ts_utc.tz_convert(None).index.tz is None
True
>>> ts_utc.tz_localize(None).index.tz is None
True
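
Both calls return a time zone unaware index, but for a series that is not in UTC they keep different wall-clock times. A short sketch with a Berlin-based index illustrates the difference; the variable names are illustrative:

```python
import pandas as pd

idx = pd.date_range("2000-01-01 00:00", periods=2, freq="D", tz="Europe/Berlin")
ts_berlin = pd.Series([1, 2], index=idx)

# tz_localize(None) keeps the local wall-clock time and drops the zone.
local_naive = ts_berlin.tz_localize(None)

# tz_convert(None) converts to UTC first and then drops the zone.
utc_naive = ts_berlin.tz_convert(None)

print(local_naive.index[0], utc_naive.index[0])
```

For the series ts_utc used above, the two results coincide, because its local time already is UTC.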

Timedeltas

Along with the powerful timestamp object, which acts as a building block for the DatetimeIndex, there is another useful data structure, which has been introduced in Pandas 0.15 – the Timedelta. The Timedelta can serve as a basis for indices as well, in this case a TimedeltaIndex.

Timedeltas are differences in times, expressed in different units, such as days, hours, minutes, or seconds. The Timedelta class in Pandas is a subclass of datetime.timedelta from the Python standard library. As with other Pandas data structures, the Timedelta can be constructed from a variety of inputs:

>>> pd.Timedelta('1 days')
Timedelta('1 days 00:00:00')
>>> pd.Timedelta('-1 days 2 min 10s 3us')
Timedelta('-2 days +23:57:49.999997')
>>> pd.Timedelta(days=1,seconds=1)
Timedelta('1 days 00:00:01')

As you would expect, Timedeltas allow basic arithmetic:

>>> pd.Timedelta(days=1) + pd.Timedelta(seconds=1)
Timedelta('1 days 00:00:01')

Similar to to_datetime, there is a to_timedelta function that can parse strings or lists of strings into Timedelta structures or TimedeltaIndices:

>>> pd.to_timedelta('20.1s')
Timedelta('0 days 00:00:20.100000')
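
Timedeltas also arise naturally when working with timestamps. The following sketch, with made-up dates, shows subtraction, addition, and the list form of to_timedelta:

```python
import pandas as pd

start = pd.Timestamp("2000-01-01")
end = pd.Timestamp("2000-01-03 12:00")

# Subtracting two timestamps yields a Timedelta ...
elapsed = end - start

# ... and adding a Timedelta to a timestamp yields a new timestamp.
later = start + pd.Timedelta(hours=36)

# to_timedelta also accepts a list and then returns a TimedeltaIndex.
steps = pd.to_timedelta(["1 days", "12 hours", "20.1s"])

print(elapsed, later, elapsed.total_seconds())
print(steps)
```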

Instead of absolute dates, we could create an index of timedeltas. Imagine measurements from a volcano, for example. We might want to take measurements but index them relative to a given date, for example the date of the last eruption. We could create a timedelta index that has the last seven days as entries:

>>> pd.to_timedelta(np.arange(7), unit='D')
TimedeltaIndex(['0 days', '1 days', '2 days', '3 days', '4 days', '5 days', '6 days'], dtype='timedelta64[ns]', freq=None)

We could then work with time series data, indexed from the last eruption. If we had measurements for many eruptions (from possibly multiple volcanos), we would have an index that would make comparisons and analysis of this data easier. For example, we could ask whether there is a typical pattern that occurs between the third day and the fifth day after an eruption. This question would not be impossible to answer with a DatetimeIndex, but a TimedeltaIndex makes this kind of exploration much more convenient.
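
As a sketch of the comparison described here, two sets of entirely made-up readings can share one TimedeltaIndex and be sliced by elapsed time. The column names eruption_a and eruption_b are hypothetical:

```python
import numpy as np
import pandas as pd

# Days elapsed since each (hypothetical) eruption.
since_eruption = pd.to_timedelta(np.arange(7), unit="D")

# Made-up readings for two eruptions, aligned on elapsed time
# rather than on calendar dates.
readings = pd.DataFrame(
    {"eruption_a": [9, 8, 6, 5, 5, 3, 2], "eruption_b": [7, 7, 6, 4, 3, 3, 1]},
    index=since_eruption,
)

# Select day three through day five after the eruption (both ends included).
window = readings.loc[pd.Timedelta(days=3):pd.Timedelta(days=5)]

print(window)
print(window.mean())
```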

Time series plotting

Pandas comes with great support for plotting, and this holds true for time series data as well.

As a first example, let's take some monthly data and plot it:

>>> rng = pd.date_range(start='2000', periods=120, freq='MS')
>>> ts = pd.Series(np.random.randint(-10, 10, size=len(rng)), rng).cumsum()
>>> ts.head()
2000-01-01    -4
2000-02-01    -6
2000-03-01   -16
2000-04-01   -26
2000-05-01   -24
Freq: MS, dtype: int64

Since matplotlib is used under the hood, we can pass familiar parameters to plot, such as c for color, or title for the chart title:

>>> import matplotlib.pyplot as plt
>>> ts.plot(c='k', title='Example time series')
>>> plt.show()

The following figure shows an example time series plot:

We can overlay an aggregate plot over 2 and 5 years:

>>> ts.resample('2A').plot(c='0.75', ls='--')
>>> ts.resample('5A').plot(c='0.25', ls='-.')

The following figure shows the resampled 2-year plot:

The following figure shows the resampled 5-year plot:

We can pass the kind of chart to the plot method as well. The return value of the plot method is an AxesSubplot, which allows us to customize many aspects of the plot. Here we are setting the label values on the X axis to the year values from our time series:

>>> plt.clf()
>>> tsx = ts.resample('1A')
>>> ax = tsx.plot(kind='bar', color='k')
>>> ax.set_xticklabels(tsx.index.year)

Let's imagine we have four time series that we would like to plot simultaneously. We generate a matrix of 1000 × 4 random values and treat each column as a separate time series:

>>> plt.clf()
>>> ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
>>> df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
>>> df = df.cumsum()
>>> df.plot(color=['k', '0.75', '0.5', '0.25'], ls='--')

Summary

In this chapter we showed how you can work with time series in Pandas. We introduced two index types, the DatetimeIndex and the TimedeltaIndex and explored their building blocks in depth. Pandas comes with versatile helper functions that take much of the pain out of parsing dates of various formats or generating fixed frequency sequences. Resampling data can help get a more condensed picture of the data, or it can help align various datasets of different frequencies to one another. One of the explicit goals of Pandas is to make it easy to work with missing data, which is also relevant in the context of upsampling.

Finally, we showed how time series can be visualized. Since matplotlib and Pandas are natural companions, we discovered that we can reuse our previous knowledge about matplotlib for time series data as well.

In the next chapter, we will explore ways to load and store data in text files and databases.

Practice exercises

Exercise 1: Find one or two real-world examples of data sets that could sensibly be assigned to each of the following groups:

Fixed frequency data
Variable frequency data
Data where frequency is usually measured in seconds
Data where frequency is measured in nanoseconds
Data where a TimedeltaIndex would be preferable

Exercise 2: Create various fixed frequency ranges:

Every minute between 1 AM and 2 AM on 2000-01-01
Every two hours for a whole week starting 2000-01-01
An entry for every Saturday and Sunday during the year 2000
An entry for every Monday of a month, if it was a business day, for the years 2000, 2001 and 2002
