Book Image

Python: End-to-end Data Analysis

By : Ivan Idris, Luiz Felipe Martins, Martin Czygan, Phuong Vo.T.H, Magnus Vilhelm Persson
Book Image

Python: End-to-end Data Analysis

By: Ivan Idris, Luiz Felipe Martins, Martin Czygan, Phuong Vo.T.H, Magnus Vilhelm Persson

Overview of this book

Data analysis is the process of applying logical and analytical reasoning to study each component of data present in the system. Python is a multi-domain, high-level, programming language that offers a range of tools and libraries suitable for all purposes, it has slowly evolved as one of the primary languages for data science. Have you ever imagined becoming an expert at effectively approaching data analysis problems, solving them, and extracting all of the available information from your data? If yes, look no further, this is the course you need! In this course, we will get you started with Python data analysis by introducing the basics of data analysis and supported Python libraries such as matplotlib, NumPy, and pandas. Create visualizations by choosing color maps, different shapes, sizes, and palettes then delve into statistical data analysis using distribution algorithms and correlations. You’ll then find your way around different data and numerical problems, get to grips with Spark and HDFS, and set up migration scripts for web mining. You’ll be able to quickly and accurately perform hands-on sorting, reduction, and subsequent analysis, and fully appreciate how data analysis methods can support business decision-making. Finally, you will delve into advanced techniques such as performing regression, quantifying cause and effect using Bayesian methods, and discovering how to use Python’s tools for supervised machine learning. The course provides you with highly practical content explaining data analysis with Python, from the following Packt books: 1. Getting Started with Python Data Analysis. 2. Python Data Analysis Cookbook. 3. Mastering Python Data Analysis. By the end of this course, you will have all the knowledge you need to analyze your data with varying complexity levels, and turn it into actionable insights.
Table of Contents (6 chapters)

Time series typically consist of a sequence of data points coming from measurements taken over time. This kind of data is very common and occurs in a multitude of fields.

A business executive is interested in stock prices, prices of goods and services or monthly sales figures. A meteorologist takes temperature measurements several times a day and also keeps records of precipitation, humidity, wind direction and force. A neurologist can use electroencephalography to measure electrical activity of the brain along the scalp. A sociologist can use campaign contribution data to learn about political parties and their supporters and use these insights as an argumentation aid. More examples for time series data can be enumerated almost endlessly.

Python supports date and time handling in the date time and time modules from the standard library:

Sometimes, dates are given or expected as strings, so a conversion from or to strings is necessary, which is realized by two functions: strptime and strftime, respectively:

Real-world data usually comes in all kinds of shapes and it would be great if we did not need to remember the exact date format specifies for parsing. Thankfully, Pandas abstracts away a lot of the friction, when dealing with strings representing dates or time. One of these helper functions is to_datetime:

The last can refer to August 7th or July 8th, depending on the region. To disambiguate this case, to_datetime can be passed a keyword argument dayfirst:

Timestamp objects can be seen as Pandas' version of datetime objects and indeed, the Timestamp class is a subclass of datetime:

Which means they can be used interchangeably in many cases:

Timestamp objects are an important part of time series capabilities of Pandas, since timestamps are the building block of DateTimeIndex objects:

There are a few things to note here: We create a list of timestamp objects and pass it to the series constructor as index. This list of timestamps gets converted into a DatetimeIndex on the fly. If we had passed only the date strings, we would not get a DatetimeIndex, just an index:

However, the to_datetime function is flexible enough to be of help, if all we have is a list of date strings:

Another thing to note is that while we have a DatetimeIndex, the freq and tz attributes are both None. We will learn about the utility of both attributes later in this chapter.

With to_datetime we are able to convert a variety of strings and even lists of strings into timestamp or DatetimeIndex objects. Sometimes we are not explicitly given all the information about a series and we have to generate sequences of time stamps of fixed intervals ourselves.

Pandas offer another great utility function for this task: date_range.

The date_range function helps to generate a fixed frequency datetime index between start and end dates. It is also possible to specify either the start or end date and the number of timestamps to generate.

The frequency can be specified by the freq parameter, which supports a number of offsets. You can use typical time intervals like hours, minutes, and seconds:

The freq attribute allows us to specify a multitude of options. Pandas has been used successfully in finance and economics, not least because it is really simple to work with business dates as well. As an example, to get an index with the first three business days of the millennium, the B offset alias can be used:

The following table shows the available offset aliases and can be also be looked up in the Pandas documentation on time series under http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases:

Alias

Description

B

Business day frequency

C

Custom business day frequency

D

Calendar day frequency

W

Weekly frequency

M

Month end frequency

BM

Business month end frequency

CBM

Custom business month end frequency

MS

Month start frequency

BMS

Business month start frequency

CBMS

Custom business month start frequency

Q

Quarter end frequency

BQ

Business quarter frequency

QS

Quarter start frequency

BQS

Business quarter start frequency

A

Year end frequency

BA

Business year end frequency

AS

Year start frequency

BAS

Business year start frequency

BH

Business hour frequency

H

Hourly frequency

T

Minutely frequency

S

Secondly frequency

L

Milliseconds

U

Microseconds

N

Nanoseconds

Moreover, the offset aliases can be used in combination as well. Here, we are generating a datetime index with five elements, each one day, one hour, one minute and one second apart:

If we want to index data every 12 hours of our business time, which by default starts at 9 AM and ends at 5 PM, we would simply prefix the BH alias:

A custom definition of what a business hour means is also possible:

We can use this custom business hour to build indexes as well:

Some frequencies allow us to specify an anchoring suffix, which allows us to express intervals, such as every Friday or every second Tuesday of the month:

Finally, we can merge various indexes of different frequencies. The possibilities are endless. We only show one example, where we combine two indexes – each over a decade – one pointing to every first business day of a year and one to the last day of February:

We see, that 2000 and 2005 did not start on a weekday and that 2000, 2004, and 2008 were the leap years.

We have seen two powerful functions so far, to_datetime and date_range. Now we want to dive into time series by first showing how you can create and plot time series data with only a few lines. In the rest of this section, we will show various ways to access and slice time series data.

It is easy to get started with time series data in Pandas. A random walk can be created and plotted in a few lines:

A possible output of this plot is show in the following figure:

Working with date and time objects

Just as with usual series objects, you can select parts and slice the index:

We can use date strings as keys, even though our series has a DatetimeIndex:

Even though the DatetimeIndex is made of timestamp objects, we can use datetime objects as keys as well:

Access is similar to lookup in dictionaries or lists, but more powerful. We can, for example, slice with strings or even mixed objects:

It is even possible to use partial strings to select groups of entries. If we are only interested in February, we could simply write:

To see all entries from March until May, including:

Time series can be shifted forward or backward in time. The index stays in place, the values move:

To shift backwards in time, we simply use negative values:

Downsampling reduces the number of samples in the data. During this reduction, we are able to apply aggregations over data points. Let's imagine a busy airport with thousands of people passing through every hour. The airport administration has installed a visitor counter in the main area, to get an impression of exactly how busy their airport is.

They are receiving data from the counter device every minute. Here are the hypothetical measurements for a day, beginning at 08:00, ending 600 minutes later at 18:00:

To get a better picture of the day, we can downsample this time series to larger intervals, for example, 10 minutes. We can choose an aggregation function as well. The default aggregation is to take all the values and calculate the mean:

In our airport example, we are also interested in the sum of the values, that is, the combined number of visitors for a given time frame. We can choose the aggregation function by passing a function or a function name to the how parameter works:

Or we can reduce the sampling interval even more by resampling to an hourly interval:

We can ask for other things as well. For example, what was the maximum number of people that passed through our airport within one hour:

Or we can define a custom function if we are interested in more unusual metrics. For example, we could be interested in selecting a random sample for each hour:

If you specify a function by string, Pandas uses highly optimized versions.

The built-in functions that can be used as argument to how are: sum, mean, std, sem, max, min, median, first, last, ohlc. The ohlc metric is popular in finance. It stands for open-high-low-close. An OHLC chart is a typical way to illustrate movements in the price of a financial instrument over time.

While in our airport this metric might not be that valuable, we can compute it nonetheless:

In upsampling, the frequency of the time series is increased. As a result, we have more sample points than data points. One of the main questions is how to account for the entries in the series where we have no measurement.

Let's start with hourly data for a single day:

If we upsample to data points taken every 15 minutes, our time series will be extended with NaN values:

There are various ways to deal with missing values, which can be controlled by the fill_method keyword argument to resample. Values can be filled either forward or backward:

With the limit parameter, it is possible to control the number of missing values to be filled:

If you want to adjust the labels during resampling, you can use the loffset keyword argument:

There is another way to fill in missing values. We could employ an algorithm to construct new data points that would somehow fit the existing points, for some definition of somehow. This process is called interpolation.

We can ask Pandas to interpolate a time series for us:

We saw the default interpolate method – a linear interpolation – in action. Pandas assumes a linear relationship between two existing points.

Pandas supports over a dozen interpolation functions, some of which require the scipy library to be installed. We will not cover interpolation methods in this chapter, but we encourage you to explore the various methods yourself. The right interpolation method will depend on the requirements of your application.

While, by default, Pandas objects are time zone unaware, many real-world applications will make use of time zones. As with working with time in general, time zones are no trivial matter: do you know which countries have daylight saving time and do you know when the time zone is switched in those countries? Thankfully, Pandas builds on the time zone capabilities of two popular and proven utility libraries for time and date handling: pytz and dateutil:

To supply time zone information, you can use the tz keyword argument:

This works for ranges as well:

Time zone objects can also be constructed beforehand:

Sometimes, you will already have a time zone unaware time series object that you would like to make time zone aware. The tz_localize function helps to switch between time zone aware and time zone unaware objects:

To move a time zone aware object to other time zones, you can use the tz_convert method:

Finally, to detach any time zone information from an object, it is possible to pass None to either tz_convert or tz_localize:

Along with the powerful timestamp object, which acts as a building block for the DatetimeIndex, there is another useful data structure, which has been introduced in Pandas 0.15 – the Timedelta. The Timedelta can serve as a basis for indices as well, in this case a TimedeltaIndex.

Timedeltas are differences in times, expressed in difference units. The Timedelta class in Pandas is a subclass of datetime.timedelta from the Python standard library. As with other Pandas data structures, the Timedelta can be constructed from a variety of inputs:

As you would expect, Timedeltas allow basic arithmetic:

Similar to to_datetime, there is a to_timedelta function that can parse strings or lists of strings into Timedelta structures or TimedeltaIndices:

Instead of absolute dates, we could create an index of timedeltas. Imagine measurements from a volcano, for example. We might want to take measurements but index it from a given date, for example the date of the last eruption. We could create a timedelta index that has the last seven days as entries:

We could then work with time series data, indexed from the last eruption. If we had measurements for many eruptions (from possibly multiple volcanos), we would have an index that would make comparisons and analysis of this data easier. For example, we could ask whether there is a typical pattern that occurs between the third day and the fifth day after an eruption. This question would not be impossible to answer with a DatetimeIndex, but a TimedeltaIndex makes this kind of exploration much more convenient.

Pandas comes with great support for plotting, and this holds true for time series data as well.

As a first example, let's take some monthly data and plot it:

Since matplotlib is used under the hood, we can pass a familiar parameter to plot, such as c for color, or title for the chart title:

The following figure shows an example time series plot:

Time series plotting

We can overlay an aggregate plot over 2 and 5 years:

The following figure shows the resampled 2-year plot:

Time series plotting

The following figure shows the resample 5-year plot:

Time series plotting

We can pass the kind of chart to the plot method as well. The return value of the plot method is an AxesSubplot, which allows us to customize many aspects of the plot. Here we are setting the label values on the X axis to the year values from our time series:

Time series plotting

Let's imagine we have four time series that we would like to plot simultaneously. We generate a matrix of 1000 × 4 random values and treat each column as a separated time series:

Time series plotting

In this chapter we showed how you can work with time series in Pandas. We introduced two index types, the DatetimeIndex and the TimedeltaIndex and explored their building blocks in depth. Pandas comes with versatile helper functions that take much of the pain out of parsing dates of various formats or generating fixed frequency sequences. Resampling data can help get a more condensed picture of the data, or it can help align various datasets of different frequencies to one another. One of the explicit goals of Pandas is to make it easy to work with missing data, which is also relevant in the context of upsampling.

Finally, we showed how time series can be visualized. Since matplotlib and Pandas are natural companions, we discovered that we can reuse our previous knowledge about matplotlib for time series data as well.

In the next chapter, we will explore ways to load and store data in text files and databases.

Practice exercises

Exercise 1: Find one or two real-world examples for data sets, which could – in a sensible way – be assigned to the following groups:

Create various fixed frequency ranges: