Python: End-to-end Data Analysis

By: Ivan Idris, Luiz Felipe Martins, Martin Czygan, Phuong Vo.T.H, Magnus Vilhelm Persson

Overview of this book

Data analysis is the process of applying logical and analytical reasoning to study each component of the data present in a system. Python is a multi-domain, high-level programming language that offers tools and libraries suitable for a wide range of purposes, and it has steadily become one of the primary languages for data science. Have you ever imagined becoming an expert at approaching data analysis problems, solving them, and extracting all of the available information from your data? If so, this is the course you need. In this course, we get you started with Python data analysis by introducing the basics of data analysis and the supporting Python libraries, such as matplotlib, NumPy, and pandas. You will create visualizations by choosing color maps, shapes, sizes, and palettes, then delve into statistical data analysis using distribution algorithms and correlations. You will then find your way around different data and numerical problems, get to grips with Spark and HDFS, and set up migration scripts for web mining. You will be able to quickly and accurately perform hands-on sorting, reduction, and subsequent analysis, and appreciate how data analysis methods can support business decision-making. Finally, you will delve into advanced techniques such as performing regression, quantifying cause and effect using Bayesian methods, and using Python's tools for supervised machine learning. The course provides highly practical content explaining data analysis with Python, drawn from the following Packt books: (1) Getting Started with Python Data Analysis, (2) Python Data Analysis Cookbook, and (3) Mastering Python Data Analysis. By the end of this course, you will have the knowledge you need to analyze data of varying complexity and turn it into actionable insights.
Table of Contents (6 chapters)

Data visualization is concerned with the presentation of data in a pictorial or graphical form. It is one of the most important tasks in data analysis, since it enables us to see analytical results, detect outliers, and make decisions for model building. There are many Python libraries for visualization, of which matplotlib, seaborn, bokeh, and ggplot are among the most popular. However, in this chapter, we mainly focus on the matplotlib library that is used by many people in many different contexts.

Matplotlib produces publication-quality figures in a variety of formats and works in interactive environments across platforms. Another advantage is that Pandas comes equipped with useful wrappers around several matplotlib plotting routines, allowing for quick and handy plotting of Series and DataFrame objects.

The IPython package started as an alternative to the standard interactive Python shell, but has since evolved into an indispensable tool for data exploration, visualization, and rapid prototyping. The graphical capabilities of matplotlib can be used from IPython through various options, of which the simplest to get started with is the pylab flag, passed on the command line as ipython --pylab.

This flag will preload matplotlib and numpy for interactive use with the default matplotlib backend. IPython can run in various environments: in a terminal, as a Qt application, or inside a browser. These options are worth exploring, since IPython has enjoyed adoption for many use cases, such as prototyping, interactive slides for more engaging conference talks or lectures, and as a tool for sharing research.

The easiest way to get started with plotting using matplotlib is often by using the MATLAB-style API that is supported by the package:
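The original listing is not reproduced here, so the following is only a minimal sketch of what MATLAB-style plotting can look like. The data (six points of y = x squared) is chosen to match the axis ranges mentioned later in this section; the labels, title, and the Agg backend line are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a plot window

# MATLAB-style code usually begins with a star import such as
#     from pylab import *
# Here the same names are imported explicitly to keep the sketch self-contained.
from numpy import linspace, power
from matplotlib.pyplot import figure, plot, xlabel, ylabel, title, gca

x = linspace(0, 3, 6)   # six evenly spaced points from 0 to 3
y = power(x, 2)         # y = x squared, so y runs from 0 to 9

figure()
plot(x, y, "r")         # "r" draws a red line
xlabel("x")
ylabel("y")
title("Data visualization in MATLAB-like API")  # illustrative title
# In an interactive session, calling show() from the same namespace displays the figure.
```

The functions are called without any prefix, which is what makes this style feel like MATLAB.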

[Figure: output of the preceding command. Caption: The matplotlib API primer]

However, star imports should not be used unless there is a good reason for doing so. In the case of matplotlib, the canonical import is import matplotlib.pyplot as plt.

The preceding example could then be written as follows:
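As a sketch of that rewrite, assuming the same illustrative data (six points of y = x squared), every plotting call is now reached through the plt namespace. The title text and the Agg backend line are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a plot window
import numpy as np
import matplotlib.pyplot as plt   # the canonical import

x = np.linspace(0, 3, 6)
y = np.power(x, 2)

plt.figure()
plt.plot(x, y, "r")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Data visualization using the pyplot interface")  # illustrative title
# In an interactive session, plt.show() displays the figure.
```

Prefixing each call with plt makes it clear where a function comes from and avoids name clashes with NumPy or built-in functions.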

[Figure: output of the preceding command. Caption: The matplotlib API primer]

If we only provide a single argument to the plot function, it will automatically use it as the y values and generate the x values from 0 to N-1, where N is equal to the number of values:
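A small sketch of this behavior, assuming six illustrative y values: the x data stored on the returned line is inspected to confirm that matplotlib generated 0 to N-1.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a plot window
import numpy as np
import matplotlib.pyplot as plt

y = np.power(np.linspace(0, 3, 6), 2)   # six y values, from 0 to 9

plt.figure()
(line,) = plt.plot(y)                   # only y is given

# The x values are generated automatically as 0, 1, ..., N-1.
generated_x = list(line.get_xdata())
print(generated_x)
```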

[Figure: output of the preceding command. Caption: The matplotlib API primer]

By default, the range of the axes is constrained by the range of the input x and y data. If we want to specify the viewport of the axes, we can use the plt.axis() function to set custom ranges. For example, in the previous visualization, we could increase the range of the x axis from [0, 5] to [0, 6], and that of the y axis from [0, 9] to [0, 10], by writing the following command:
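A minimal sketch of that command, assuming the same illustrative y values as before. plt.axis() takes the limits in the order [xmin, xmax, ymin, ymax].

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a plot window
import numpy as np
import matplotlib.pyplot as plt

y = np.power(np.linspace(0, 3, 6), 2)   # x is generated as 0..5, y runs from 0 to 9

plt.figure()
plt.plot(y)
plt.axis([0, 6, 0, 10])                 # [xmin, xmax, ymin, ymax]

limits = plt.axis()                     # with no arguments, returns the current limits
```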

By default, all plotting commands apply to the current figure and axes. In some situations, we want to visualize data in multiple figures and axes to compare different plots or to use the space on a page more efficiently. There are two steps required before we can plot the data. Firstly, we have to define which figure we want to plot on. Secondly, we need to specify the position of our subplot in the figure:
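The two steps can be sketched as follows. The figure name a comes from the text; the data and line styles are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a plot window
import numpy as np
import matplotlib.pyplot as plt

y = np.power(np.linspace(0, 3, 6), 2)

fig = plt.figure("a")      # step 1: select (or create) the figure named "a"

plt.subplot(221)           # step 2: 2x2 grid, first position
plt.plot(y + y, "r--")     # red dashed line

plt.subplot(222)           # second position
plt.plot(y * 3, "ko")      # black circles

plt.subplot(223)           # third position
plt.plot(y * y, "b^")      # blue triangles

plt.subplot(224)           # fourth position, left empty here
```

The three-digit argument is shorthand for (rows, columns, index), so 222 means the second cell of a 2x2 grid.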

[Figure: output of the preceding command. Caption: Figures and subplots]

In this case, the current figure is the one named a. If we want to modify any subplot in figure a, we first call the commands to select the figure and the subplot, and then execute the function that modifies the subplot. Here, for example, we change the title of the second plot of our four-plot figure:
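A minimal sketch of this select-then-modify pattern. It first rebuilds an empty four-subplot figure named a so that the example is self-contained; the title text is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a plot window
import matplotlib.pyplot as plt

# Set up a 2x2 figure named "a" so there is something to modify.
fig = plt.figure("a")
for position in (221, 222, 223, 224):
    plt.subplot(position)

plt.figure("a")                      # select figure "a" again
ax = plt.subplot(222)                # select its second subplot
plt.title("visualization of y*3")    # applies to the currently selected axes
```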

[Figure: output of the preceding command. Caption: Figures and subplots]

There is a convenience function, plt.subplots(), for creating a figure that contains a given number of subplots. As in the previous example, we can use the plt.subplots(2,2) command to create a 2x2 figure that consists of four subplots.
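A short sketch of plt.subplots(2, 2), which returns the figure together with an array of axes that can be indexed by row and column. The plotted data and the title are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a plot window
import numpy as np
import matplotlib.pyplot as plt

y = np.power(np.linspace(0, 3, 6), 2)

fig, axes = plt.subplots(2, 2)       # axes is a 2x2 NumPy array of Axes objects

axes[0, 0].plot(y + y, "r--")
axes[0, 1].plot(y * 3, "ko")
axes[0, 1].set_title("second subplot")   # modify one subplot directly through its Axes
axes[1, 0].plot(y * y, "b^")
```

Working with the returned Axes objects avoids having to re-select the current subplot before each change.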

We can also create the axes manually, instead of on a rectangular grid, by using the plt.axes([left, bottom, width, height]) command, where all input parameters are given in fractional [0, 1] coordinates of the figure:
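A sketch of manual axes placement, with two small axes at the bottom and one wide axes on top. The figure name b and the coordinate values are assumptions picked to give a balanced layout.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a plot window
import matplotlib.pyplot as plt

fig = plt.figure("b")

# Each list is [left, bottom, width, height] as fractions of the figure size.
ax1 = plt.axes([0.05, 0.10, 0.40, 0.32])   # bottom left
ax2 = plt.axes([0.52, 0.10, 0.40, 0.32])   # bottom right
ax3 = plt.axes([0.05, 0.53, 0.87, 0.44])   # wide axes across the top
```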

[Figure: output of the preceding command. Caption: Figures and subplots]

However, when you manually create axes, it takes more time to balance coordinates and sizes between subplots to arrive at a well-proportioned figure.

We have looked at how to create simple line plots so far. The matplotlib library supports many more plot types that are useful for data visualization. However, our goal is to provide the basic knowledge that will help you to understand and use the library for visualizing data in the most common situations. Therefore, we will only focus on four kinds of plot types: scatter plots, bar plots, contour plots, and histograms.

Scatter plots

Bar plots

Contour plots

Histogram plots
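The descriptions of the four plot types are not available here, so the following is only a rough sketch of how each one can be drawn with matplotlib. All data is randomly generated or made up for illustration, and the variable names are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a plot window
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)               # seeded so the sketch is repeatable
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# Scatter plot: one marker per (x, y) pair.
xs = rng.normal(size=50)
ys = rng.normal(size=50)
axes[0, 0].scatter(xs, ys, s=20, alpha=0.6)
axes[0, 0].set_title("Scatter plot")

# Bar plot: one bar per category (made-up values).
categories = ["A", "B", "C", "D"]
values = [3, 7, 5, 2]
bars = axes[0, 1].bar(categories, values)
axes[0, 1].set_title("Bar plot")

# Contour plot: level curves of a function evaluated on a grid.
gx, gy = np.meshgrid(np.linspace(-2, 2, 40), np.linspace(-2, 2, 40))
gz = np.exp(-(gx ** 2 + gy ** 2))
axes[1, 0].contour(gx, gy, gz, levels=8)
axes[1, 0].set_title("Contour plot")

# Histogram: counts of samples falling into each bin.
samples = rng.normal(loc=0.0, scale=1.0, size=500)
counts, bin_edges, _ = axes[1, 1].hist(samples, bins=20)
axes[1, 1].set_title("Histogram")

fig.tight_layout()
```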

A legend is an important element used to identify the plot elements in a figure. The easiest way to show a legend inside a figure is to pass the label argument to the plot function, and then show the labels by calling the plt.legend() function:
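A minimal sketch of labeling lines and showing a legend. The three trigonometric curves and their colors are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a plot window
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 20)

plt.figure()
plt.plot(x, np.sin(x), "c", label="y=sin(x)")   # label is picked up by the legend
plt.plot(x, np.cos(x), "y", label="y=cos(x)")
plt.plot(x, np.tan(x), "r", label="y=tan(x)")

legend = plt.legend(loc="upper left")           # loc controls where the box is drawn
legend_labels = [text.get_text() for text in legend.get_texts()]
```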

[Figure: output of the preceding command. Caption: Legends and annotations]

The loc argument of the legend command sets the position of the legend box. The valid location options are best, upper right, upper left, lower left, lower right, right, center left, center right, lower center, upper center, and center. In older matplotlib releases, the default position is upper right, and an invalid location option falls back to best. More recent releases default to best and reject an unrecognized location string with an error, so it is worth checking the behavior of the installed version.

If we want to split the legend into multiple boxes in a figure, we can manually set our expected labels for plot lines, as shown in the following image:
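The listing referred to here was an image, so the following is only a sketch of one way to do this. Each call to legend() replaces the axes' current legend, so the earlier legends are re-attached with add_artist(). The curves and locations are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a plot window
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 20)
fig, ax = plt.subplots()

p1 = ax.plot(x, np.sin(x), "c", label="y=sin(x)")
p2 = ax.plot(x, np.cos(x), "y", label="y=cos(x)")
p3 = ax.plot(x, np.tan(x), "r", label="y=tan(x)")

# Create one legend per line. Without add_artist(), only the last legend
# would remain, because a new legend replaces the previous one.
lsin = ax.legend(handles=p1, loc="lower right")
ax.add_artist(lsin)
lcos = ax.legend(handles=p2, loc="upper left")
ax.add_artist(lcos)
ltan = ax.legend(handles=p3, loc="upper right")

legend_count = sum(type(child).__name__ == "Legend" for child in ax.get_children())
```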

[Figure: Legends and annotations]

The other element of a figure that we want to introduce is the annotation, which can consist of text, arrows, or other shapes that explain parts of the figure in detail or emphasize some special data points. There are different functions for showing annotations, such as text, arrow, and annotate.

Here is a simple example to illustrate the annotate and text functions:
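A possible sketch, assuming a parabola y = x^2 + 2x + 1 as the illustrative curve: text() places a free-standing label, while annotate() attaches a label to a data point with an arrow. The coordinates and wording are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a plot window
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-2.4, 0.4, 20)
y = x * x + 2 * x + 1            # equals (x + 1)^2, with its minimum at (-1, 0)

fig, ax = plt.subplots()
ax.plot(x, y, "c", linewidth=2.0)

# text(): put a string at the given data coordinates.
ax.text(-1.5, 1.8, "y = x^2 + 2*x + 1", fontsize=14, style="italic")

# annotate(): xy is the point being annotated, xytext is where the label sits.
ax.annotate(
    "minimum point",
    xy=(-1, 0),
    xytext=(-1, 0.3),
    horizontalalignment="center",
    verticalalignment="top",
    arrowprops=dict(arrowstyle="->", connectionstyle="arc3"),
)
```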

[Figure: output of the preceding command. Caption: Legends and annotations]

We have covered most of the important components in a plot figure using matplotlib. In this section, we will introduce another powerful plotting method for directly creating standard visualization from Pandas data objects that are often used to manipulate data.

For Series or DataFrame objects in Pandas, most plot types are supported, such as line, bar, box, histogram, and scatter plots, as well as pie charts. To select a plot type, we use the kind argument of the plot function. If no kind is specified, the plot function generates a line plot by default, as in the following example:
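A minimal sketch of the default line plot for a Series. The random data, style string, and label are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a plot window
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)                 # seeded so the sketch is repeatable
s = pd.Series(rng.normal(10, 8, 20))           # 20 made-up values

# No kind argument, so a line plot is drawn; the index is used for the x axis.
ax = s.plot(style="ko-", alpha=0.4, label="Series plotting")
ax.legend()
```

Passing kind="bar" or kind="hist" to the same call would switch the plot type.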

[Figure: output of the preceding command. Caption: Plotting functions with Pandas]

Another example will visualize the data of a DataFrame object consisting of multiple columns:
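A sketch of plotting a multi-column DataFrame, with each column drawn in its own subplot. The column names and values are made up for illustration and are not taken from any real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a plot window
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: two measurements recorded for six years.
df = pd.DataFrame(
    {
        "median_age": [24, 26, 28, 34, 32, 34],
        "density": [244, 256, 268, 279, 290, 301],
    },
    index=["2000", "2002", "2005", "2010", "2012", "2014"],
)

# One bar chart per column, stacked vertically and sharing the x axis.
axes = df.plot(kind="bar", subplots=True, sharex=True)
plt.tight_layout()
```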

[Figure: output of the preceding command. Caption: Plotting functions with Pandas]

The plot method of the DataFrame has a number of options that allow us to handle the plotting of the columns. For example, in the above DataFrame visualization, we chose to plot the columns in separate subplots. The following table lists more options:

Argument        Value       Description
subplots        True/False  Plots each data column in a separate subplot
logy            True/False  Uses a log scale on the y axis
secondary_y     True/False  Plots data on a secondary y axis
sharex, sharey  True/False  Shares the same x or y axis, linking ticks and limits
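As a brief sketch of two of these options, assuming a made-up DataFrame whose columns differ by several orders of magnitude: logy switches the y axis to a log scale, and secondary_y (which also accepts a list of column names) moves the chosen columns to a second y axis on the right.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a plot window
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data with very different magnitudes in the two columns.
df = pd.DataFrame(
    {"small": [1, 2, 3, 4, 5], "large": [10, 100, 1000, 10000, 100000]}
)

ax_log = df.plot(logy=True)                 # both columns on a log-scale y axis

ax_sec = df.plot(secondary_y=["large"])     # "large" is drawn against a right-hand y axis
right_axis = ax_sec.right_ax                # pandas exposes the secondary axes here
```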

Besides matplotlib, there are other powerful data visualization toolkits based on Python. While we cannot dive deeper into these libraries, we would like to at least briefly introduce them in this section.

Bokeh

MayaVi

We have now covered most of the basics of data visualization with the matplotlib library, such as its functions, arguments, and properties. We hope that, through the examples, you will be able to understand and apply them to your own problems. In general, visualizing data involves five steps. The first step is getting the data into suitable Python or Pandas data structures, such as lists, dictionaries, Series, or DataFrames; we explained how to accomplish this in the previous chapters. The second step is defining plots and subplots for the data object in question, as discussed in the figures and subplots section. The third step is selecting a plot type, such as line, bar, histogram, or scatter, together with its attributes, such as line style and color. The fourth step is adding extra components to the subplots, such as legends, annotations, and text. The fifth step is displaying or saving the results.
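The five steps above can be sketched end to end as follows. The data, labels, and the choice to save into an in-memory buffer rather than a file are all assumptions made to keep the example small and self-contained.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a plot window
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Step 1: get the data into a suitable structure (made-up values).
df = pd.DataFrame({"x": np.linspace(0, 3, 6)})
df["y"] = df["x"] ** 2

# Step 2: define the figure and its subplot.
fig, ax = plt.subplots()

# Step 3: select the plot type and its attributes (red dashed line).
ax.plot(df["x"], df["y"], "r--", label="y = x^2")

# Step 4: add extra components such as a legend and an annotation.
ax.legend(loc="upper left")
ax.annotate("last point", xy=(3, 9), xytext=(1.5, 8),
            arrowprops=dict(arrowstyle="->"))

# Step 5: display or save the result (here: save as PNG into memory).
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

In an interactive session, plt.show() would display the figure, and passing a filename to fig.savefig() would write it to disk instead.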

By now, you can do quite a few things with a dataset, such as manipulation, cleaning, exploration, and visualization, using Python libraries such as NumPy, Pandas, and matplotlib. You can now combine this knowledge and practice with these libraries to become more familiar with Python data analysis.

Practice exercises: