Learning pandas - Second Edition

By: Michael Heydt

Overview of this book

You will learn how to use pandas to perform data analysis in Python. You will start with an overview of data analysis and iteratively progress from modeling data, to accessing data from remote sources, performing numeric and statistical analysis, through indexing and performing aggregate analysis, and finally to visualizing statistical data and applying pandas to finance. With the knowledge you gain from this book, you will quickly learn pandas and how it can empower you in the exciting world of data manipulation, analysis and science.
Table of Contents (16 chapters)

Concepts of data and analysis in our tour of pandas

When learning pandas and data analysis you will come across many concepts in data, modeling and analysis. Let's examine several of these concepts and how they relate to pandas.

Types of data

Working with data in the wild you will come across several broad categories of data that will need to be coerced into pandas data structures. They are important to understand as the tools required to work with each type vary.

pandas is inherently used for manipulating structured data, but it provides several tools that help convert non-structured data into a form we can manipulate.

Structured

Structured data is any type of data that is organized as fixed fields within a record or file, such as data in relational databases and spreadsheets. Structured data depends upon a data model, which is the defined organization and meaning of the data and often how the data should be processed. This includes specifying the type of the data (integer, float, string, and so on), and any restrictions on the data, such as the number of characters, maximum and minimum values, or a restriction to a certain set of values.

Structured data is the type of data that pandas is designed to utilize. As we will see first with the Series and then with the DataFrame, pandas organizes structured data into one or more columns of data, each of a single and specific data type, and then a series of zero or more rows of data.
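As a minimal sketch of that layout, the following builds a Series and a DataFrame and inspects their data types. The column names and values are made up for illustration and are not taken from the book.

```python
import pandas as pd

# A Series: one column of values sharing a single data type.
temperatures = pd.Series([21.5, 23.0, 19.8], name="temperature")

# A DataFrame: several columns, each with its own single data type.
# Column names and values are hypothetical.
readings = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Pune"],
        "temperature": [21.5, 23.0, 19.8],
        "sensor_count": [3, 5, 2],
    }
)

print(temperatures.dtype)  # data type of the single Series column
print(readings.dtypes)     # one data type per DataFrame column
print(readings.shape)      # (number of rows, number of columns)
```

Each column reports exactly one dtype, which is the "single and specific data type" described above.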

Unstructured

Unstructured data is data that is without any defined organization and which specifically does not break down into stringently defined columns of specific types. This can consist of many types of information such as photos and graphic images, videos, streaming sensor data, web pages, PDF files, PowerPoint presentations, emails, blog entries, wikis, and word processing documents.

While pandas does not manipulate unstructured data directly, it provides a number of facilities to extract structured data from unstructured sources. As a specific example that we will examine, pandas has tools to retrieve web pages and extract specific pieces of content into a DataFrame.

Semi-structured

Semi-structured data fits in between structured and unstructured data. It can be considered a type of structured data, but it lacks a strict data model. JSON is a form of semi-structured data. While good JSON will have a defined format, there is no specific schema for the data that is always strictly enforced. Much of the time, the data will be in a repeatable pattern that can be easily converted into structured data types like the pandas DataFrame, but the process may need some guidance from you to specify or coerce data types.
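The kind of guidance this involves might look like the sketch below. The JSON payload is hypothetical: its records follow a repeatable pattern, but the numbers arrive as strings and one field is nested, so the types have to be coerced after loading.

```python
import json

import pandas as pd

# Hypothetical JSON payload, invented for this example.
raw = """
[
  {"id": "1", "price": "9.99", "owner": {"name": "Ann"}},
  {"id": "2", "price": "14.50", "owner": {"name": "Bob"}}
]
"""

records = json.loads(raw)

# Flatten nested objects into columns such as "owner.name".
df = pd.json_normalize(records)

# Guide pandas toward the intended types; both columns start out as text.
df = df.astype({"id": "int64", "price": "float64"})

print(df.dtypes)
```

Without the `astype` step, `id` and `price` would remain text columns, and numeric operations on them would not behave as expected.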

Variables

When modeling data in pandas, we will be modeling one or more variables and looking to find statistical meaning amongst the values or across multiple variables. This definition of a variable is not in the sense of a variable in a programming language but one of statistical variables.

A variable is any characteristic, number, or quantity that can be measured or counted. A variable is so named because the value may vary between data units in a population and may change in value over time. Stock value, age, sex, business income and expenses, country of birth, capital expenditure, class grades, eye color, and vehicle type are examples of variables.

There are several broad types of statistical variables that we will come across when using pandas:

  • Categorical
  • Continuous
  • Discrete

Categorical

A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values. Each of the possible values is often referred to as a level. Categorical variables in pandas are represented by Categoricals, a pandas data type which corresponds to categorical variables in statistics. Examples of categorical variables are gender, social class, blood types, country affiliations, observation time, or ratings such as Likert scales.
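A minimal sketch of a Categorical follows, assuming a made-up three-level rating scale. Marking the levels as ordered is a choice made for this example; not every categorical variable has a natural order.

```python
import pandas as pd

# Hypothetical survey answers on a three-level ordered scale.
levels = ["low", "medium", "high"]
ratings = pd.Series(
    pd.Categorical(
        ["low", "high", "medium", "high"],
        categories=levels,
        ordered=True,
    )
)

print(ratings.cat.categories)  # the allowed levels
print(ratings.value_counts())  # number of observations per level
print(ratings.max())           # works because the levels are ordered
```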

Continuous

A continuous variable is a variable that can take on infinitely many (an uncountable number of) values. Observations can take any value within a certain range of real numbers. Examples of continuous variables include height, time, and temperature. Continuous variables in pandas are represented by either float or integer types, typically in collections that represent multiple samplings of the specific variable.

Discrete

A discrete variable is a variable where the values are based on a count from a set of distinct whole values. A discrete variable cannot take a fractional value between any two of those values. Examples of discrete variables include the number of registered cars, number of business locations, and number of children in a family, all of which measure whole units (for example 1, 2, or 3 children). Discrete variables are normally represented in pandas by integers (or occasionally floats), again normally in collections of two or more samplings of a variable.
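The sketch below, using invented sample values, shows how these two kinds of variable typically surface as data types, and one common reason why discrete data occasionally ends up stored as floats.

```python
import pandas as pd

# Hypothetical samples: heights are continuous, child counts are discrete.
heights_cm = pd.Series([172.4, 165.0, 180.25])
children = pd.Series([0, 2, 3])

print(heights_cm.dtype)  # a floating-point dtype
print(children.dtype)    # an integer dtype

# With default settings, a missing value turns an integer column into floats.
children_with_gap = pd.Series([0, None, 3])
print(children_with_gap.dtype)
```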

Time series data

Time series data is a first-class entity within pandas. Time adds an important, extra dimension to samples of variables within pandas. Often variables are independent of the time they were sampled at; that is, the time at which they are sampled is not important. But in many cases it is. A time series is a sample of a variable taken at specific, discrete time intervals, where the observations have a natural temporal ordering.

A stochastic model for a time series will generally reflect the fact that observations close together in time will be more closely related than observations that are further apart. Time series models will often make use of the natural one-way ordering of time so that values for a given period will be expressed as deriving in some way from past values rather than from future values.

A common scenario with pandas is financial data where a variable represents the value of a stock as it changes at regular intervals throughout the day. We often want to determine changes in the rate of change of the price at specific intervals. We may also want to correlate the price of multiple stocks across specific intervals of time.
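A rough sketch of this scenario follows. The tickers `AAA` and `BBB`, the prices, and the 30-minute sampling interval are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical prices sampled every 30 minutes for two made-up tickers.
index = pd.date_range("2024-01-02 09:30", periods=6, freq="30min")
prices = pd.DataFrame(
    {
        "AAA": [100.0, 101.0, 100.5, 102.0, 103.0, 102.5],
        "BBB": [50.0, 50.5, 50.2, 51.0, 51.6, 51.2],
    },
    index=index,
)

# Rate of change of the price from one interval to the next.
returns = prices.pct_change().dropna()

# Change in that rate of change between consecutive intervals.
return_changes = returns.diff().dropna()

# Move to a coarser interval, keeping the last price seen in each hour.
hourly = prices.resample("60min").last()

print(returns)
print(return_changes)
print(hourly)
```

Because the index is made of timestamps, changing the interval is a single `resample` call rather than manual bookkeeping.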

This is such an important and robust capability in pandas that we will spend an entire chapter examining the concept.

General concepts of analysis and statistics

In this text, we will only approach the periphery of statistics and the technical processes of data analysis. But several analytical concepts are worth noting, some of which are implemented directly within pandas. Others rely on other libraries such as SciPy, but you may also come across them while working with pandas, so an initial shout-out is valuable.

Quantitative versus qualitative data/analysis

Qualitative analysis is the scientific study of data that can be observed but cannot be measured. It focuses on cataloging the qualities of data. Examples of qualitative data can be:

  • The softness of your skin
  • How elegantly someone runs

Quantitative analysis is the study of actual values within data, with real measurements of items presented as data. Normally, these are values such as:

  • Quantity
  • Price
  • Height

pandas deals primarily with quantitative data, providing you with extensive tools for representing observations of variables. pandas does not provide for qualitative analysis, but it does let you represent qualitative information.

Single and multivariate analysis

Statistics, from a certain perspective, is the practice of studying variables, and specifically the observation of those variables. Much of statistics is based upon doing this analysis for a single variable, which is referred to as univariate analysis. Univariate analysis is the simplest form of analyzing data. It does not deal with causes or relationships and is normally used to describe or summarize data, and to find patterns in it.

Multivariate analysis is a modeling technique that involves two or more variables that affect the outcome of an experiment. Multivariate analysis is often related to concepts such as correlation and regression, which help us understand the relationships between multiple variables, as well as how those relationships affect the outcome.

pandas primarily provides fundamental univariate analysis capabilities. These capabilities are generally descriptive statistics, although there is inherent support for concepts such as correlations (as they are very common in finance and other domains).

Other more complex statistics can be performed with StatsModels. Again, this is not per se a weakness of pandas, but a specific design decision to let those concepts be handled by other dedicated Python libraries.

Descriptive statistics

Descriptive statistics are functions that summarize a given dataset, typically where the dataset represents a population or sample of a single variable (univariate data). They describe the dataset and form measures of a central tendency and measures of variability and dispersion.

For example, the following are descriptive statistics:

  • The distribution (for example, normal, Poisson)
  • The central tendency (for example, mean, median, and mode)
  • The dispersion (for example, variance, standard deviation)

As we will see, the pandas Series and DataFrame objects have integrated support for a large number of descriptive statistics.
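As a small illustration, the sketch below computes several of these measures on an invented sample of a single variable.

```python
import pandas as pd

# Hypothetical sample of a single variable.
scores = pd.Series([2, 4, 4, 5, 7, 8])

print(scores.mean())      # central tendency: arithmetic mean
print(scores.median())    # central tendency: middle value
print(scores.mode())      # central tendency: most frequent value(s)
print(scores.var())       # dispersion: sample variance
print(scores.std())       # dispersion: sample standard deviation
print(scores.describe())  # several summary statistics at once
```

Note that `var` and `std` compute the sample versions by default (dividing by n - 1), which matters when comparing results against other tools.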

Inferential statistics

Inferential statistics differs from descriptive statistics in that inferential statistics attempts to infer conclusions from data instead of simply summarizing it. Examples of inferential statistics include:

  • t-test
  • chi square
  • ANOVA
  • Bootstrapping

These inferential techniques are generally deferred from pandas to other tools such as SciPy and StatsModels.
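Although dedicated libraries are the usual choice, the idea behind one of the listed techniques, bootstrapping, can be sketched with pandas alone. The sample values, the number of resamples, and the 95% percentile interval below are assumptions chosen for illustration; this is not a substitute for the routines in SciPy or StatsModels.

```python
import pandas as pd

# Hypothetical sample of one variable.
sample = pd.Series([12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 11.7, 12.5])

# Resample with replacement many times and record each resample's mean.
# Using the loop counter as the seed keeps the sketch reproducible.
n_resamples = 1000
boot_means = pd.Series(
    [
        sample.sample(frac=1.0, replace=True, random_state=seed).mean()
        for seed in range(n_resamples)
    ]
)

# The middle 95% of the bootstrap means gives a rough percentile
# interval for the population mean.
lower, upper = boot_means.quantile([0.025, 0.975])
print(lower, upper)
```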

Stochastic models

Stochastic models are a form of statistical modeling that includes one or more random variables, and typically includes use of time series data. The purpose of a stochastic model is to estimate the chance that an outcome falls within a specific forecast range, so that conditions can be predicted for different situations.

An example of stochastic modeling is the Monte Carlo simulation. The Monte Carlo simulation is often used for financial portfolio evaluation by simulating the performance of a portfolio based upon repeated simulation of the portfolio in markets that are influenced by various factors and the inherent probability distributions of the constituent stock returns.

pandas gives us the fundamental data structure for stochastic models in the DataFrame, often holding time series data, which is enough to get up and running. While it is possible to code your own stochastic models and analyses using pandas and Python, in many cases there are domain-specific libraries such as PyMC to facilitate this type of modeling.
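A hand-rolled Monte Carlo simulation along these lines might look like the following sketch. Every parameter (the number of days and paths, the mean and volatility of daily returns, the starting value) is an assumption picked for illustration, and the returns are drawn from a normal distribution purely for simplicity.

```python
import numpy as np
import pandas as pd

# All parameters below are assumptions chosen for illustration only.
rng = np.random.default_rng(seed=42)
n_days, n_paths = 250, 500
daily_mean, daily_vol = 0.0004, 0.01
start_value = 10_000.0

# Draw normally distributed daily returns: one column per simulated path.
daily_returns = pd.DataFrame(
    rng.normal(daily_mean, daily_vol, size=(n_days, n_paths))
)

# Compound the returns to get the portfolio value along each path.
paths = start_value * (1 + daily_returns).cumprod()

# Estimate the chance that the portfolio ends below its starting value.
final_values = paths.iloc[-1]
prob_loss = (final_values < start_value).mean()
print(prob_loss)
```

The DataFrame holds one simulated path per column, so summary questions about the outcome reduce to ordinary column-wise operations.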

Probability and Bayesian statistics

Bayesian statistics is an approach to statistical inference, derived from Bayes' theorem, a mathematical equation built off simple probability axioms. It allows an analyst to calculate any conditional probability of interest. A conditional probability is simply the probability of event A given that event B has occurred.

Therefore, in probability terms, the data events have already occurred and have been collected (since we know the probability). By using Bayes' theorem, we can then calculate the probability of various things of interest given, or conditional upon, this already observed data.
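The arithmetic of Bayes' theorem, P(A | B) = P(B | A) * P(A) / P(B), can be worked through in a few lines. The probabilities below are hypothetical numbers chosen only to make the calculation concrete.

```python
# Hypothetical numbers for illustration only.
p_a = 0.01              # P(A): prior probability of event A
p_b_given_a = 0.95      # P(B | A): chance of observing B when A is true
p_b_given_not_a = 0.05  # P(B | not A): chance of observing B otherwise

# Total probability of observing B, summed over both cases for A.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)
```

With these inputs the updated probability of A stays well below one half even after B is observed, because A was rare to begin with.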

Bayesian modeling is beyond the scope of this book, but again the underlying data models are well handled using pandas and then actually analyzed using libraries such as PyMC.

Correlation

Correlation is one of the most common statistics and is directly built into the pandas DataFrame. A correlation is a single number that describes the degree of relationship between two variables, and specifically between two sequences of observations of those variables.

A common example of using a correlation is to determine how closely the prices of two stocks follow each other as time progresses. If the changes move closely, the two stocks have a high correlation, and if there is no discernible pattern they are uncorrelated. This is valuable information that can be used in a number of investment strategies.

The level of correlation of two stocks can also vary slightly with the time frame of the entire dataset, as well as the interval. Fortunately, pandas has powerful capabilities for us to easily change these parameters and rerun correlations. We will look at correlations in several places later in the book.
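The sketch below shows one way to rerun a correlation with a different time frame and interval. The three tickers and their prices are invented for this example.

```python
import pandas as pd

# Hypothetical daily closing prices for three made-up tickers.
prices = pd.DataFrame(
    {
        "AAA": [10.0, 10.2, 10.1, 10.4, 10.6, 10.5, 10.9, 11.0],
        "BBB": [20.0, 20.5, 20.2, 20.9, 21.3, 21.0, 21.9, 22.0],
        "CCC": [30.0, 29.5, 30.2, 29.8, 30.1, 30.4, 29.9, 30.0],
    },
    index=pd.date_range("2024-01-01", periods=8, freq="D"),
)

# Correlate the changes in price rather than the price levels.
returns = prices.pct_change().dropna()

# Pairwise correlation matrix over the whole period.
corr = returns.corr()
print(corr)

# Same calculation restricted to a shorter time frame.
print(returns.loc["2024-01-05":].corr())

# Same calculation on a coarser interval (every second day).
print(prices.iloc[::2].pct_change().dropna().corr())
```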

Regression

Regression is a statistical measure that estimates the strength of the relationship between a dependent variable and a series of other variables. It can be used to understand the relationships between variables. An example in finance would be understanding the relationship between commodity prices and the stocks of businesses dealing in those commodities.

There was originally a regression model built directly into pandas, but it has been moved out into the StatsModels library. This shows a pattern common in pandas. Often pandas has concepts built into it, but as they mature they are deemed to fit most effectively into other Python libraries. This is both good and bad. It is initially great to have it directly in pandas, but as you upgrade to new versions of pandas it can break your code!
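For a sense of what a simple regression produces, the sketch below fits a straight line with NumPy's `polyfit`. This is a stand-in chosen to keep the example dependency-light, not the StatsModels API the text refers to, and the commodity and stock prices are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical observations: a commodity price and a related stock price.
data = pd.DataFrame(
    {
        "commodity": [50.0, 52.0, 54.0, 56.0, 58.0, 60.0],
        "stock": [20.1, 21.0, 21.8, 23.1, 23.9, 25.0],
    }
)

# Fit stock = slope * commodity + intercept by least squares.
slope, intercept = np.polyfit(data["commodity"], data["stock"], deg=1)

# For a single-predictor linear fit, R-squared is the squared correlation.
r_squared = data["commodity"].corr(data["stock"]) ** 2

print(slope, intercept, r_squared)
```

For anything beyond a quick line fit (multiple predictors, standard errors, diagnostics), a dedicated library such as StatsModels is the more appropriate tool.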