Pandas is a Python package that supports fast, flexible, and expressive data structures, as well as computing functions for data analysis. The following are some prominent features that Pandas supports:
- Data structure with labeled axes. This makes the program clean and clear and avoids common errors from misaligned data.
- Flexible handling of missing data.
- Intelligent label-based slicing, fancy indexing, and subset creation of large datasets.
- Powerful arithmetic operations and statistical computations on a custom axis via axis label.
- Robust input and output support for loading or saving data from and to files, databases, or HDF5 format.
For installation, we recommend the easy route of installing Pandas as part of Anaconda, a cross-platform distribution for data analysis and scientific computing. You can refer to http://docs.continuum.io/anaconda/ to download and install the distribution.
After installation, we can use it like other Python packages. Firstly, we have to import the following packages at the beginning of the program:
>>> import pandas as pd
>>> import numpy as np
Let's first get acquainted with two of Pandas' primary data structures: the Series and the DataFrame. They can handle the majority of use cases in finance, statistics, social science, and many areas of engineering.
A Series is a one-dimensional object similar to an array, a list, or a column in a table. Each item in a Series is assigned to an entry in an index:
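As a minimal sketch of this idea, the following builds a Series with explicit labels. The values and labels are made up for illustration only.

```python
import pandas as pd

# Hypothetical values and labels, chosen only for illustration.
s1 = pd.Series([1, 4, -2, 7], index=["a", "b", "c", "d"])

print(s1)        # each value is shown next to its index label
print(s1.index)  # the labels are stored in an Index object
```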
We can access the value of a Series by using the index:
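A short sketch of label-based access, again on made-up data: a single label returns a scalar, while a list of labels returns a new Series.

```python
import pandas as pd

# Hypothetical data for illustration.
s1 = pd.Series([1, 4, -2, 7], index=["a", "b", "c", "d"])

single = s1["b"]         # one label returns a scalar value
subset = s1[["a", "d"]]  # a list of labels returns a smaller Series
```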
Sometimes, we want to filter or rename the index of a Series created from a Python dictionary. In that case, we can pass the selected index list directly to the constructor, as in the preceding example. Only dictionary keys that appear in the index list will be kept in the Series object. Conversely, index labels that are missing from the dictionary are initialized to NaN by Pandas:
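The behaviour described above can be sketched as follows. The dictionary and the label list are hypothetical.

```python
import pandas as pd

# Hypothetical dictionary; the keys become index labels.
prices = {"apple": 1.2, "banana": 0.5, "cherry": 3.0}

# Keep only the listed labels. "apple" is dropped because it is not listed,
# and "durian" has no entry in the dictionary, so it is filled with NaN.
s2 = pd.Series(prices, index=["banana", "cherry", "durian"])
```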
The library also supports functions that detect missing data:
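For instance, isnull and notnull return boolean masks that mark missing and present entries. The data below is invented for illustration.

```python
import numpy as np
import pandas as pd

s3 = pd.Series([0.5, np.nan, 3.0], index=["x", "y", "z"])  # hypothetical data

missing_mask = pd.isnull(s3)  # True where a value is missing
present_mask = s3.notnull()   # True where a value is present
```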
Similarly, we can also initialize a Series from a scalar value:
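A minimal sketch: when a scalar is passed together with an index, the value is repeated for every label.

```python
import pandas as pd

# The scalar 5 is broadcast to all three (hypothetical) labels.
s4 = pd.Series(5, index=["a", "b", "c"])
```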
The DataFrame is a tabular data structure comprising a set of ordered columns and rows. It can be thought of as a group of Series objects that share a common index, with each Series forming one named column. There are a number of ways to initialize a DataFrame object. Firstly, let's take a look at the common example of creating a DataFrame from a dictionary of lists:
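A small sketch of this construction, using invented column names and values. Each dictionary key becomes a column and each list supplies that column's values.

```python
import pandas as pd

# Hypothetical data: keys become column names, lists become column values.
data = {"city": ["A", "B", "C"], "temperature": [21.5, 18.0, 25.3]}
df1 = pd.DataFrame(data)

print(df1)  # rows receive a default integer index 0, 1, 2
```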
We can provide the index labels of a DataFrame similar to a Series:
We can construct a DataFrame out of nested lists as well:
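Both variants can be sketched together. The labels r1, r2, and so on are hypothetical names used only for this illustration.

```python
import pandas as pd

# Explicit row labels are passed through the index argument.
df2 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["r1", "r2", "r3"])

# Each inner list becomes one row; column names are supplied separately.
df3 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])
```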
A column can be retrieved as a Series by its name, either with dictionary-like notation or as an attribute, provided the column name is a syntactically valid attribute name:
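The two notations can be compared in a short sketch on made-up data. Both return the same Series.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # hypothetical data

col_by_key = df["a"]  # dictionary-like notation, works for any column name
col_by_attr = df.a    # attribute notation, only for valid Python identifiers
```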
Using a couple of methods, rows can be retrieved by position or name:
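The method names are not given in the sentence above. The sketch below assumes they are the iloc indexer (selection by integer position) and the loc indexer (selection by label), and uses made-up data.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

row_by_position = df.iloc[1]  # second row, selected by integer position
row_by_name = df.loc["y"]     # the same row, selected by its label
```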
Another common case is to load a DataFrame from an external source such as a text file. In this situation, we use the read_csv function, which expects the column separator to be a comma by default. However, we can change that by using the sep parameter. Some commonly used parameters of read_csv are:
- sep: This is the delimiter between columns. The default is the comma symbol.
- dtype: This is the data type for the data or for individual columns.
- header: This sets the row number(s) to use as the column names.
- skiprows: This sets the line numbers to skip at the start of the file.
- error_bad_lines: This controls invalid lines (too many fields), which by default cause an exception so that no DataFrame is returned. If we set this parameter to False, the bad lines will be skipped. In Pandas 1.3 and later, the on_bad_lines parameter takes over this role.
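The following sketch combines sep, skiprows, and dtype. To stay self-contained it reads from an in-memory buffer instead of a file on disk; the content and the column names are invented.

```python
import io

import pandas as pd

# In-memory stand-in for a semicolon-separated text file (hypothetical content).
raw = io.StringIO("# exported sample\nname;score\nann;90\nbob;85\n")

# Skip the leading comment line, split on ';', and read score as float.
df = pd.read_csv(raw, sep=";", skiprows=1, dtype={"score": "float64"})
```

Passing a file path instead of the buffer works the same way.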
Pandas supports many essential functionalities that are useful to manipulate Pandas data structures. In this book, we will focus on the most important features regarding exploration and analysis.
Reindex is a critical method in the Pandas data structures. It conforms the data to a given set of labels along a particular axis of a Pandas object, introducing missing values for labels that were not present before.
First, let's view a reindex example on a Series object:
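A minimal sketch of reindexing, on made-up data: existing labels are reordered, labels that are not requested are dropped, and new labels receive NaN unless a fill_value is given.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])  # hypothetical data

# "b" is dropped, "c" and "a" are reordered, and the new label "d" gets NaN.
s_reindexed = s.reindex(["c", "a", "d"])

# The same call with a fixed replacement for labels that did not exist.
s_filled = s.reindex(["c", "a", "d"], fill_value=0)
```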
Argument | Description |
---|---|
index | This is the new labels/index to conform to. |
method | This is the method to use for filling holes in a reindexed object. |
copy | This returns a new object. The default setting is True. |
level | This matches index values on the passed MultiIndex level. |
fill_value | This is the value to use for missing values. The default setting is NaN. |
limit | This is the maximum size gap to fill when forward or backward filling. |
In common data analysis situations, our data structure objects contain many columns and a large number of rows, so it is impractical to view all of their information at once. Pandas provides the head and tail functions, which allow us to inspect a small sample. By default, these functions return five elements, but we can set a custom number as well. The following example shows how to display the first five and the last three rows of a longer Series:
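A short sketch using head and tail on a made-up Series of ten values:

```python
import pandas as pd

s = pd.Series([i * i for i in range(10)])  # hypothetical data: 0, 1, 4, ..., 81

first_five = s.head()   # the default is five elements
last_three = s.tail(3)  # a custom number of elements
```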
We can also use these functions for DataFrame objects in the same way.
Firstly, we will consider arithmetic operations between objects. When the objects have different indexes, the result is indexed by the union of the two indexes. We will not explain this again because we had an example of it in the preceding section (s5 + s6). This time, we will show another example with a DataFrame:
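As an illustration with two invented frames, adding them aligns both rows and columns. Only the cell that exists in both inputs gets a value, and every other cell of the union becomes NaN.

```python
import pandas as pd

# Hypothetical frames that overlap only in row 1 and column "b".
df_a = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[0, 1])
df_b = pd.DataFrame({"b": [10, 20], "c": [30, 40]}, index=[1, 2])

total = df_a + df_b  # index is the union [0, 1, 2], columns are a, b, c
```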
The mechanism for computing the result is similar for both kinds of data structure. A problem that we need to consider is the missing data between objects. In this case, if we want to fill with a fixed value, such as 0, we can use arithmetic functions such as add, sub, div, and mul, together with their supported parameters such as fill_value:
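A sketch of add with fill_value on two invented frames. A cell that is missing in only one operand is treated as 0, while a cell that is missing in both operands stays NaN.

```python
import pandas as pd

# Hypothetical frames that overlap only in row 1 and column "b".
df_a = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[0, 1])
df_b = pd.DataFrame({"b": [10, 20], "c": [30, 40]}, index=[1, 2])

# Cells present in only one frame are combined with 0 instead of becoming NaN.
filled_total = df_a.add(df_b, fill_value=0)
```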
Next, we will discuss comparison operations between data objects. We have some supported functions such as equal (eq), not equal (ne), greater than (gt), less than (lt), less than or equal (le), and greater than or equal (ge). Here is an example:
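A brief sketch with two made-up frames of the same shape. Each comparison function returns a boolean DataFrame.

```python
import pandas as pd

df_x = pd.DataFrame({"a": [1, 5], "b": [7, 2]})  # hypothetical data
df_y = pd.DataFrame({"a": [1, 3], "b": [9, 2]})  # hypothetical data

equal_mask = df_x.eq(df_y)    # element-wise ==
greater_mask = df_x.gt(df_y)  # element-wise >
```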
The statistics methods that a library supports are really important in data analysis. To gain insight into a big data object, we need summary information such as the mean, sum, or quantiles. Pandas supports a large number of methods to compute them. Let's consider a simple example of calculating the sum of df5, which is a DataFrame object:
When we do not specify the axis along which to calculate the sum, the function calculates along the index axis by default, which is axis 0:
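The effect of the axis argument can be sketched as follows. The frame here is a stand-in that assumes a 3x3 grid of the integers 0 to 8, since the definition of df5 is not shown in this section.

```python
import numpy as np
import pandas as pd

# Stand-in frame (an assumption): integers 0..8 in a 3x3 grid.
df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["a", "b", "c"])

col_sums = df.sum()        # default axis 0: one sum per column
row_sums = df.sum(axis=1)  # axis 1: one sum per row
```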
Here, we have a summary table for common supported statistics functions in Pandas:
Function | Description |
---|---|
idxmin(axis), idxmax(axis) | These compute the index labels with the minimum or maximum corresponding values. |
value_counts() | This computes the frequency of unique values. |
count() | This returns the number of non-null values in a data object. |
mean(), median(), min(), max() | These return the mean, median, minimum, and maximum values of an axis in a data object. |
std(), var(), sem() | These return the standard deviation, variance, and standard error of the mean. |
abs() | This gets the absolute value of a data object. |
Pandas supports function application, which allows us to apply functions from other packages such as NumPy, or our own functions, to data structure objects. Here, we illustrate two examples of these cases. Firstly, we use apply to execute the std() function, which is the standard deviation function of the NumPy package:
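A hedged sketch of this first case. The frame is an assumed 3x3 stand-in, and np.std is wrapped in a lambda that works on the underlying array, so NumPy's own default (population standard deviation, ddof=0) is used regardless of how a given Pandas version dispatches NumPy functions passed to apply.

```python
import numpy as np
import pandas as pd

# Stand-in frame (an assumption): integers 0..8 in a 3x3 grid.
df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["a", "b", "c"])

# Apply NumPy's std to each row; to_numpy() hands np.std a plain array.
row_std = df.apply(lambda row: np.std(row.to_numpy()), axis=1)
```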
Secondly, to apply our own function or formula to a data object, we follow these steps:
- Define the function or formula that you want to apply on a data object.
- Call the defined function or formula via apply. In this step, we also need to figure out the axis that we want to apply the calculation to:
>>> f = lambda x: x.max() - x.min()  # step 1
>>> df5.apply(f, axis=1)  # step 2
0    2
1    2
2    2
dtype: int64
>>> def sigmoid(x):
...     return 1/(1 + np.exp(x))
...
>>> df5.apply(sigmoid)
          a         b         c
0  0.500000  0.268941  0.119203
1  0.047426  0.017986  0.006693
2  0.002473  0.000911  0.000335
In this section, we will focus on how to get, set, or slice subsets of Pandas data structure objects. As we learned in previous sections, Series or DataFrame objects have axis labeling information. This information can be used to identify items that we want to select or assign a new value to in the object:
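For a Series, getting, setting, and slicing by label can be sketched like this, on made-up data. Note that a label slice includes both endpoints.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])  # hypothetical data

value = s["b"]       # get a value by label
s["b"] = 20          # assign a new value by label
sliced = s["a":"b"]  # label slices include the end label
```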
If the data object is a DataFrame structure, we can also proceed in a similar way:
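A comparable sketch for a DataFrame, with invented data: selecting columns by name, adding a column by assignment, and filtering rows with a boolean condition.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

cols = df[["b", "a"]]        # select columns by name, in the given order
df["c"] = df["a"] + df["b"]  # assign a new column by label
filtered = df[df["a"] > 1]   # keep the rows where the condition holds
```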
For label indexing on the rows of a DataFrame, we use the ix function, which enables us to select a set of rows and columns in the object. There are two parameters that we need to specify: the row and column labels that we want to get. By default, if we do not specify the selected column names, the function returns the selected rows with all columns in the object. Note that ix was removed in Pandas 1.0; the loc indexer provides the same label-based selection:
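Because ix is not available in current Pandas releases, the sketch below uses loc for the same kind of label-based selection. The data is made up.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

rows_all_cols = df.loc[["x", "z"]]          # row labels only: all columns
rows_some_cols = df.loc[["x", "z"], ["b"]]  # row labels and column labels
```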
Method | Description |
---|---|
icol, irow | These select a single row or column by integer location. |
get_value, set_value | These select or set a single value of a data object by row or column label. |
xs | This selects a single column or row as a Series by label. |
Let's start with correlation and covariance computation between two data objects. Both the Series and DataFrame have a cov method. On a DataFrame object, this method will compute the covariance between the Series inside the object:
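A minimal sketch of both forms on invented columns. Here b is twice a and c decreases as a increases, so the signs of the covariances are easy to check by hand.

```python
import pandas as pd

# Hypothetical data: b = 2 * a, and c runs in the opposite direction to a.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]})

cov_matrix = df.cov()         # pairwise covariance between all columns
cov_ab = df["a"].cov(df["b"])  # covariance between two Series
```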
We also have the corrwith function that supports calculating correlations between Series that have the same label contained in different DataFrame objects:
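As a sketch on two made-up frames, corrwith pairs the columns that share a name and returns one correlation per column.

```python
import pandas as pd

# Hypothetical frames: column "a" moves together, column "b" moves oppositely.
df_p = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 3, 2, 4]})
df_q = pd.DataFrame({"a": [2, 4, 6, 8], "b": [4, 2, 3, 1]})

corrs = df_p.corrwith(df_q)  # one correlation value per shared column label
```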
In this section, we will discuss missing, NaN, or null values in Pandas data structures. It is very common to end up with missing data in an object. One such case that creates missing data is reindexing:
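A small sketch of how reindexing introduces missing values, using an invented frame: the label "w" does not exist in the original index, so its whole row is filled with NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=["x", "y", "z"], columns=["a", "b", "c"])

# "w" is a new label, so every column in that row becomes NaN.
df_reindexed = df.reindex(["z", "w", "x"])
null_mask = df_reindexed.isnull()
```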
>>> df10.isnull()
       a      b      c
3  False  False  False
2  False  False  False
a   True   True   True
0  False  False  False
On a Series, we can drop all null data and their index values by using the dropna function:
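For example, on a made-up Series with one missing entry:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])  # hypothetical data

clean = s.dropna()  # returns a new Series without the missing entry
```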
Another way to control missing values is to use the supported parameters of the functions that we introduced in the previous section. They are also very useful for solving this problem. In our experience, we should assign a fixed value to missing entries when we create data objects. This will make our objects cleaner in later processing steps. For example, consider the following:
We can also use the fillna function to replace missing values with a custom value:
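A minimal sketch of fillna on invented data. The replacement value -1 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])  # hypothetical data

filled = s.fillna(-1)  # every missing entry is replaced by -1
```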
In this section we will consider some advanced Pandas use cases.
Hierarchical indexing provides us with a way to work with higher dimensional data in a lower dimension by structuring the data object into multiple index levels on an axis:
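A short sketch of a Series with two index levels, built with MultiIndex.from_product on made-up labels:

```python
import pandas as pd

# Two levels: an outer level ("a", "b") and an inner level (0, 1).
idx = pd.MultiIndex.from_product([["a", "b"], [0, 1]])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)  # hypothetical values

outer_a = s["a"]       # selecting on the outer level returns the inner Series
single = s.loc["b", 1]  # a full label pair returns one value
```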
In the preceding example, we have a Series object that has two index levels. The object can be rearranged into a DataFrame using the unstack function. Conversely, the stack function can be used to reverse the operation:
>>> s8.unstack()
          0         1
a  0.549211  0.420874
b  0.051516  0.715021
c  0.503072  0.720772
d  0.373037  0.207026
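The round trip can be sketched on a small made-up Series: unstack moves the inner index level into the columns, and stack moves it back.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["a", "b"], [0, 1]])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)  # hypothetical values

wide = s.unstack()   # inner level becomes the columns of a DataFrame
tall = wide.stack()  # columns are folded back into an inner index level
```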
We can also create a DataFrame to have a hierarchical index in both axes:
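A sketch of such a frame, with invented level names and labels. Both the rows and the columns carry a two-level index.

```python
import numpy as np
import pandas as pd

# Hypothetical labels for both axes.
rows = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["outer", "inner"])
cols = pd.MultiIndex.from_tuples([("x", "p"), ("x", "q"), ("y", "p")])

df = pd.DataFrame(np.arange(12).reshape(4, 3), index=rows, columns=cols)

sub = df["x"]                         # all columns under the outer label "x"
cell = df.loc[("a", 2), ("y", "p")]   # one cell addressed by two label pairs
```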
After grouping data into multiple index levels, we can also use most of the descriptive and statistics functions that have a level option, which can be used to specify the level we want to process:
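Recent Pandas releases no longer accept a level argument on functions such as sum, so this sketch computes the per-level result with groupby(level=...) instead. The data is made up.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["a", "b"], [0, 1]])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)  # hypothetical values

# Sum the values separately for each label of the outer level (level 0).
per_outer = s.groupby(level=0).sum()
```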
The Panel is another data structure for three-dimensional data in Pandas. However, it is less frequently used than the Series or the DataFrame. You can think of a Panel as a table of DataFrame objects. Note that Panel was deprecated in Pandas 0.20 and removed in 0.25, so the examples in this part apply to older releases only. We can create a Panel object from a 3D ndarray or a dictionary of DataFrame objects:
Each item in a Panel is a DataFrame. We can select an item by item name:
Alternatively, if we want to select data via an axis or data position, we can use the ix method, like on Series or DataFrame:
The link https://www.census.gov/2010census/csv/pop_change.csv contains a US census dataset. It has 23 columns and one row for each US state, as well as a few rows for macro regions such as North, South, and West.
- Get this dataset into a Pandas DataFrame. Hint: just skip those rows that do not seem helpful, such as comments or description.
- While the dataset contains change metrics for each decade, we are interested in the population change during the second half of the twentieth century, that is, between 1950 and 2000. Which region has seen the biggest and the smallest population growth in this time span? Also, which US state?
Advanced open-ended exercise:
- Find more census data on the internet, not just on the US but on the world's countries. Try to find GDP data for the same time as well. Try to align this data to explore patterns. How are GDP and population growth related? Are there any special cases, such as countries with high GDP but low population growth or countries with the opposite history?