Python: End-to-end Data Analysis

By: Ivan Idris, Luiz Felipe Martins, Martin Czygan, Phuong Vo.T.H, Magnus Vilhelm Persson

Overview of this book

Data analysis is the process of applying logical and analytical reasoning to study each component of the data present in a system. Python is a multi-domain, high-level programming language that offers a wide range of tools and libraries, and it has steadily evolved into one of the primary languages for data science. Have you ever imagined becoming an expert at approaching data analysis problems effectively, solving them, and extracting all of the available information from your data? If so, this is the course you need. In this course, we will get you started with Python data analysis by introducing the basics of data analysis and supporting Python libraries such as matplotlib, NumPy, and pandas. You will create visualizations by choosing color maps, shapes, sizes, and palettes, and then delve into statistical data analysis using distribution algorithms and correlations. You'll then find your way around different data and numerical problems, get to grips with Spark and HDFS, and set up migration scripts for web mining. You'll be able to quickly and accurately perform hands-on sorting, reduction, and subsequent analysis, and fully appreciate how data analysis methods can support business decision-making. Finally, you will delve into advanced techniques such as performing regression, quantifying cause and effect using Bayesian methods, and discovering how to use Python's tools for supervised machine learning. The course provides you with highly practical content explaining data analysis with Python, from the following Packt books: 1. Getting Started with Python Data Analysis. 2. Python Data Analysis Cookbook. 3. Mastering Python Data Analysis. By the end of this course, you will have all the knowledge you need to analyze data of varying complexity and turn it into actionable insights.

Preface

What this learning path covers

What you need for this learning path

Who this learning path is for

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Free Chapter

1. Module 1

1. Introducing Data Analysis and Libraries

2. NumPy Arrays and Vectorized Computation

3. Data Analysis with Pandas

4. Data Visualization

5. Time Series

6. Interacting with Databases

7. Data Analysis Application Examples

8. Machine Learning Models with scikit-learn

2. Module 2

1. Laying the Foundation for Reproducible Data Analysis

2. Creating Attractive Data Visualizations

3. Statistical Data Analysis and Probability

4. Dealing with Data and Numerical Issues

5. Web Mining, Databases, and Big Data

6. Signal Processing and Timeseries

7. Selecting Stocks with Financial Data Analysis

8. Text Mining and Social Network Analysis

9. Ensemble Learning and Dimensionality Reduction

10. Evaluating Classifiers, Regressors, and Clusters

11. Analyzing Images

12. Parallelism and Performance

A. Glossary

B. Function Reference

C. Online Resources

D. Tips and Tricks for Command-Line and Miscellaneous Tools

3. Module 3

1. Tools of the Trade

2. Exploring Data

3. Learning About Models

4. Regression

5. Clustering

6. Bayesian Methods

7. Supervised and Unsupervised Learning

8. Time Series Analysis

E. More on Jupyter Notebook and matplotlib Styles

A. Bibliography

Index

Chapter 3. Data Analysis with Pandas

In this chapter, we will explore another data analysis library called Pandas. The goal of this chapter is to give you some basic knowledge and concrete examples for getting started with Pandas.

An overview of the Pandas package

Pandas is a Python package that supports fast, flexible, and expressive data structures, as well as computing functions for data analysis. The following are some prominent features that Pandas supports:

Data structure with labeled axes. This makes the program clean and clear and avoids common errors from misaligned data.
Flexible handling of missing data.
Intelligent label-based slicing, fancy indexing, and subset creation of large datasets.
Powerful arithmetic operations and statistical computations on a custom axis via axis label.
Robust input and output support for loading or saving data from and to files, databases, or HDF5 format.

After installation, we can use it like other Python packages. Firstly, we have to import the following packages at the beginning of the program:

>>> import pandas as pd
>>> import numpy as np

The Pandas data structure

Let's first get acquainted with two of Pandas' primary data structures: the Series and the DataFrame. They can handle the majority of use cases in finance, statistics, social science, and many areas of engineering.

Series

A Series is a one-dimensional object similar to an array, a list, or a column in a table. Each item in a Series is assigned to an entry in an index:

>>> s1 = pd.Series(np.random.rand(4),
                   index=['a', 'b', 'c', 'd'])
>>> s1
a    0.6122
b    0.98096
c    0.3350
d    0.7221
dtype: float64

By default, if no index is passed, it will be created to have values ranging from 0 to N-1, where N is the length of the Series:

>>> s2 = pd.Series(np.random.rand(4))
>>> s2
0    0.6913
1    0.8487
2    0.8627
3    0.7286
dtype: float64

We can access the value of a Series by using the index:

>>> s1['c']
0.3350
>>> s1['c'] = 3.14
>>> s1[['c', 'a', 'b']]
c    3.14
a    0.6122
b    0.98096
dtype: float64

This accessing method is similar to a Python dictionary. Therefore, Pandas also allows us to initialize a Series object directly from a Python dictionary:

>>> s3 = pd.Series({'001': 'Nam', '002': 'Mary',
                    '003': 'Peter'})
>>> s3
001    Nam
002    Mary
003    Peter
dtype: object

Sometimes, we want to filter or rename the index of a Series created from a Python dictionary. At such times, we can pass the selected index list directly to the initial function, similarly to the process in the above example. Only elements that exist in the index list will be in the Series object. Conversely, indexes that are missing in the dictionary are initialized to default NaN values by Pandas:

>>> s4 = pd.Series({'001': 'Nam', '002': 'Mary',
                    '003': 'Peter'}, index=[
                    '002', '001', '024', '065'])
>>> s4
002    Mary
001    Nam
024    NaN
065    NaN
dtype: object

The library also supports functions that detect missing data:

>>> pd.isnull(s4)
002    False
001    False
024    True
065    True
dtype: bool
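
As a minimal sketch of what might follow once missing entries are detected, the block below rebuilds s4 and then filters or fills the gaps. The placeholder string 'unknown' is an assumption chosen for illustration; it does not come from the text.

```python
import pandas as pd

# Rebuild s4: labels '024' and '065' are not keys of the dictionary,
# so Pandas fills them with NaN.
s4 = pd.Series({'001': 'Nam', '002': 'Mary', '003': 'Peter'},
               index=['002', '001', '024', '065'])

missing_mask = pd.isnull(s4)       # True where a value is missing
present = s4[pd.notnull(s4)]       # keep only the non-missing entries
dropped = s4.dropna()              # the same selection in method form
filled = s4.fillna('unknown')      # 'unknown' is an illustrative placeholder

print(missing_mask.sum())          # number of missing entries
print(filled)
```

Whether to drop or fill depends on the analysis; dropping shortens the Series, while filling keeps every label.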

We can also initialize a Series from a scalar value, which is repeated for every index label:

>>> s5 = pd.Series(2.71, index=['x', 'y'])
>>> s5
x    2.71
y    2.71
dtype: float64

A Series object can be initialized with NumPy objects as well, such as ndarray. Moreover, Pandas can automatically align data indexed in different ways in arithmetic operations:

>>> s6 = pd.Series(np.array([2.71, 3.14]), index=['z', 'y'])
>>> s6
z    2.71
y    3.14
dtype: float64
>>> s5 + s6
x    NaN
y    5.85
z    NaN
dtype: float64
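
The alignment rule can be sketched a little further. Assuming we would rather treat a label that is missing on one side as 0 than get NaN, the add method accepts a fill_value argument; the variable names plain_sum and filled_sum are illustrative only.

```python
import numpy as np
import pandas as pd

s5 = pd.Series(2.71, index=['x', 'y'])
s6 = pd.Series(np.array([2.71, 3.14]), index=['z', 'y'])

# The + operator aligns on labels; a label present on only one side gives NaN.
plain_sum = s5 + s6

# The add method takes fill_value, which stands in for the missing side.
filled_sum = s5.add(s6, fill_value=0)

print(plain_sum)
print(filled_sum)
```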

The DataFrame

The DataFrame is a tabular data structure comprising a set of ordered columns and rows. It can be thought of as a group of Series objects, one per column, that share a common row index. There are a number of ways to initialize a DataFrame object. Firstly, let's take a look at the common example of creating a DataFrame from a dictionary of lists:

>>> data = {'Year': [2000, 2005, 2010, 2014],
         'Median_Age': [24.2, 26.4, 28.5, 30.3],
         'Density': [244, 256, 268, 279]}
>>> df1 = pd.DataFrame(data)
>>> df1
   Density  Median_Age  Year
0      244        24.2  2000
1      256        26.4  2005
2      268        28.5  2010
3      279        30.3  2014

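By default, the Pandas version used here orders the columns alphabetically when a DataFrame is built from a dictionary; more recent versions keep the dictionary's insertion order instead. We can set the order explicitly by passing the columns argument to the constructor: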

>>> df2 = pd.DataFrame(data, columns=['Year', 'Density', 
                                      'Median_Age'])
>>> df2
   Year  Density  Median_Age
0  2000      244        24.2
1  2005      256        26.4
2  2010      268        28.5
3  2014      279        30.3
>>> df2.index
Int64Index([0, 1, 2, 3], dtype='int64')
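
The columns argument does more than reorder. As a small sketch, it also selects a subset of the dictionary keys, and a requested name with no matching key yields a column of NaN. The name 'Area' below is a hypothetical column used only to show this.

```python
import pandas as pd

data = {'Year': [2000, 2005, 2010, 2014],
        'Median_Age': [24.2, 26.4, 28.5, 30.3],
        'Density': [244, 256, 268, 279]}

# Request two existing columns in a chosen order plus 'Area',
# which is not a key of data (a hypothetical column).
df = pd.DataFrame(data, columns=['Year', 'Density', 'Area'])

print(list(df.columns))            # ['Year', 'Density', 'Area']
print(df['Area'].isnull().all())   # every entry of the unmatched column is missing
```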

We can provide the index labels of a DataFrame in the same way as for a Series:

>>> df3 = pd.DataFrame(data, columns=['Year', 'Density',  
                   'Median_Age'], index=['a', 'b', 'c', 'd'])
>>> df3.index
Index([u'a', u'b', u'c', u'd'], dtype='object')

We can construct a DataFrame out of nested lists as well:

>>> df4 = pd.DataFrame([
    ['Peter', 16, 'pupil', 'TN', 'M', None],
    ['Mary', 21, 'student', 'SG', 'F', None],
    ['Nam', 22, 'student', 'HN', 'M', None],
    ['Mai', 31, 'nurse', 'SG', 'F', None],
    ['John', 28, 'lawyer', 'SG', 'M', None]],
    columns=['name', 'age', 'career', 'province', 'sex', 'award'])

Columns can be accessed by column name as a Series can, either by dictionary-like notation or as an attribute, if the column name is a syntactically valid attribute name:

>>> df4.name    # or df4['name'] 
0    Peter
1    Mary
2    Nam
3    Mai
4    John
Name: name, dtype: object

To modify or append a new column to the created DataFrame, we specify the column name and the value we want to assign:

>>> df4['award'] = None
>>> df4
    name  age   career province sex award
0  Peter   16    pupil       TN   M  None
1   Mary   21  student       SG   F  None
2    Nam   22  student       HN   M  None
3    Mai   31    nurse       SG   F  None
4   John   28   lawyer       SG   M  None

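Rows can be retrieved by position or by label. The ix indexer shown here belongs to the Pandas version used in this book; it was deprecated in version 0.20 and removed in 1.0, where loc (label-based) and iloc (position-based) take its place: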

>>> df4.ix[1]
name           Mary
age              21
career      student
province         SG
sex               F
award          None
Name: 1, dtype: object
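
To make the difference between label-based and position-based retrieval visible, here is a minimal sketch using loc and iloc on a small frame. The string row labels 'p1' to 'p3' are an assumption added for illustration; with the default integer index, labels and positions coincide and the difference is hidden.

```python
import pandas as pd

df = pd.DataFrame(
    [['Peter', 16, 'pupil'], ['Mary', 21, 'student'], ['Nam', 22, 'student']],
    columns=['name', 'age', 'career'],
    index=['p1', 'p2', 'p3'])    # illustrative string labels

by_label = df.loc['p2']      # label-based: the row whose index label is 'p2'
by_position = df.iloc[1]     # position-based: the second row

print(by_label)
print(by_label.equals(by_position))
```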

A DataFrame object can also be created from different data structures such as a list of dictionaries, a dictionary of Series, or a record array. The method to initialize a DataFrame object is similar to the examples above.
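
Two of these alternatives can be sketched as follows; the sample values are made up for illustration. Note how a dictionary of Series aligns rows on the union of the Series indexes, leaving NaN where a Series has no entry.

```python
import pandas as pd

# From a list of dictionaries: each dictionary becomes one row.
records = [{'name': 'Peter', 'age': 16},
           {'name': 'Mary', 'age': 21}]
df_records = pd.DataFrame(records)

# From a dictionary of Series: each Series becomes one column,
# and rows are aligned on the union of the Series indexes.
df_series = pd.DataFrame({
    'age': pd.Series([16, 21], index=['Peter', 'Mary']),
    'province': pd.Series(['SG', 'HN'], index=['Mary', 'Nam'])})

print(df_records)
print(df_series)
```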

Another common case is to provide a DataFrame with data from a location such as a text file. In this situation, we use the read_csv function that expects the column separator to be a comma, by default. However, we can change that by using the sep parameter:

# person.csv file
name,age,career,province,sex
Peter,16,pupil,TN,M
Mary,21,student,SG,F
Nam,22,student,HN,M
Mai,31,nurse,SG,F
John,28,lawyer,SG,M
# loading person.csv into a DataFrame
>>> df4 = pd.read_csv('person.csv')
>>> df4
    name  age   career province sex
0  Peter   16    pupil       TN   M
1   Mary   21  student       SG   F
2    Nam   22  student       HN   M
3    Mai   31    nurse       SG   F
4   John   28   lawyer       SG   M

While reading a data file, we sometimes want to skip a line or an invalid value. As of Pandas 0.16.2, read_csv supports over 50 parameters for controlling the loading process. Some commonly useful parameters are as follows:

sep: This is the delimiter between columns. The default is a comma.
dtype: This is the data type for the data or for individual columns.
header: This sets the row number(s) to use as the column names.
skiprows: This sets the line numbers to skip, or the number of lines to skip at the start of the file.
error_bad_lines: This controls how invalid lines (those with too many fields) are handled. By default, they cause an exception and no DataFrame is returned. If we set this parameter to False, the bad lines are skipped instead. (Later Pandas versions replace this parameter with on_bad_lines.)
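
A short sketch of several of these parameters working together follows. To keep it self-contained, it reads from an in-memory io.StringIO buffer rather than a file on disk; the semicolon-separated content and the leading comment line are assumptions made for illustration.

```python
import io
import pandas as pd

# A semicolon-separated file with one comment line before the header.
raw = io.StringIO(
    "# exported from a hypothetical system\n"
    "name;age;career\n"
    "Peter;16;pupil\n"
    "Mary;21;student\n")

people = pd.read_csv(raw,
                     sep=';',                # column delimiter
                     skiprows=1,             # skip the leading comment line
                     header=0,               # first remaining row holds the names
                     dtype={'age': float})   # read age as float instead of int

print(people)
print(people.dtypes)
```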

Moreover, Pandas also has support for reading and writing a DataFrame directly from or to a database such as the read_frame or write_frame function within the Pandas module. We will come back to these methods later in this chapter.

The essential basic functionality

Pandas supports many essential functionalities that are useful to manipulate Pandas data structures. In this book, we will focus on the most important features regarding exploration and analysis.

Reindexing and altering labels

Reindex is a critical method in the Pandas data structures. It conforms the data to a given set of labels along a particular axis of a Pandas object.

First, let's view a reindex example on a Series object:

>>> s2.reindex([0, 2, 'b', 3])
0    0.6913
2    0.8627
b    NaN
3    0.7286
dtype: float64

When reindexed labels do not exist in the data object, a default value of NaN will be automatically assigned to the position; this holds true for the DataFrame case as well:

>>> df1.reindex(index=[0, 2, 'b', 3],
        columns=['Density', 'Year', 'Median_Age','C'])
   Density  Year  Median_Age        C
0      244  2000        24.2      NaN
2      268  2010        28.5      NaN
b      NaN   NaN         NaN      NaN
3      279  2014        30.3      NaN

We can change the NaN value in the missing index case to a custom value by setting the fill_value parameter. Let us take a look at the arguments that the reindex function supports, as shown in the following table:

Argument	Description
`index`	This is the new set of labels/index to conform to.
`method`	This is the method used to fill holes in a reindexed object. By default, gaps are not filled. `pad`/`ffill`: fill values forward; `backfill`/`bfill`: fill values backward; `nearest`: use the nearest value to fill the gap.
`copy`	This returns a new object. The default setting is `True`.
`level`	This matches index values on the passed MultiIndex level.
`fill_value`	This is the value to use for missing values. The default setting is `NaN`.
`limit`	This is the maximum size of gap to fill with the forward or backward method.
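
The fill_value and method arguments from the table can be sketched on a small Series. The values and the variable names zero_filled and forward_filled are illustrative; note that method-based filling expects a monotonically ordered index, which the example satisfies.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0], index=[0, 2, 4])

# Labels 1 and 3 are new; fill_value replaces the default NaN.
zero_filled = s.reindex(range(5), fill_value=0.0)

# method='ffill' carries the last valid value forward into the gaps.
forward_filled = s.reindex(range(5), method='ffill')

print(zero_filled.tolist())      # [10.0, 0.0, 20.0, 0.0, 30.0]
print(forward_filled.tolist())   # [10.0, 10.0, 20.0, 20.0, 30.0]
```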

Head and tail

In common data analysis situations, our data structure objects contain many columns and a large number of rows. Therefore, we cannot view or load all information of the objects. Pandas supports functions that allow us to inspect a small sample. By default, the functions return five elements, but we can set a custom number as well. The following example shows how to display the first five and the last three rows of a longer Series:

>>> s7 = pd.Series(np.random.rand(10000))
>>> s7.head()
0    0.631059
1    0.766085
2    0.066891
3    0.867591
4    0.339678
dtype: float64
>>> s7.tail(3)
9997    0.412178
9998    0.800711
9999    0.438344
dtype: float64

We can also use these functions for DataFrame objects in the same way.
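
For instance, a minimal sketch on a DataFrame (the 20-row frame is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(40).reshape(20, 2), columns=['a', 'b'])

first_rows = df.head()     # first five rows by default
last_rows = df.tail(3)     # last three rows

print(first_rows)
print(last_rows)
```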

Binary operations

Firstly, we will consider arithmetic operations between objects. When the objects have different indexes, the result is indexed by the union of the two indexes. We will not explain this again because we had an example of it in the previous section (s5 + s6). This time, we will show another example with a DataFrame:

>>> df5 = pd.DataFrame(np.arange(9).reshape(3,3),
                       columns=['a','b','c'])
>>> df5
   a  b  c
0  0  1  2
1  3  4  5
2  6  7  8
>>> df6 = pd.DataFrame(np.arange(8).reshape(2,4),
                       columns=['a','b','c','d'])
>>> df6
   a  b  c  d
0  0  1  2  3
1  4  5  6  7
>>> df5 + df6
    a   b   c   d
0   0   2   4 NaN
1   7   9  11 NaN
2 NaN NaN NaN NaN

The mechanisms for returning the result between two kinds of data structure are similar. A problem that we need to consider is the missing data between objects. In this case, if we want to fill with a fixed value, such as 0, we can use the arithmetic functions such as add, sub, div, and mul, and the function's supported parameters such as fill_value:

>>> df7 = df5.add(df6, fill_value=0)
>>> df7
   a  b   c    d
0  0  2   4    3
1  7  9  11    7
2  6  7   8  NaN

Next, we will discuss comparison operations between data objects. We have some supported functions such as equal (eq), not equal (ne), greater than (gt), less than (lt), less than or equal to (le), and greater than or equal to (ge). Here is an example:

>>> df5.eq(df6)
       a      b      c      d
0   True   True   True  False
1  False  False  False  False
2  False  False  False  False
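
The comparison functions also accept a scalar, and the resulting Boolean objects can be used to filter rows. A brief sketch, with the thresholds 4 and 3 chosen arbitrarily for illustration:

```python
import numpy as np
import pandas as pd

df5 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])

mask = df5.gt(4)                          # element-wise: is each value greater than 4?
same_as_operator = df5 > 4                # the operator form gives the same result
rows_with_large_a = df5[df5['a'].ge(3)]   # keep rows where column a >= 3

print(mask)
print(rows_with_large_a)
```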

Functional statistics

The statistical methods that a library supports are really important in data analysis. To gain insight into a large data object, we need summary information such as the mean, sum, or quantiles. Pandas supports a large number of methods to compute them. Let's consider a simple example of calculating the sums of df5, which is a DataFrame object:

>>> df5.sum()
a     9
b    12
c    15
dtype: int64

When we do not specify the axis along which to compute the sum, the function works along the index axis (axis 0) by default:

Series: We do not need to specify the axis.
DataFrame: Columns (axis = 1) or index (axis = 0). The default setting is axis 0.
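
The effect of the axis setting can be sketched on df5; the names column_totals and row_totals are illustrative.

```python
import numpy as np
import pandas as pd

df5 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])

column_totals = df5.sum()          # axis=0 (default): one total per column
row_totals = df5.sum(axis=1)       # axis=1: one total per row

print(column_totals.tolist())      # [9, 12, 15]
print(row_totals.tolist())         # [3, 12, 21]
```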

We also have the skipna parameter that allows us to decide whether to exclude missing data or not. By default, it is set to True:

>>> df7.sum(skipna=False)
a    13
b    18
c    23
d   NaN
dtype: float64

Another function that we want to consider is describe(). It conveniently summarizes most of the statistical information of a data structure such as a Series or a DataFrame:

>>> df5.describe()
         a    b    c
count  3.0  3.0  3.0
mean   3.0  4.0  5.0
std    3.0  3.0  3.0
min    0.0  1.0  2.0
25%    1.5  2.5  3.5
50%    3.0  4.0  5.0
75%    4.5  5.5  6.5
max    6.0  7.0  8.0

We can specify percentiles to include or exclude in the output by using the percentiles parameter; for example, consider the following:

>>> df5.describe(percentiles=[0.5, 0.8])
         a    b    c
count  3.0  3.0  3.0
mean   3.0  4.0  5.0
std    3.0  3.0  3.0
min    0.0  1.0  2.0
50%    3.0  4.0  5.0
80%    4.8  5.8  6.8
max    6.0  7.0  8.0

Here, we have a summary table for common supported statistics functions in Pandas:

Function	Description
`idxmin(axis)`, `idxmax(axis)`	These compute the index labels of the minimum or maximum values.
`value_counts()`	This computes the frequency of unique values.
`count()`	This returns the number of non-null values in a data object.
`mean()`, `median()`, `min()`, `max()`	These return the mean, median, minimum, and maximum values along an axis of a data object.
`std()`, `var()`, `sem()`	These return the standard deviation, variance, and standard error of the mean.
`abs()`	This gets the absolute values of a data object.
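
A few of the functions in the table can be sketched on a small Series; the scores are made-up values for illustration.

```python
import pandas as pd

scores = pd.Series([3, 7, 7, 1, 7, 3], index=list('abcdef'))

print(scores.idxmax())        # label of the first maximum: 'b'
print(scores.idxmin())        # label of the minimum: 'd'
print(scores.value_counts())  # how often each distinct value occurs
print(scores.count())         # number of non-null values: 6
print(scores.median())        # 5.0
```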

Function application

Pandas supports function application, which allows us to apply functions from other packages such as NumPy, or our own functions, to data structure objects. Here, we illustrate two examples of these cases. Firstly, we use apply to execute the std() function, which is the standard deviation function of the NumPy package:

>>> df5.apply(np.std, axis=1)    # default: axis=0
0    0.816497
1    0.816497
2    0.816497
dtype: float64

Secondly, if we want to apply a formula to a data object, we can also use the apply function by following these steps:

Define the function or formula that you want to apply on a data object.

Call the defined function or formula via apply. In this step, we also need to figure out the axis that we want to apply the calculation to:

>>> f = lambda x: x.max() - x.min()    # step 1
>>> df5.apply(f, axis=1)               # step 2
0    2
1    2
2    2
dtype: int64
>>> def sigmoid(x):
        return 1/(1 + np.exp(-x))
>>> df5.apply(sigmoid)
          a         b         c
0  0.500000  0.731059  0.880797
1  0.952574  0.982014  0.993307
2  0.997527  0.999089  0.999665
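
As a further sketch of the two steps, the block below defines a column-wise rescaling function and applies it. The function min_max_scale is a hypothetical helper written for this illustration, not part of Pandas.

```python
import numpy as np
import pandas as pd

df5 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])

def min_max_scale(column):
    """Rescale one column (a Series) linearly onto the interval [0, 1]."""
    return (column - column.min()) / (column.max() - column.min())

scaled = df5.apply(min_max_scale)          # axis=0: called once per column
row_range = df5.apply(lambda row: row.max() - row.min(), axis=1)

print(scaled)
print(row_range.tolist())    # [2, 2, 2]
```

With axis=0 the function receives each column as a Series, and with axis=1 it receives each row, so the same function can summarize in either direction.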

Sorting

There are two kinds of sorting method that we are interested in: sorting by row or column index and sorting by data value.

Firstly, we will consider methods for sorting by row and column index. In this case, we have the sort_index() function. The axis parameter sets whether the function sorts by row or column labels. The ascending option, which takes True or False, sorts data in ascending or descending order. The default setting for this option is True:

>>> df7 = pd.DataFrame(np.arange(12).reshape(3,4),  
                       columns=['b', 'd', 'a', 'c'],
                       index=['x', 'y', 'z'])
>>> df7
   b  d   a   c
x  0  1   2   3
y  4  5   6   7
z  8  9  10  11
>>> df7.sort_index(axis=1)
    a  b   c  d
x   2  0   3  1
y   6  4   7  5
z  10  8  11  9

Series has an order method that sorts by value and returns a sorted copy. For NaN values in the object, we can request special treatment via the na_position option (in later Pandas versions, order is replaced by sort_values):

>>> s4.order(na_position='first')
024     NaN
065     NaN
002    Mary
001     Nam
dtype: object
>>> s4
002    Mary
001     Nam
024     NaN
065     NaN
dtype: object

Besides that, Series also has the sort() function that sorts data by value. However, this function sorts the Series in place and does not return a copy of the sorted data:

>>> s4.sort(na_position='first')
>>> s4
024     NaN
065     NaN
002    Mary
001     Nam
dtype: object

If we want to apply the sort function to a DataFrame object, we need to specify which columns or rows to sort by:

>>> df7.sort(['b', 'd'], ascending=False)
   b  d   a   c
z  8  9  10  11
y  4  5   6   7
x  0  1   2   3

The inplace parameter controls whether the sorted result is written back to the current data object. With inplace set to False, the call returns a new sorted object and leaves the original unchanged.
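
For readers on a recent Pandas release, where order and sort are no longer available, the same sorting tasks can be sketched with sort_values and sort_index. The small Series s is made up for illustration.

```python
import numpy as np
import pandas as pd

df7 = pd.DataFrame(np.arange(12).reshape(3, 4),
                   columns=['b', 'd', 'a', 'c'],
                   index=['x', 'y', 'z'])

by_value = df7.sort_values(['b', 'd'], ascending=False)  # returns a new object
by_label = df7.sort_index(axis=1)                        # sort the column labels

s = pd.Series([3.0, np.nan, 1.0])
nan_first = s.sort_values(na_position='first')           # NaN entries come first

print(list(by_value.index))     # ['z', 'y', 'x']
print(list(by_label.columns))   # ['a', 'b', 'c', 'd']
print(nan_first.tolist())
```

Both calls return new objects by default, so df7 itself keeps its original row order.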

Reindexing and altering labels

Reindex

is a critical method in the Pandas data structures. It confirms whether the new or modified data satisfies a given set of labels along a particular axis of Pandas object.

First, let's view a reindex example on a Series object:

>>> s2.reindex([0, 2, 'b', 3])
0    0.6913
2    0.8627
b    NaN
3    0.7286
dtype: float64

When reindexed labels do not exist in the data object, a default value of NaN will be automatically assigned to the position; this holds true for the DataFrame case as well:

>>> df1.reindex(index=[0, 2, 'b', 3],
        columns=['Density', 'Year', 'Median_Age','C'])
   Density  Year  Median_Age        C
0      244  2000        24.2      NaN
2      268  2010        28.5      NaN
b      NaN   NaN         NaN      NaN
3      279  2014        30.3      NaN

Argument	Description
`index`	This is the new labels/index to conform to.
`method`	This is the method to use for filling holes in a `reindexed` object. The default setting is unfill gaps. `pad/ffill`: fill values forward `backfill`/`bfill`: fill values backward `nearest`: use the nearest value to fill the gap
`copy`	This return a new object. The default setting is `true`.
`level`	The matches index values on the passed multiple index level.
`fill_value`	This is the value to use for missing values. The default setting is `NaN`.
`limit`	This is the maximum size gap to fill in `forward` or `backward` method.

Head and tail

In common data analysis situations, our data structure objects contain many columns and a large number of rows, so we cannot conveniently view all of the information they hold at once. Pandas supports the head and tail functions, which allow us to inspect a small sample. By default, these functions return five elements, but we can set a custom number as well. The following example shows how to display the first five and the last three rows of a longer Series:

>>> s7 = pd.Series(np.random.rand(10000))
>>> s7.head()
0    0.631059
1    0.766085
2    0.066891
3    0.867591
4    0.339678
dtype: float64
>>> s7.tail(3)
9997    0.412178
9998    0.800711
9999    0.438344
dtype: float64

We can also use these functions for DataFrame objects in the same way.
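As a small sketch of the DataFrame case, the following uses a hypothetical 20-row DataFrame named `df` (not one of the objects defined earlier) and passes explicit row counts to `head` and `tail`:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with 20 rows and 2 columns
df = pd.DataFrame(np.arange(40).reshape(20, 2), columns=['x', 'y'])

first_rows = df.head(2)   # first two rows
last_rows = df.tail(4)    # last four rows

print(first_rows)
print(last_rows)
```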

Binary operations

Firstly, we will consider arithmetic operations between objects. When the objects have different indexes, the result is indexed by the union of their indexes. We will not explain this again because we had an example of it in the preceding section (s5 + s6). This time, we will show another example with a DataFrame:

>>> df5 = pd.DataFrame(np.arange(9).reshape(3,3),
                       columns=['a','b','c'])
>>> df5
   a  b  c
0  0  1  2
1  3  4  5
2  6  7  8
>>> df6 = pd.DataFrame(np.arange(8).reshape(2,4), 
                      columns=['a','b','c','d'])
>>> df6
   a  b  c  d
0  0  1  2  3
1  4  5  6  7
>>> df5 + df6
    a   b   c   d
0   0   2   4 NaN
1   7   9  11 NaN
2 NaN NaN NaN NaN

>>> df7 = df5.add(df6, fill_value=0)
>>> df7
   a  b   c    d
0  0  2   4    3
1  7  9  11    7
2  6  7   8  NaN

>>> df5.eq(df6)
       a      b      c      d
0   True   True   True  False
1  False  False  False  False
2  False  False  False  False
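The index-union behavior is easiest to see on two small Series. The following sketch uses hypothetical objects `left` and `right` (they are not the s5 and s6 from the earlier section) and contrasts the `+` operator with `add` and its `fill_value` parameter:

```python
import pandas as pd

# Hypothetical Series that share only the label 'b'
left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20], index=['b', 'd'])

# The result index is the union of both indexes;
# labels present on only one side produce NaN
total = left + right

# With fill_value, a label missing on one side is treated as 0
total_filled = left.add(right, fill_value=0)

print(total)
print(total_filled)
```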

Functional statistics

The statistics methods that a library supports are really important in data analysis. To get an overview of a big data object, we need to know some summarized information such as the mean, sum, or quantiles. Pandas supports a large number of methods to compute them. Let's consider a simple example of calculating the sum information of df5, which is a DataFrame object:

>>> df5.sum()
a     9
b    12
c    15
dtype: int64

When we do not specify the axis along which to calculate the sum, the function works along the index axis (axis 0) by default, which produces one sum per column:

Series: We do not need to specify the axis.
DataFrame: We can sum along the columns (axis = 1) or along the index (axis = 0). The default setting is axis 0.
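The two axis settings can be compared side by side. This sketch rebuilds df5 as defined earlier; the names `col_sums` and `row_sums` are introduced here only for illustration:

```python
import numpy as np
import pandas as pd

df5 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])

col_sums = df5.sum()         # axis=0 (default): one sum per column
row_sums = df5.sum(axis=1)   # axis=1: one sum per row

print(col_sums)
print(row_sums)
```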

We also have the skipna parameter that allows us to decide whether to exclude missing data or not. By default, it is set to True:

>>> df7.sum(skipna=False)
a    13
b    18
c    23
d   NaN
dtype: float64

Another function that we want to consider is describe(). It conveniently summarizes most of the statistical information of a data structure such as a Series or a DataFrame:

>>> df5.describe()
         a    b    c
count  3.0  3.0  3.0
mean   3.0  4.0  5.0
std    3.0  3.0  3.0
min    0.0  1.0  2.0
25%    1.5  2.5  3.5
50%    3.0  4.0  5.0
75%    4.5  5.5  6.5
max    6.0  7.0  8.0

We can specify percentiles to include or exclude in the output by using the percentiles parameter; for example, consider the following:

>>> df5.describe(percentiles=[0.5, 0.8])
         a    b    c
count  3.0  3.0  3.0
mean   3.0  4.0  5.0
std    3.0  3.0  3.0
min    0.0  1.0  2.0
50%    3.0  4.0  5.0
80%    4.8  5.8  6.8
max    6.0  7.0  8.0

Here, we have a summary table of commonly used statistics functions in Pandas:

Function	Description
`idxmin(axis)`, `idxmax(axis)`	These compute the index labels of the minimum or maximum values.
`value_counts()`	This computes the frequency of unique values.
`count()`	This returns the number of non-null values in a data object.
`mean()`, `median()`, `min()`, `max()`	These return the mean, median, minimum, and maximum values along an axis of a data object.
`std()`, `var()`, `sem()`	These return the standard deviation, variance, and standard error of the mean.
`abs()`	This gets the absolute values of a data object.
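A few of the functions in the table can be tried on a small Series. The Series `s` below is hypothetical and is used only to show what each call returns:

```python
import pandas as pd

# Hypothetical Series with repeated values
s = pd.Series([3, 1, 3, 7, 1, 3], index=list('abcdef'))

label_of_max = s.idxmax()    # label of the largest value
label_of_min = s.idxmin()    # label of the first occurrence of the smallest value
counts = s.value_counts()    # frequency of each unique value
n_valid = s.count()          # number of non-null entries
middle = s.median()

print(label_of_max, label_of_min, n_valid, middle)
print(counts)
```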

Function application

Pandas supports function application, which allows us to apply functions from other packages such as NumPy, or our own functions, to data structure objects. Here, we illustrate two examples of these cases. Firstly, we use apply to execute the std() function, which is the standard deviation function of the NumPy package:

>>> df5.apply(np.std, axis=1)    # default: axis=0
0    0.816497
1    0.816497
2    0.816497
dtype: float64

Secondly, if we want to apply a formula to a data object, we can also use the apply function by following these steps:

1. Define the function or formula that you want to apply on a data object.

2. Call the defined function or formula via apply. In this step, we also need to figure out the axis that we want to apply the calculation to:

>>> f = lambda x: x.max() - x.min()    # step 1
>>> df5.apply(f, axis=1)               # step 2
0    2
1    2
2    2
dtype: int64
>>> def sigmoid(x):
    return 1/(1 + np.exp(-x))    # logistic sigmoid
>>> df5.apply(sigmoid)
          a         b         c
0  0.500000  0.731059  0.880797
1  0.952574  0.982014  0.993307
2  0.997527  0.999089  0.999665
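The two steps above can also be sketched with a function that returns several values per column. The helper name `value_range` is hypothetical; when the applied function returns a Series, apply assembles the results into a DataFrame:

```python
import numpy as np
import pandas as pd

df5 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])

# Step 1: define the function (hypothetical helper for illustration)
def value_range(column):
    """Return the minimum, maximum, and their difference for one column."""
    return pd.Series({'min': column.min(),
                      'max': column.max(),
                      'range': column.max() - column.min()})

# Step 2: apply it column by column (axis=0 is the default)
summary = df5.apply(value_range)
print(summary)
```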

Sorting

There are two kinds of sorting methods that we are interested in: sorting by row or column index and sorting by data value.

>>> df7 = pd.DataFrame(np.arange(12).reshape(3,4),  
                       columns=['b', 'd', 'a', 'c'],
                       index=['x', 'y', 'z'])
>>> df7
   b  d   a   c
x  0  1   2   3
y  4  5   6   7
z  8  9  10  11
>>> df7.sort_index(axis=1)
    a  b   c  d
x   2  0   3  1
y   6  4   7  5
z  10  8  11  9

Series has a method order that sorts by value. For NaN values in the object, we can also have a special treatment via the na_position option:

>>> s4.order(na_position='first')
024     NaN
065     NaN
002    Mary
001     Nam
dtype: object
>>> s4
002    Mary
001     Nam
024     NaN
065     NaN
dtype: object

Besides that, Series also has the sort() function that sorts data by value. However, the function will not return a copy of the sorted data:

>>> s4.sort(na_position='first')
>>> s4
024     NaN
065     NaN
002    Mary
001     Nam
dtype: object

If we want to apply the sort function to a DataFrame object, we need to specify which columns or rows will be sorted:

>>> df7.sort(['b', 'd'], ascending=False)
   b  d   a   c
z  8  9  10  11
y  4  5   6   7
x  0  1   2   3

If we do not want to automatically save the sorting result to the current data object, we can change the setting of the inplace parameter to False.
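In recent Pandas releases, the order and sort methods shown above are no longer available; sorting by value is done with sort_values instead. The following is a sketch of the equivalent calls, assuming a current Pandas version. Note that with sort_values the inplace parameter defaults to False, so the original object is left unchanged unless we ask otherwise:

```python
import numpy as np
import pandas as pd

s4 = pd.Series({'001': 'Nam', '002': 'Mary', '003': 'Peter'},
               index=['002', '001', '024', '065'])

# sort_values returns a new, sorted Series and leaves s4 unchanged
sorted_s4 = s4.sort_values(na_position='first')

df7 = pd.DataFrame(np.arange(12).reshape(3, 4),
                   columns=['b', 'd', 'a', 'c'],
                   index=['x', 'y', 'z'])

# Sort rows by column 'b', then 'd', largest values first
sorted_df7 = df7.sort_values(['b', 'd'], ascending=False)

# inplace=True modifies df7 itself and returns None
df7.sort_values('b', ascending=False, inplace=True)

print(sorted_s4)
print(sorted_df7)
```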

Indexing and selecting data

In this section, we will focus on how to get, set, or slice subsets of Pandas data structure objects. As we learned in previous sections, Series or DataFrame objects have axis labeling information. This information can be used to identify items that we want to select or assign a new value to in the object:

>>> s4[['024', '002']]    # selecting data of Series object
024     NaN
002    Mary
dtype: object
>>> s4[['024', '002']] = 'unknown' # assigning data
>>> s4
024    unknown
065        NaN
002    unknown
001        Nam
dtype: object

If the data object is a DataFrame structure, we can also proceed in a similar way:

>>> df5[['b', 'c']]
   b  c
0  1  2
1  4  5
2  7  8

For label indexing on the rows of a DataFrame, we use the ix indexer, which enables us to select a set of rows and columns in the object. There are two parameters that we need to specify: the row and column labels that we want to get. By default, if we do not specify the selected column names, the indexer will return the selected rows with all columns in the object:

>>> df5.ix[0]
a    0
b    1
c    2
Name: 0, dtype: int64
>>> df5.ix[0, 1:3]
b    1
c    2
Name: 0, dtype: int64
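The ix indexer has been removed from recent Pandas releases. As a sketch of the same selections, assuming a current Pandas version, we can use loc for label-based access and iloc for position-based access:

```python
import numpy as np
import pandas as pd

df5 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])

row0 = df5.loc[0]                  # row with label 0, all columns
by_label = df5.loc[0, ['b', 'c']]  # row label 0, columns selected by name
by_position = df5.iloc[0, 1:3]     # first row, second and third columns

print(row0)
print(by_label)
print(by_position)
```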

Moreover, we have many ways to select and edit data contained in a Pandas object. We summarize these functions in the following table:

Method	Description
`icol`, `irow`	These select a single column or row by integer location.
`get_value`, `set_value`	These select or set a single value of a data object by row and column label.
`xs`	This selects a single column or row as a Series by label.

Tip

Pandas data objects may contain duplicate indices. In this case, when we get or set a data value via index label, it will affect all rows or columns that have the same selected index name.
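The effect of duplicate index labels can be seen in a short sketch. The Series `dup` is hypothetical and exists only to illustrate the tip:

```python
import pandas as pd

# Hypothetical Series in which the label 'a' appears twice
dup = pd.Series([1, 2, 3], index=['a', 'b', 'a'])

# Selecting by a duplicated label returns every matching entry
selected = dup['a']

# Assigning by a duplicated label updates every matching entry
dup['a'] = 0

print(selected)
print(dup)
```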

Computational tools

Let's start with correlation and covariance computation between two data objects. Both the Series and DataFrame have a cov method. On a DataFrame object, this method will compute the covariance between the Series inside the object:

>>> s1 = pd.Series(np.random.rand(3))
>>> s1
0    0.460324
1    0.993279
2    0.032957
dtype: float64
>>> s2 = pd.Series(np.random.rand(3))
>>> s2
0    0.777509
1    0.573716
2    0.664212
dtype: float64
>>> s1.cov(s2)
-0.024516360159045424

>>> df8 = pd.DataFrame(np.random.rand(12).reshape(4,3),  
                       columns=['a','b','c'])
>>> df8
          a         b         c
0  0.200049  0.070034  0.978615
1  0.293063  0.609812  0.788773
2  0.853431  0.243656  0.978057
3  0.985584  0.500765  0.481180
>>> df8.cov()
          a         b         c
a  0.155307  0.021273 -0.048449
b  0.021273  0.059925 -0.040029
c -0.048449 -0.040029  0.055067

Usage of the correlation method is similar to the covariance method. When the data object is a DataFrame, it computes the correlation between the Series inside the object. We can also specify which method is used to compute the correlations. The available methods are pearson, kendall, and spearman. By default, the function applies the pearson method; the following example uses spearman:

>>> df8.corr(method = 'spearman')
     a    b    c
a  1.0  0.4 -0.8
b  0.4  1.0 -0.8
c -0.8 -0.8  1.0
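The difference between the default pearson method and the rank-based spearman method shows up on data that is monotonic but not linear. The DataFrame `squares` below is a hypothetical example:

```python
import pandas as pd

# Hypothetical data: y grows monotonically with x, but not linearly
squares = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [1, 4, 9, 16]})

default_corr = squares.corr()               # no method given
pearson = squares.corr(method='pearson')    # linear correlation
spearman = squares.corr(method='spearman')  # rank correlation

print(pearson)
print(spearman)
```

Because the ranks of x and y agree exactly, the spearman coefficient is 1, while the pearson coefficient stays below 1.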

We also have the corrwith function that supports calculating correlations between Series that have the same label contained in different DataFrame objects:

>>> df9 = pd.DataFrame(np.arange(8).reshape(4,2), 
                       columns=['a', 'b'])
>>> df9
   a  b
0  0  1
1  2  3
2  4  5
3  6  7
>>> df8.corrwith(df9)
a    0.955567
b    0.488370
c         NaN
dtype: float64

Working with missing data

In this section, we will discuss missing, NaN, or null values in Pandas data structures. It is very common to end up with missing data in an object. One operation that creates missing data is reindexing:

>>> df8 = pd.DataFrame(np.arange(12).reshape(4,3),
                       columns=['a', 'b', 'c'])
>>> df8
   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
>>> df9 = df8.reindex(columns = ['a', 'b', 'c', 'd'])
>>> df9
   a   b   c   d
0  0   1   2 NaN
1  3   4   5 NaN
2  6   7   8 NaN
3  9  10  11 NaN
>>> df10 = df8.reindex([3, 2, 'a', 0])
>>> df10
    a   b   c
3   9  10  11
2   6   7   8
a NaN NaN NaN
0   0   1   2

To manipulate missing values, we can use the isnull() or notnull() functions to detect the missing values in a Series object, as well as in a DataFrame object:

>>> df10.isnull()
       a      b      c
3  False  False  False
2  False  False  False
a   True   True   True
0  False  False  False

On a Series, we can drop all null data and index values by using the dropna function:

>>> s4 = pd.Series({'001': 'Nam', '002': 'Mary',
                    '003': 'Peter'},
                    index=['002', '001', '024', '065'])
>>> s4
002    Mary
001     Nam
024     NaN
065     NaN
dtype: object
>>> s4.dropna()    # dropping all null value of Series object
002    Mary
001     Nam
dtype: object

With a DataFrame object, it is a little bit more complex than with Series. We can tell which rows or columns we want to drop and also if all entries must be null or a single null value is enough. By default, the function will drop any row containing a missing value:

>>> df9.dropna()    # all rows will be dropped
Empty DataFrame
Columns: [a, b, c, d]
Index: []
>>> df9.dropna(axis=1)
   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
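The choice between requiring all entries to be null and accepting a single null value is controlled by the how parameter of dropna. A minimal sketch, using a hypothetical DataFrame `gaps`:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame: column 'c' is entirely missing, row 1 is entirely missing
gaps = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                     'b': [np.nan, np.nan, 6.0],
                     'c': [np.nan, np.nan, np.nan]})

any_rows = gaps.dropna()                    # how='any' (default): drop rows with at least one NaN
all_rows = gaps.dropna(how='all')           # drop only rows where every value is NaN
all_cols = gaps.dropna(axis=1, how='all')   # drop only columns where every value is NaN

print(any_rows)
print(all_rows)
print(all_cols)
```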

Another way to control missing values is to use the parameters of the functions that we introduced in the previous section. In our experience, it helps to assign a fixed value to missing cases when we create data objects, which makes the objects cleaner in later processing steps. For example, consider the following:

>>> df11 = df8.reindex([3, 2, 'a', 0], fill_value = 0)
>>> df11
   a   b   c
3  9  10  11
2  6   7   8
a  0   0   0
0  0   1   2

We can also use the fillna function to fill missing values with a custom value:

>>> df9.fillna(-1)
   a   b   c  d
0  0   1   2 -1
1  3   4   5 -1
2  6   7   8 -1
3  9  10  11 -1

Advanced uses of Pandas for data analysis

In this section we will consider some advanced Pandas use cases.

Hierarchical indexing

Hierarchical indexing provides us with a way to work with higher dimensional data in a lower dimension by structuring the data object into multiple index levels on an axis:

>>> s8 = pd.Series(np.random.rand(8),
                   index=[['a','a','b','b','c','c','d','d'],
                          [0, 1, 0, 1, 0, 1, 0, 1]])
>>> s8
a  0    0.721652
   1    0.297784
b  0    0.271995
   1    0.125342
c  0    0.444074
   1    0.948363
d  0    0.197565
   1    0.883776
dtype: float64

In the preceding example, we have a Series object that has two index levels. The object can be rearranged into a DataFrame using the unstack function. In an inverse situation, the stack function can be used:

>>> s8.unstack()
          0         1
a  0.721652  0.297784
b  0.271995  0.125342
c  0.444074  0.948363
d  0.197565  0.883776
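The relationship between unstack and stack can be checked with a round trip. This sketch uses fixed values instead of random ones so that the result is easy to follow; the names `wide` and `back` are introduced only for illustration:

```python
import numpy as np
import pandas as pd

# Same two-level index as s8, but with the values 0.0 to 7.0
s8 = pd.Series(np.arange(8.0),
               index=[['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                      [0, 1, 0, 1, 0, 1, 0, 1]])

wide = s8.unstack()   # inner index level becomes the columns
back = wide.stack()   # columns move back into the inner index level

print(wide)
print(back)
```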

We can also create a DataFrame to have a hierarchical index in both axes:

>>> df = pd.DataFrame(np.random.rand(12).reshape(4,3),
                      index=[['a', 'a', 'b', 'b'],
                               [0, 1, 0, 1]],
                      columns=[['x', 'x', 'y'], [0, 1, 0]])
>>> df
            x                   y
            0         1         0
a 0  0.636893  0.729521  0.747230
  1  0.749002  0.323388  0.259496
b 0  0.214046  0.926961  0.679686
  1  0.013258  0.416101  0.626927
>>> df.index
MultiIndex(levels=[['a', 'b'], [0, 1]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
>>> df.columns
MultiIndex(levels=[['x', 'y'], [0, 1]],
           labels=[[0, 0, 1], [0, 1, 0]])

The methods for getting or setting values or subsets of the data objects with multiple index levels are similar to those of the nonhierarchical case:

>>> df['x']
            0         1
a 0  0.636893  0.729521
  1  0.749002  0.323388
b 0  0.214046  0.926961
  1  0.013258  0.416101
>>> df[[0]]
            x
            0
a 0  0.636893
  1  0.749002
b 0  0.214046
  1  0.013258
>>> df.ix['a', 'x']
          0         1
0  0.636893  0.729521
1  0.749002  0.323388
>>> df.ix['a','x'].ix[1]
0    0.749002
1    0.323388
Name: 1, dtype: float64

After grouping data into multiple index levels, we can also use most of the descriptive and statistics functions that have a level option, which can be used to specify the level we want to process:

>>> df.std(level=1)
          x                   y
          0         1         0
0  0.298998  0.139611  0.047761
1  0.520250  0.065558  0.259813
>>> df.std(level=0)
          x                   y
          0         1         0
a  0.079273  0.287180  0.344880
b  0.141979  0.361232  0.037306
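In recent Pandas releases, the level option has been removed from the statistics functions, and the same result is obtained by grouping on an index level first. The following sketch assumes a current Pandas version and uses fixed values instead of random ones:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape(4, 3),
                  index=[['a', 'a', 'b', 'b'], [0, 1, 0, 1]],
                  columns=[['x', 'x', 'y'], [0, 1, 0]])

# Group rows by the inner index level (0 or 1), then take the standard deviation
by_inner = df.groupby(level=1).std()

# Group rows by the outer index level ('a' or 'b')
by_outer = df.groupby(level=0).std()

print(by_inner)
print(by_outer)
```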

The Panel data

The Panel is another data structure for three-dimensional data in Pandas. However, it is less frequently used than the Series or the DataFrame. You can think of a Panel as a table of DataFrame objects. We can create a Panel object from a 3D ndarray or a dictionary of DataFrame objects:

# create a Panel from 3D ndarray
>>> panel = pd.Panel(np.random.rand(2, 4, 5),
                     items = ['item1', 'item2'])
>>> panel
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: item1 to item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4

>>> df1 = pd.DataFrame(np.arange(12).reshape(4, 3), 
                       columns=['a','b','c'])
>>> df1
   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3), 
                       columns=['a','b','c'])
>>> df2
   a  b  c
0  0  1  2
1  3  4  5
2  6  7  8
# create another Panel from a dict of DataFrame objects
>>> panel2 = pd.Panel({'item1': df1, 'item2': df2})
>>> panel2
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: item1 to item2
Major_axis axis: 0 to 3
Minor_axis axis: a to c

Each item in a Panel is a DataFrame. We can select an item by its name:

>>> panel2['item1']
   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

Alternatively, if we want to select data via an axis or data position, we can use the ix indexer, just as with a Series or DataFrame:

>>> panel2.ix[:, 1:3, ['b', 'c']]
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: item1 to item2
Major_axis axis: 1 to 3
Minor_axis axis: b to c
>>> panel2.ix[:, 2, :]
   item1  item2
a      6      6
b      7      7
c      8      8
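
Panel has been removed from newer Pandas releases. One possible substitute, sketched below, is a single DataFrame with a hierarchical row index whose outer level plays the role of the item name. This is an assumption about how the example could be adapted, not the book's own method, and the name `stacked` is chosen here for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])

# Concatenate with the dict keys as the outer index level
stacked = pd.concat({'item1': df1, 'item2': df2})

# Counterpart of panel2['item1']
item1 = stacked.loc['item1']

# Counterpart of panel2.ix[:, 2, :]: row 2 of every item, transposed
# so that items become columns
row2 = stacked.xs(2, level=1).T
```

Unlike a Panel, this layout does not pad the shorter item with missing values, so `stacked` has seven rows rather than eight.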

Summary

We have finished covering the basics of the Pandas data analysis library. Whenever you learn about a library for data analysis, you need to consider the three parts that we explained in this chapter:

Data structures: we have two common data object types in the Pandas library, Series and DataFrame.
Methods to access and manipulate data objects: Pandas supports many ways to select, set, or slice subsets of a data object. However, the general mechanism is to use index labels or the positions of items to identify values.
Functions and utilities: these are the most important part of a powerful library. In this chapter, we covered the common functions of Pandas that allow us to compute statistics on data easily.

The library also has a lot of other useful functions and utilities that we could not explain in this chapter. We encourage you to start your own research if you want to expand your experience with Pandas. It helps us to process large data in an optimized way. You will see more of Pandas in action later in this book.
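
As a compact reminder of the label-versus-position mechanism mentioned above, here is a minimal sketch. The Series and its labels are made up for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

# Select by index label
by_label = s.loc['y']

# Select by integer position
by_position = s.iloc[1]

# Label slices include the end label; position slices exclude the end
label_slice = s.loc['x':'y']
position_slice = s.iloc[0:1]
```

Both single selections return the same value, while the two slices differ in length because of how each treats its end point.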

So far, we have learned about two popular Python libraries: NumPy and Pandas. Pandas is built on NumPy, and as a result it allows for more convenient interaction with data. However, in some situations, we can flexibly combine both of them to accomplish our goals.
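
One way the two libraries can be combined is sketched below, using a small made-up DataFrame: a NumPy function is applied directly to a Pandas object, and the underlying array is extracted for plain NumPy computation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 4.0, 9.0], 'b': [16.0, 25.0, 36.0]})

# A NumPy universal function applied to a DataFrame keeps the labels
roots = np.sqrt(df)

# Extract the underlying ndarray and continue with plain NumPy
arr = df.to_numpy()
col_sums = arr.sum(axis=0)
```

The first call returns a DataFrame with the original row and column labels, whereas `arr` is an unlabeled two-dimensional array.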

Practice exercises

The link https://www.census.gov/2010census/csv/pop_change.csv contains a US census dataset. It has 23 columns and one row for each US state, as well as a few rows for macro regions such as North, South, and West.

Get this dataset into a Pandas DataFrame. Hint: just skip those rows that do not seem helpful, such as comments or description.
While the dataset contains change metrics for each decade, we are interested in the population change during the second half of the twentieth century, that is, between 1950 and 2000. Which region has seen the biggest and the smallest population growth in this time span? Also, which US state?
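
As a starting point, the general approach can be sketched on a tiny in-memory stand-in for the file. Everything in this sketch is hypothetical: the layout, the column names (`name`, `pop_1950`, `pop_2000`), the number of rows to skip, and the values are invented for illustration and do not reflect the real dataset, which you will need to inspect yourself.

```python
import io
import pandas as pd

# Hypothetical miniature file: two non-data lines, then a header and rows
raw = """Some descriptive line that is not data
Another line to skip
name,pop_1950,pop_2000
RegionA,100,150
RegionB,200,220
StateC,50,90
"""

# skiprows drops the leading lines that are not part of the table
data = pd.read_csv(io.StringIO(raw), skiprows=2)

# Absolute change between the two years
data['growth'] = data['pop_2000'] - data['pop_1950']

# Names of the rows with the largest and smallest change
largest = data.loc[data['growth'].idxmax(), 'name']
smallest = data.loc[data['growth'].idxmin(), 'name']
```

For the real file, the same `pd.read_csv` call can be pointed at the URL or a downloaded copy once you have decided which rows to skip.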

Advanced open-ended exercise:

Find more census data on the internet, not just for the US but for the world's countries. Try to find GDP data for the same period as well. Try to align this data to explore patterns. How are GDP and population growth related? Are there any special cases, such as countries with high GDP but low population growth, or countries with the opposite history?
