Book Image

Pandas 1.x Cookbook - Second Edition

By : Matt Harrison, Theodore Petrou
Book Image

Pandas 1.x Cookbook - Second Edition

By: Matt Harrison, Theodore Petrou

Overview of this book

The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands as one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through situations that you are highly likely to encounter. This new updated and revised edition provides you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas. Some recipes focus on achieving a deeper understanding of basic principles, or comparing and contrasting two similar operations. Other recipes will dive deep into a particular dataset, uncovering new and unexpected insights along the way. Many advanced recipes combine several different features across the pandas library to generate results.
Table of Contents (17 chapters)
15
Other Books You May Enjoy
16
Index

DataFrame operations

A primer on operators was given in the Series operations recipe from Chapter 1, Pandas Foundations, which will be helpful here. The Python arithmetic and comparison operators work with DataFrames, as they do with Series.

When an arithmetic or comparison operator is used with a DataFrame, each value of each column gets the operation applied to it. Typically, when an operator is used with a DataFrame, the columns are either all numeric or all object (usually strings). If the DataFrame does not contain homogeneous data, then the operation is likely to fail. Let's see an example of this failure with the college dataset, which contains both numeric and object data types. Attempting to add 5 to each value of the DataFrame raises a TypeError as integers cannot be added to strings:

>>> colleges = pd.read_csv("data/college.csv")
>>> colleges + 5
Traceback (most recent call last):
  ...
TypeError: can only concatenate str (not "int") to str

To successfully use an operator with a DataFrame, first select homogeneous data. For this recipe, we will select all the columns that begin with 'UGDS_'. These columns represent the fraction of undergraduate students by race. To get started, we import the data and use the institution name as the label for our index, and then select the columns we desire with the .filter method:

>>> colleges = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = colleges.filter(like="UGDS_")
>>> college_ugds.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...      0.0333      0.9353  ...    0.0059     0.0138
Universit...      0.5922      0.2600  ...    0.0179     0.0100
Amridge U...      0.2990      0.4192  ...    0.0000     0.2715
Universit...      0.6988      0.1255  ...    0.0332     0.0350
Alabama S...      0.0158      0.9208  ...    0.0243     0.0137

This recipe uses multiple operators with a DataFrame to round the undergraduate columns to the nearest hundredth. We will then see how this result is equivalent to the .round method.

How to do it...

  1. pandas does bankers rounding, numbers that are exactly halfway between either side to the even side. Look at what happens to the UGDS_BLACK row of this series when we round it to two decimal places:
    >>> name = "Northwest-Shoals Community College"
    >>> college_ugds.loc[name]
    UGDS_WHITE    0.7912
    UGDS_BLACK    0.1250
    UGDS_HISP     0.0339
    UGDS_ASIAN    0.0036
    UGDS_AIAN     0.0088
    UGDS_NHPI     0.0006
    UGDS_2MOR     0.0012
    UGDS_NRA      0.0033
    UGDS_UNKN     0.0324
    Name: Northwest-Shoals Community College, dtype: float64
    >>> college_ugds.loc[name].round(2)
    UGDS_WHITE    0.79
    UGDS_BLACK    0.12
    UGDS_HISP     0.03
    UGDS_ASIAN    0.00
    UGDS_AIAN     0.01
    UGDS_NHPI     0.00
    UGDS_2MOR     0.00
    UGDS_NRA      0.00
    UGDS_UNKN     0.03
    Name: Northwest-Shoals Community College, dtype: float64
    

    If we add .0001 before rounding, it changes to rounding up:

    >>> (college_ugds.loc[name] + 0.0001).round(2)
    UGDS_WHITE    0.79
    UGDS_BLACK    0.13
    UGDS_HISP     0.03
    UGDS_ASIAN    0.00
    UGDS_AIAN     0.01
    UGDS_NHPI     0.00
    UGDS_2MOR     0.00
    UGDS_NRA      0.00
    UGDS_UNKN     0.03
    Name: Northwest-Shoals Community College, dtype: float64
    
  2. Let's do this to the DataFrame. To begin our rounding adventure with operators, we will first add .00501 to each value of college_ugds:
    >>> college_ugds + 0.00501
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...     0.03831     0.94031  ...   0.01091    0.01881
    Universit...     0.59721     0.26501  ...   0.02291    0.01501
    Amridge U...     0.30401     0.42421  ...   0.00501    0.27651
    Universit...     0.70381     0.13051  ...   0.03821    0.04001
    Alabama S...     0.02081     0.92581  ...   0.02931    0.01871
    ...                  ...         ...  ...       ...        ...
    SAE Insti...         NaN         NaN  ...       NaN        NaN
    Rasmussen...         NaN         NaN  ...       NaN        NaN
    National ...         NaN         NaN  ...       NaN        NaN
    Bay Area ...         NaN         NaN  ...       NaN        NaN
    Excel Lea...         NaN         NaN  ...       NaN        NaN
    
  3. Use the floor division operator, //, to round down to the nearest whole number percentage:
    >>> (college_ugds + 0.00501) // 0.01
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...         3.0        94.0  ...       1.0        1.0
    Universit...        59.0        26.0  ...       2.0        1.0
    Amridge U...        30.0        42.0  ...       0.0       27.0
    Universit...        70.0        13.0  ...       3.0        4.0
    Alabama S...         2.0        92.0  ...       2.0        1.0
    ...                  ...         ...  ...       ...        ...
    SAE Insti...         NaN         NaN  ...       NaN        NaN
    Rasmussen...         NaN         NaN  ...       NaN        NaN
    National ...         NaN         NaN  ...       NaN        NaN
    Bay Area ...         NaN         NaN  ...       NaN        NaN
    Excel Lea...         NaN         NaN  ...       NaN        NaN
    
  4. To complete the rounding exercise, divide by 100:
    >>> college_ugds_op_round =(
    ...     (college_ugds + 0.00501) // 0.01 / 100
    ... )
    >>> college_ugds_op_round.head()
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...        0.03        0.94  ...      0.01       0.01
    Universit...        0.59        0.26  ...      0.02       0.01
    Amridge U...        0.30        0.42  ...      0.00       0.27
    Universit...        0.70        0.13  ...      0.03       0.04
    Alabama S...        0.02        0.92  ...      0.02       0.01
    
  5. Now use the round DataFrame method to do the rounding automatically for us. Due to bankers rounding, we add a small fraction before rounding:
    >>> college_ugds_round = (college_ugds + 0.00001).round(2)
    >>> college_ugds_round
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...        0.03        0.94  ...      0.01       0.01
    Universit...        0.59        0.26  ...      0.02       0.01
    Amridge U...        0.30        0.42  ...      0.00       0.27
    Universit...        0.70        0.13  ...      0.03       0.04
    Alabama S...        0.02        0.92  ...      0.02       0.01
    ...                  ...         ...  ...       ...        ....
    SAE Insti...         NaN         NaN  ...       NaN        NaN
    Rasmussen...         NaN         NaN  ...       NaN        NaN
    National ...         NaN         NaN  ...       NaN        NaN
    Bay Area ...         NaN         NaN  ...       NaN        NaN
    Excel Lea...         NaN         NaN  ...       NaN        NaN
    
  6. Use the equals DataFrame method to test the equality of two DataFrames:
    >>> college_ugds_op_round.equals(college_ugds_round)
    True
    

How it works...

Steps 1 and 2 use the plus operator, which attempts to add a scalar value to each value of each column of the DataFrame. As the columns are all numeric, this operation works as expected. There are some missing values in each of the columns but they stay missing after the operation.

Mathematically, adding .005 should be enough so that the floor division in the next step correctly rounds to the nearest whole percentage. The trouble appears because of the inexactness of floating-point numbers:

>>> 0.045 + 0.005
0.049999999999999996

There is an extra .00001 added to each number to ensure that the floating-point representation has the first four digits the same as the actual value. This works because the maximum precision of all the points in the dataset is four decimal places.

Step 3 applies the floor division operator, //, to all the values in the DataFrame. As we are dividing by a fraction, in essence, it is multiplying each value by 100 and truncating any decimals. Parentheses are needed around the first part of the expression, as floor division has higher precedence than addition. Step 4 uses the division operator to return the decimal to the correct position.

In step 5, we reproduce the previous steps with the round method. Before we can do this, we must again add an extra .00001 to each DataFrame value for a different reason from step 2. NumPy and Python 3 round numbers that are exactly halfway between either side to the even number. The bankers rounding (or ties to even http://bit.ly/2x3V5TU) technique is not usually what is formally taught in schools. It does not consistently bias numbers to the higher side (http://bit.ly/2zhsPy8).

It is necessary here to round up so that both DataFrame values are equal. The .equals method determines if all the elements and indexes between two DataFrames are exactly the same and returns a Boolean.

There's more...

Just as with Series, DataFrames have method equivalents of the operators. You may replace the operators with their method equivalents:

>>> college2 = (
...     college_ugds.add(0.00501).floordiv(0.01).div(100)
... )
>>> college2.equals(college_ugds_op_round)
True