Comparing missing values

pandas uses the NumPy NaN (np.nan) object to represent a missing value. This is an unusual object and has interesting mathematical properties. For instance, it is not equal to itself. Even Python's None object evaluates as True when compared to itself:

>>> np.nan == np.nan
False
>>> None == None
True

All other comparisons against np.nan also return False, except not equal to (!=):

>>> np.nan > 5
False
>>> 5 > np.nan
False
>>> np.nan != 5
True

Getting ready

Series and DataFrames use the equals operator, ==, to make element-by-element comparisons. The result is an object with the same dimensions. This recipe shows you how to use the equals operator, which is very different from the .equals method.

As in the previous recipe, the columns representing the fraction of each race of undergraduate students from the college dataset will be used:

>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")

How to do it...

To get an idea of how the equals operator works, let's compare each element to a scalar value:

>>> college_ugds == 0.0019
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Amridge U...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Alabama S...       False       False  ...     False      False
...                  ...         ...  ...       ...        ...
SAE Insti...       False       False  ...     False      False
Rasmussen...       False       False  ...     False      False
National ...       False       False  ...     False      False
Bay Area ...       False       False  ...     False      False
Excel Lea...       False       False  ...     False      False

This works as expected but becomes problematic whenever you attempt to compare DataFrames with missing values. You may be tempted to use the equals operator to compare two DataFrames with one another on an element-by-element basis. Take, for instance, college_ugds compared against itself, as follows:

>>> college_self_compare = college_ugds == college_ugds
>>> college_self_compare.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...
Alabama A...        True        True  ...      True       True
Universit...        True        True  ...      True       True
Amridge U...        True        True  ...      True       True
Universit...        True        True  ...      True       True
Alabama S...        True        True  ...      True       True

At first glance, all the values appear to be equal, as you would expect. However, using the .all method to determine if each column contains only True values yields an unexpected result:

>>> college_self_compare.all()
UGDS_WHITE    False
UGDS_BLACK    False
UGDS_HISP     False
UGDS_ASIAN    False
UGDS_AIAN     False
UGDS_NHPI     False
UGDS_2MOR     False
UGDS_NRA      False
UGDS_UNKN     False
dtype: bool

This happens because missing values do not compare equally with one another. If you tried to count missing values using the equal operator and summing up the Boolean columns, you would get zero for each one:

>>> (college_ugds == np.nan).sum()
UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64

Instead of using == to find missing numbers, use the .isna method:

>>> college_ugds.isna().sum()
UGDS_WHITE    661
UGDS_BLACK    661
UGDS_HISP     661
UGDS_ASIAN    661
UGDS_AIAN     661
UGDS_NHPI     661
UGDS_2MOR     661
UGDS_NRA      661
UGDS_UNKN     661
dtype: int64

The correct way to compare two entire DataFrames with one another is not with the equals operator (==) but with the .equals method. This method treats NaNs that are in the same location as equal (note that the .eq method is the equivalent of ==):
```
>>> college_ugds.equals(college_ugds)
True
```

How it works...

Step 1 compares a DataFrame to a scalar value while step 2 compares a DataFrame with another DataFrame. Both operations appear to be quite simple and intuitive at first glance. The second operation is checking whether the DataFrames have identically labeled indexes and thus the same number of elements. The operation will fail if this isn't the case.

Step 3 verifies that none of the columns in the DataFrames are equivalent to each other. Step 4 further shows the non-equivalence of np.nan and itself. Step 5 verifies that there are indeed missing values in the DataFrame. Finally, step 6 shows the correct way to compare DataFrames with the .equals method, which always returns a Boolean scalar value.

There's more...

All the comparison operators have method counterparts that allow for more functionality. Somewhat confusingly, the .eq DataFrame method does element-by-element comparison, just like the equals (==) operator. The .eq method is not at all the same as the .equals method. The following code duplicates step 1:

>>> college_ugds.eq(0.0019)  # same as college_ugds == .0019
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Amridge U...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Alabama S...       False       False  ...     False      False
...                  ...         ...  ...       ...        ...
SAE Insti...       False       False  ...     False      False
Rasmussen...       False       False  ...     False      False
National ...       False       False  ...     False      False
Bay Area ...       False       False  ...     False      False
Excel Lea...       False       False  ...     False      False

Inside the pandas.testing sub-package, a function exists that developers should use when creating unit tests. The assert_frame_equal function raises an AssertionError if two DataFrames are not equal. It returns None if the two DataFrames are equal:

>>> from pandas.testing import assert_frame_equal
>>> assert_frame_equal(college_ugds, college_ugds) is None
True

Unit tests are a very important part of software development and ensure that the code is running correctly. pandas contains many thousands of unit tests that help ensure that it is running properly. To read more on how pandas runs its unit tests, see the Contributing to pandas section in the documentation (http://bit.ly/2vmCSU6).

Tech Concepts

Programming languages

Tech Tools

Unlimited access to the largest independent learning library in tech of over 8,000 expert-authored tech books and videos.

Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.

50+ new titles added per month and exclusive early access to books as they are being written.

Pandas 1.x Cookbook - Second Edition

By : Matthew Harrison, Theodore Petrou

Pandas 1.x Cookbook

By: Matthew Harrison, Theodore Petrou

Overview of this book

Comparing missing values

Getting ready

How to do it...

How it works...

There's more...

Pandas 1.x Cookbook - Second Edition

By : Matthew Harrison, Theodore Petrou

Pandas 1.x Cookbook

By: Matthew Harrison, Theodore Petrou

Overview of this book

Comparing missing values

Getting ready

How to do it...

How it works...

There's more...

Confirmation

Buy this book with your credits?

Submit Your Feedback

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access