Pandas 1.x Cookbook - Second Edition
pandas uses the NumPy NaN (np.nan) object to represent a missing value. This is an unusual object and has interesting mathematical properties. For instance, it is not equal to itself. Even Python's None object evaluates as True when compared to itself:
>>> np.nan == np.nan
False
>>> None == None
True
All other comparisons against np.nan also return False, except not equal to (!=):
>>> np.nan > 5
False
>>> 5 > np.nan
False
>>> np.nan != 5
True
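Because equality can never match np.nan, a NaN check needs a dedicated test. The following is a minimal sketch, not part of the original recipe, that contrasts == with np.isnan, math.isnan, and the self-inequality trick; the variable names are illustrative only.

```python
import math

import numpy as np

value = np.nan

# Equality cannot detect NaN, because NaN is never equal to anything.
found_with_equality = value == np.nan

# Dedicated predicates test for NaN directly.
found_with_numpy = bool(np.isnan(value))
found_with_math = math.isnan(value)

# A value that is not equal to itself must be NaN.
found_with_self_inequality = value != value

print(found_with_equality, found_with_numpy, found_with_math,
      found_with_self_inequality)
```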
Series and DataFrames use the equals operator, ==, to make element-by-element comparisons. The result is an object with the same dimensions. This recipe shows you how to use the equals operator, which is very different from the .equals method.
As in the previous recipe, the columns representing the fraction of each race of undergraduate students from the college dataset will be used:
>>> college = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
To see how the equals operator works, compare each element to a scalar value:
>>> college_ugds == 0.0019
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...
Alabama A...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Amridge U...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Alabama S...       False       False  ...     False      False
...                  ...         ...  ...       ...        ...
SAE Insti...       False       False  ...     False      False
Rasmussen...       False       False  ...     False      False
National ...       False       False  ...     False      False
Bay Area ...       False       False  ...     False      False
Excel Lea...       False       False  ...     False      False
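This works as expected, but it becomes problematic when you compare DataFrames that contain missing values. Take, for instance, college_ugds compared against itself, as follows: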
>>> college_self_compare = college_ugds == college_ugds
>>> college_self_compare.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...
Alabama A...        True        True  ...      True       True
Universit...        True        True  ...      True       True
Amridge U...        True        True  ...      True       True
Universit...        True        True  ...      True       True
Alabama S...        True        True  ...      True       True
At first glance, every value appears to be equal, as you would expect. However, using the .all method to determine if each column contains only True values yields an unexpected result:
>>> college_self_compare.all()
UGDS_WHITE    False
UGDS_BLACK    False
UGDS_HISP     False
UGDS_ASIAN    False
UGDS_AIAN     False
UGDS_NHPI     False
UGDS_2MOR     False
UGDS_NRA      False
UGDS_UNKN     False
dtype: bool
This happens because missing values do not compare as equal to one another. Counting missing values by comparing against np.nan with == and summing the Boolean columns gives zero for every column:
>>> (college_ugds == np.nan).sum()
UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64
Instead of using == to find missing numbers, use the .isna method:
>>> college_ugds.isna().sum()
UGDS_WHITE    661
UGDS_BLACK    661
UGDS_HISP     661
UGDS_ASIAN    661
UGDS_AIAN     661
UGDS_NHPI     661
UGDS_2MOR     661
UGDS_NRA      661
UGDS_UNKN     661
dtype: int64
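The same behavior can be reproduced without the college dataset. The sketch below uses a small hypothetical DataFrame (the school names and values are made up for illustration) to show that == never matches np.nan, that .isna does, and that a self-comparison is False exactly where values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for college_ugds with three missing values.
toy = pd.DataFrame(
    {
        "UGDS_WHITE": [0.5, np.nan, 0.25],
        "UGDS_BLACK": [np.nan, np.nan, 0.75],
    },
    index=["School A", "School B", "School C"],
)

# == never matches NaN, so every per-column count is zero.
counts_with_equals = (toy == np.nan).sum()

# .isna flags the missing cells, so the counts are correct.
counts_with_isna = toy.isna().sum()

# The self-comparison is False exactly where a value is missing.
self_compare_false = ~(toy == toy)

print(counts_with_equals.tolist())  # [0, 0]
print(counts_with_isna.tolist())    # [1, 2]
print(self_compare_false.equals(toy.isna()))  # True
```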
The correct way to compare two entire DataFrames with one another is not with the equals operator (==) but with the .equals method. This method treats NaNs that are in the same location as equal (note that the .eq method is the equivalent of ==):
>>> college_ugds.equals(college_ugds)
True
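One detail worth knowing about .equals is that it also compares column dtypes, so two DataFrames whose values match element by element can still be reported as different. The following sketch illustrates both points with small hypothetical DataFrames that are not part of the recipe's data.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"x": [1.0, np.nan], "y": [3.0, 4.0]})
right = left.copy()

# Element-wise comparison is spoiled by the NaN in column "x".
elementwise_all = bool((left == right).all().all())

# .equals treats NaNs in the same location as equal.
whole_frame = left.equals(right)

# .equals also requires matching dtypes: int64 versus float64 here.
ints = pd.DataFrame({"y": [3, 4]})
floats = pd.DataFrame({"y": [3.0, 4.0]})
same_values = bool((ints == floats).all().all())
same_frames = ints.equals(floats)

print(elementwise_all, whole_frame)  # False True
print(same_values, same_frames)      # True False
```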
Step 1 compares a DataFrame to a scalar value, while step 2 compares a DataFrame with another DataFrame. Both operations appear simple and intuitive at first glance. Before comparing any values, the second operation checks that the two DataFrames have identically labeled indexes and columns, and therefore the same number of elements. If they do not, the comparison raises a ValueError.
Step 3 shows that no column of the self-comparison consists entirely of True values, even though the first five rows suggested otherwise. Step 4 further shows the non-equivalence of np.nan and itself. Step 5 verifies that there are indeed missing values in the DataFrame. Finally, step 6 shows the correct way to compare DataFrames with the .equals method, which always returns a Boolean scalar value.
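The failure case mentioned for step 2 can be sketched with two small hypothetical DataFrames whose row labels differ. The exact wording of the error message varies between pandas versions, so the example only records that a ValueError was raised. By contrast, .equals does not raise and simply returns False.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
b = pd.DataFrame({"x": [1, 2]}, index=["r1", "r3"])

# == refuses to compare DataFrames whose labels do not match.
try:
    a == b
except ValueError as err:
    message = str(err)
else:
    message = None

# .equals reports the mismatch as False instead of raising.
labels_differ = not a.equals(b)

print(message is not None, labels_differ)  # True True
```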
All the comparison operators have method counterparts that allow for more functionality. Somewhat confusingly, the .eq DataFrame method does element-by-element comparison, just like the equals (==) operator. The .eq method is not at all the same as the .equals method. The following code duplicates step 1:
>>> college_ugds.eq(0.0019) # same as college_ugds == .0019
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...
Alabama A...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Amridge U...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Alabama S...       False       False  ...     False      False
...                  ...         ...  ...       ...        ...
SAE Insti...       False       False  ...     False      False
Rasmussen...       False       False  ...     False      False
National ...       False       False  ...     False      False
Bay Area ...       False       False  ...     False      False
Excel Lea...       False       False  ...     False      False
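One example of the extra functionality is the axis parameter, which controls how a Series is lined up with the DataFrame. The sketch below, using hypothetical data, compares every column against a Series of expected values by matching on the row labels.

```python
import pandas as pd

scores = pd.DataFrame(
    {"first": [1, 2, 3], "second": [1, 0, 3]},
    index=["a", "b", "c"],
)
expected = pd.Series([1, 2, 3], index=["a", "b", "c"])

# axis="index" aligns the Series with the row labels, so each
# column is compared against the expected values row by row.
by_rows = scores.eq(expected, axis="index")

print(by_rows["first"].tolist())   # [True, True, True]
print(by_rows["second"].tolist())  # [True, False, True]
```

The other comparison methods (.ne, .lt, .le, .gt, and .ge) accept the same axis parameter.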
Inside the pandas.testing sub-package, a function exists that developers should use when creating unit tests. The assert_frame_equal function raises an AssertionError if two DataFrames are not equal. It returns None if the two DataFrames are equal:
>>> from pandas.testing import assert_frame_equal
>>> assert_frame_equal(college_ugds, college_ugds) is None
True
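A short sketch of how assert_frame_equal might be used in a test follows. The DataFrames are hypothetical. It shows that NaNs in the same position pass the check, that a changed value raises AssertionError, and that the check_dtype=False option relaxes the dtype comparison.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

expected = pd.DataFrame({"x": [1.0, np.nan]})
actual = expected.copy()

# NaNs in the same position count as equal, so this passes silently.
assert_frame_equal(actual, expected)

# A changed value raises AssertionError.
changed = pd.DataFrame({"x": [1.0, 2.0]})
try:
    assert_frame_equal(changed, expected)
except AssertionError:
    detected = True
else:
    detected = False

# check_dtype=False accepts int64 versus float64 when values match.
ints = pd.DataFrame({"x": [1, 2]})
floats = pd.DataFrame({"x": [1.0, 2.0]})
assert_frame_equal(ints, floats, check_dtype=False)

print(detected)  # True
```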
Unit tests are a very important part of software development and help ensure that code runs correctly. pandas contains many thousands of unit tests for this purpose. To read more on how pandas runs its unit tests, see the Contributing to pandas section in the documentation (http://bit.ly/2vmCSU6).