Book Image

Pandas 1.x Cookbook - Second Edition

By : Matt Harrison, Theodore Petrou
Book Image

Pandas 1.x Cookbook - Second Edition

By: Matt Harrison, Theodore Petrou

Overview of this book

The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands as one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through situations that you are highly likely to encounter. This new updated and revised edition provides you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas. Some recipes focus on achieving a deeper understanding of basic principles, or comparing and contrasting two similar operations. Other recipes will dive deep into a particular dataset, uncovering new and unexpected insights along the way. Many advanced recipes combine several different features across the pandas library to generate results.
Table of Contents (17 chapters)
15
Other Books You May Enjoy
16
Index

Determining college campus diversity

Many articles are written every year on the different aspects and impacts of diversity on college campuses. Various organizations have developed metrics attempting to measure diversity. US News is a leader in providing rankings for many different categories of colleges, with diversity being one of them. Their top 10 diverse colleges with Diversity Index are given as follows:

>>> pd.read_csv(
...     "data/college_diversity.csv", index_col="School"
... )
                                                   Diversity Index
School
Rutgers University--Newark  Newark, NJ                        0.76
Andrews University  Berrien Springs, MI                       0.74
Stanford University  Stanford, CA                             0.74
University of Houston  Houston, TX                            0.74
University of Nevada--Las Vegas  Las Vegas, NV                0.74
University of San Francisco  San Francisco, CA                0.74
San Francisco State University  San Francisco, CA             0.73
University of Illinois--Chicago  Chicago, IL                  0.73
New Jersey Institute of Technology  Newark, NJ                0.72
Texas Woman's University  Denton, TX                          0.72

Our college dataset classifies race into nine different categories. When trying to quantify something without an obvious definition, such as diversity, it helps to start with something simple. In this recipe, our diversity metric will equal the count of the number of races having greater than 15% of the student population.

How to do it...

  1. Read in the college dataset, and filter for just the undergraduate race columns:
    >>> college = pd.read_csv(
    ...     "data/college.csv", index_col="INSTNM"
    ... )
    >>> college_ugds = college.filter(like="UGDS_")
    
  2. Many of these colleges have missing values for all their race columns. We can count all the missing values for each row and sort the resulting Series from the highest to lowest. This will reveal the colleges that have missing values:
    >>> (
    ...     college_ugds.isnull()
    ...     .sum(axis="columns")
    ...     .sort_values(ascending=False)
    ...     .head()
    ... )
    INSTNM
    Excel Learning Center-San Antonio South         9
    Philadelphia College of Osteopathic Medicine    9
    Assemblies of God Theological Seminary          9
    Episcopal Divinity School                       9
    Phillips Graduate Institute                     9
    dtype: int64
    
  3. Now that we have seen the colleges that are missing all their race columns, we can use the .dropna method to drop all rows that have all nine race percentages missing. We can then count the remaining missing values:
    >>> college_ugds = college_ugds.dropna(how="all")
    >>>; college_ugds.isnull().sum()
    UGDS_WHITE    0
    UGDS_BLACK    0
    UGDS_HISP     0
    UGDS_ASIAN    0
    UGDS_AIAN     0
    UGDS_NHPI     0
    UGDS_2MOR     0
    UGDS_NRA      0
    UGDS_UNKN     0
    dtype: int64
    
  4. There are no missing values left in the dataset. We can now calculate our diversity metric. To get started, we will use the greater than or equal DataFrame method, .ge, to return a DataFrame with a Boolean value for each cell:
    >>> college_ugds.ge(0.15)
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...       False        True  ...     False      False
    Universit...        True        True  ...     False      False
    Amridge U...        True        True  ...     False       True
    Universit...        True       False  ...     False      False
    Alabama S...       False        True  ...     False      False
    ...                  ...         ...  ...       ...        ...
    Hollywood...        True        True  ...     False      False
    Hollywood...       False        True  ...     False      False
    Coachella...        True       False  ...     False      False
    Dewey Uni...       False       False  ...     False      False
    Coastal P...        True        True  ...     False      False
    
  5. From here, we can use the .sum method to count the True values for each college. Notice that a Series is returned:
    >>> diversity_metric = college_ugds.ge(0.15).sum(
    ...     axis="columns"
    ... )
    >>> diversity_metric.head()
    INSTNM
    Alabama A & M University               1
    University of Alabama at Birmingham    2
    Amridge University                     3
    University of Alabama in Huntsville    1
    Alabama State University               1
    dtype: int64
    
  6. To get an idea of the distribution, we will use the .value_counts method on this Series:
    >>> diversity_metric.value_counts()
    1    3042
    2    2884
    3     876
    4      63
    0       7
    5       2
    dtype: int64
    
  7. Amazingly, two schools have more than 15% in five different race categories. Let's sort the diversity_metric Series to find out which ones they are:
    >>> diversity_metric.sort_values(ascending=False).head()
    INSTNM
    Regency Beauty Institute-Austin          5
    Central Texas Beauty College-Temple      5
    Sullivan and Cogliano Training Center    4
    Ambria College of Nursing                4
    Berkeley College-New York                4
    dtype: int64
    
  8. It seems a little suspicious that schools can be that diverse. Let's look at the raw percentages from these top two schools. We will use .loc to select rows based on the index label:
    >>> college_ugds.loc[
    ...     [
    ...         "Regency Beauty Institute-Austin",
    ...         "Central Texas Beauty College-Temple",
    ...     ]
    ... ]
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...
    Regency B...      0.1867      0.2133  ...       0.0     0.2667
    Central T...      0.1616      0.2323  ...       0.0     0.1515
    
  9. It appears that several categories were aggregated into the unknown and two or more races column. Regardless of this, they both appear to be quite diverse. We can see how the top five US News schools fared with this basic diversity metric:
    >>> us_news_top = [
    ...     "Rutgers University-Newark",
    ...     "Andrews University",
    ...     "Stanford University",
    ...     "University of Houston",
    ...     "University of Nevada-Las Vegas",
    ... ]
    >>> diversity_metric.loc[us_news_top]
    INSTNM
    Rutgers University-Newark         4
    Andrews University                3
    Stanford University               3
    University of Houston             3
    University of Nevada-Las Vegas    3
    dtype: int64
    

How it works...

Step 2 counts and then displays the schools with the highest number of missing values. As there are nine columns in the DataFrame, the maximum number of missing values per school is nine. Many schools are missing values for each column. Step 3 removes rows that have all their values missing. The .dropna method in step 3 has the how parameter, which defaults to the string 'any', but may also be changed to 'all'. When set to 'any', it drops rows that contain one or more missing values. When set to 'all', it only drops rows where all values are missing.

In this case, we conservatively drop rows that are missing all values. This is because it's possible that some missing values represent 0 percent. This did not happen to be the case here, as there were no missing values after the dropna method was performed. If there were still missing values, we could have run the .fillna(0) method to fill all the remaining values with 0.

Step 5 begins our diversity metric calculation using the greater than or equal to method, .ge. This results in a DataFrame of all Booleans, which is summed horizontally by setting axis='columns'.

The .value_counts method is used in step 6 to produce a distribution of our diversity metric. It is quite rare for schools to have three races with 15% or more of the undergraduate student population. Step 7 and step 8 find two schools that are the most diverse based on our metric. Although they are diverse, it appears that many of the races are not fully accounted for and are defaulted into the unknown and two or more categories.

Step 9 selects the top five schools from the US News article. It then selects their diversity metric from our newly created Series. It turns out that these schools also score highly with our simple ranking system.

There's more...

Alternatively, we can find the schools that are least diverse by ordering them by their maximum race percentage:

>>> (
...     college_ugds.max(axis=1)
...     .sort_values(ascending=False)
...     .head(10)
... )
INSTNM
Dewey University-Manati                               1.0
Yeshiva and Kollel Harbotzas Torah                    1.0
Mr Leon's School of Hair Design-Lewiston              1.0
Dewey University-Bayamon                              1.0
Shepherds Theological Seminary                        1.0
Yeshiva Gedolah Kesser Torah                          1.0
Monteclaro Escuela de Hoteleria y Artes Culinarias    1.0
Yeshiva Shaar Hatorah                                 1.0
Bais Medrash Elyon                                    1.0
Yeshiva of Nitra Rabbinical College                   1.0
dtype: float64

We can also determine if any school has all nine race categories exceeding 1%:

>>> (college_ugds > 0.01).all(axis=1).any()
True