Book Image

Pandas Cookbook

By : Theodore Petrou
Book Image

Pandas Cookbook

By: Theodore Petrou

Overview of this book

This book will provide you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas 0.20. Some recipes focus on achieving a deeper understanding of basic principles, or comparing and contrasting two similar operations. Other recipes will dive deep into a particular dataset, uncovering new and unexpected insights along the way. The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands like one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through practical situations that you are highly likely to encounter. Many advanced recipes combine several different features across the pandas 0.20 library to generate results.
Table of Contents (12 chapters)

Chaining Series methods together

In Python, every variable is an object, and all objects have attributes and methods that refer to or return more objects. The sequential invocation of methods using the dot notation is referred to as method chaining. Pandas is a library that lends itself well to method chaining, as many Series and DataFrame methods return more Series and DataFrames, upon which more methods can be called.

Getting ready

To motivate method chaining, let's take a simple English sentence and translate the chain of events into a chain of methods. Consider the sentence, A person drives to the store to buy food, then drives home and prepares, cooks, serves, and eats the food before cleaning the dishes.

A Python version of this sentence might take the following form:

>>> person.drive('store')\
.buy('food')\
.drive('home')\
.prepare('food')\
.cook('food')\
.serve('food')\
.eat('food')\
.cleanup('dishes')

In the preceding code, the person is the object calling each of the methods, just as the person is performing all of the actions in the original sentence. The parameter passed to each of the methods specifies how the method operates.

Although it is possible to write the entire method chain in a single unbroken line, it is far more palatable to write a single method per line. Since Python does not normally allow a single expression to be written on multiple lines, you need to use the backslash line continuation character. Alternatively, you may wrap the whole expression in parentheses. To improve readability even more, place each method directly under the dot above it. This recipe shows similar method chaining with pandas Series.

How to do it...

  1. Load in the movie dataset, and select two columns as a distinct Series:
>>> movie = pd.read_csv('data/movie.csv')
>>> actor_1_fb_likes = movie['actor_1_facebook_likes']
>>> director = movie['director_name']
  1. One of the most common methods to append to the chain is the head method. This suppresses long output. For shorter chains, there isn't as great a need to place each method on a different line:
>>> director.value_counts().head(3)
Steven Spielberg 26 Woody Allen 22 Clint Eastwood 20 Name: director_name, dtype: int64
  1. A common way to count the number of missing values is to chain the sum method after isnull:
>>> actor_1_fb_likes.isnull().sum()
7
  1. All the non-missing values of actor_1_fb_likes should be integers as it is impossible to have a partial Facebook like. Any numeric columns with missing values must have their data type as float. If we fill missing values from actor_1_fb_likes with zeros, we can then convert it to an integer with the astype method:
>>> actor_1_fb_likes.dtype
dtype('float64')

>>> actor_1_fb_likes.fillna(0)\
.astype(int)\
.head()
0 1000 1 40000 2 11000 3 27000 4 131 Name: actor_1_facebook_likes, dtype: int64

How it works...

Method chaining is possible with all Python objects since each object method must return another object that itself will have more methods. It is not necessary for the method to return the same type of object.

Step 2 first uses value_counts to return a Series and then chains the head method to select the first three elements. The final returned object is a Series, which could also have had more methods chained on it.

In step 3, the isnull method creates a boolean Series. Pandas numerically evaluates False/True as 0/1, so the sum method returns the number of missing values.

Each of the three chained methods in step 4 returns a Series. It may not seem intuitive, but the astype method returns an entirely new Series with a different data type

There's more...

Instead of summing up the booleans in step 3 to find the total number of missing values, we can take the mean of the Series to get the percentage of values that are missing:

>>> actor_1_fb_likes.isnull().mean()
0.0014

As was mentioned at the beginning of the recipe, it is possible to use parentheses instead of the backslash for multi-line code. Step 4 may be rewritten this way:

>>> (actor_1_fb_likes.fillna(0)
.astype(int)
.head())

Not all programmers like the use of method chaining, as there are some downsides. One such downside is that debugging becomes difficult. None of the intermediate objects produced during the chain are stored in a variable, so if there is an unexpected result, it will be difficult to trace the exact location in the chain where it occurred.

The example at the start of the recipe may be rewritten so that the result of each method gets preserved as/in a unique variable. This makes tracking bugs much easier, as you can inspect the object at each step:

>>> person1 = person.drive('store')
>>> person2 = person1.buy('food')
>>> person3 = person2.drive('home')
>>> person4 = person3.prepare('food')
>>> person5 = person4.cook('food')
>>> person6 = person5.serve('food')
>>> person7 = person6.eat('food')
>>> person8 = person7.cleanup('dishes')