Book Image

Pandas Cookbook

By : Theodore Petrou
Book Image

Pandas Cookbook

By: Theodore Petrou

Overview of this book

This book will provide you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas 0.20. Some recipes focus on achieving a deeper understanding of basic principles, or comparing and contrasting two similar operations. Other recipes will dive deep into a particular dataset, uncovering new and unexpected insights along the way. The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands like one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through practical situations that you are highly likely to encounter. Many advanced recipes combine several different features across the pandas 0.20 library to generate results.
Table of Contents (12 chapters)

Creating and deleting columns

During a data analysis, it is extremely likely that you will need to create new columns to represent new variables. Commonly, these new columns will be created from previous columns already in the dataset. Pandas has a few different ways to add new columns to a DataFrame.

Getting ready

In this recipe, we create new columns in the movie dataset by using the assignment and then delete columns with the drop method.

How to do it...

  1. The simplest way to create a new column is to assign it a scalar value. Place the name of the new column as a string into the indexing operator. Let's create the has_seen column in the movie dataset to indicate whether or not we have seen the movie. We will assign zero for every value. By default, new columns are appended to the end:
>>> movie = pd.read_csv('data/movie.csv')
>>> movie['has_seen'] = 0
  1. There are several columns that contain data on the number of Facebook likes. Let's add up all the actor and director Facebook likes and assign them to the actor_director_facebook_likes column:
>>> movie['actor_director_facebook_likes'] =  \
(movie['actor_1_facebook_likes'] +
movie['actor_2_facebook_likes'] +
movie['actor_3_facebook_likes'] +
movie['director_facebook_likes'])
  1. From the Calling Series method recipe in this chapter, we know that this dataset contains missing values. When numeric columns are added to one another as in the preceding step, pandas defaults missing values to zero. But, if all values for a particular row are missing, then pandas keeps the total as missing as well. Let's check if there are missing values in our new column and fill them with 0:
>>> movie['actor_director_facebook_likes'].isnull().sum()
122
>>> movie['actor_director_facebook_likes'] = \
movie['actor_director_facebook_likes'].fillna(0)
  1. There is another column in the dataset named cast_total_facebook_likes. It would be interesting to see what percentage of this column comes from our newly created column, actor_director_facebook_likes. Before we create our percentage column, let's do some basic data validation. Let's ensure that cast_total_facebook_likes is greater than or equal to actor_director_facebook_likes:
>>> movie['is_cast_likes_more'] = \
(movie['cast_total_facebook_likes'] >=
movie['actor_director_facebook_likes'])
  1. is_cast_likes_more is now a column of boolean values. We can check whether all the values of this column are True with the all Series method:
>>> movie['is_cast_likes_more'].all()
False
  1. It turns out that there is at least one movie with more actor_director_facebook_likes than cast_total_facebook_likes. It could be that director Facebook likes are not part of the cast total likes. Let's backtrack and delete column actor_director_facebook_likes:
>>> movie = movie.drop('actor_director_facebook_likes',
axis='columns')
  1. Let's recreate a column of just the total actor likes:
>>> movie['actor_total_facebook_likes'] = \
(movie['actor_1_facebook_likes'] +
movie['actor_2_facebook_likes'] +
movie['actor_3_facebook_likes'])

>>> movie['actor_total_facebook_likes'] = \
movie['actor_total_facebook_likes'].fillna(0)
  1. Check again whether all the values in cast_total_facebook_likes are greater than the actor_total_facebook_likes:
>>> movie['is_cast_likes_more'] = \
(movie['cast_total_facebook_likes'] >=
movie['actor_total_facebook_likes'])

>>> movie['is_cast_likes_more'].all()
True
  1. Finally, let's calculate the percentage of the cast_total_facebook_likes that come from actor_total_facebook_likes:
>>> movie['pct_actor_cast_like'] = \
(movie['actor_total_facebook_likes'] /
movie['cast_total_facebook_likes'])
  1. Let's validate that the min and max of this column fall between 0 and 1:
>>> (movie['pct_actor_cast_like'].min(), 
movie['pct_actor_cast_like'].max())
(0.0, 1.0)
  1. We can then output this column as a Series. First, we need to set the index to the movie title so we can properly identify each value.
>>> movie.set_index('movie_title')['pct_actor_cast_like'].head()
movie_title Avatar 0.577369 Pirates of the Caribbean: At World's End 0.951396 Spectre 0.987521 The Dark Knight Rises 0.683783 Star Wars: Episode VII - The Force Awakens 0.000000 Name: pct_actor_cast_like, dtype: float64

How it works...

Many pandas operations are flexible, and column creation is one of them. This recipe assigns both a scalar value, as seen in Step 1, and a Series, as seen in step 2, to create a new column.

Step 2 adds four different Series together with the plus operator. Step 3 uses method chaining to find and fill missing values. Step 4 uses the greater than or equal comparison operator to return a boolean Series, which is then evaluated with the all method in step 5 to check whether every single value is True or not.

The drop method accepts the name of the row or column to delete. It defaults to dropping rows by the index names. To drop columns you must set the axis parameter to either 1 or columns. The default value for axis is 0 or the string index.

Steps 7 and 8 redo the work of step 3 to step 5 without the director_facebook_likes column. Step 9 finally calculates the desired column we wanted since step 4. Step 10 validates that the percentages are between 0 and 1.

There's more...

It is possible to insert a new column into a specific place in a DataFrame besides the end with the insert method. The insert method takes the integer position of the new column as its first argument, the name of the new column as its second, and the values as its third. You will need to use the get_loc Index method to find the integer location of the column name.

The insert method modifies the calling DataFrame in-place, so there won't be an assignment statement. The profit of each movie may be calculated by subtracting budget from gross and inserting it directly after gross with the following:

>>> profit_index = movie.columns.get_loc('gross') + 1
>>> profit_index
9

>>> movie.insert(loc=profit_index,
column='profit',
value=movie['gross'] - movie['budget'])

An alternative to deleting columns with the drop method is to use the del statement:

>>> del movie['actor_director_facebook_likes']

See also

  • Refer to the Appending new rows to DataFrames recipe from Chapter 9, Combining Pandas Objects for adding and deleting rows, which is a less common operation
  • Refer to the Developing a data analysis routine recipe from Chapter 3, Beginning Data Analysis