Pandas Cookbook

By: Theodore Petrou

Overview of this book

This book provides unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas 0.20. Some recipes focus on achieving a deeper understanding of basic principles, or on comparing and contrasting two similar operations. Other recipes dive deep into a particular dataset, uncovering new and unexpected insights along the way. The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands as one would during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through practical situations that you are highly likely to encounter. Many advanced recipes combine several different features across the pandas 0.20 library to generate results.

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Assumptions for every recipe

Free Chapter

Pandas Foundations

Introduction

Dissecting the anatomy of a DataFrame

Accessing the main DataFrame components

Understanding data types

Selecting a single column of data as a Series

Calling Series methods

Working with operators on a Series

Chaining Series methods together

Making the index meaningful

Renaming row and column names

Creating and deleting columns

Essential DataFrame Operations

Introduction

Selecting multiple DataFrame columns

Selecting columns with methods

Ordering column names sensibly

Operating on the entire DataFrame

Chaining DataFrame methods together

Working with operators on a DataFrame

Comparing missing values

Transposing the direction of a DataFrame operation

Determining college campus diversity

Beginning Data Analysis

Introduction

Developing a data analysis routine

Reducing memory by changing data types

Selecting the smallest of the largest

Selecting the largest of each group by sorting

Replicating nlargest with sort_values

Calculating a trailing stop order price

Selecting Subsets of Data

Introduction

Selecting Series data

Selecting DataFrame rows

Selecting DataFrame rows and columns simultaneously

Selecting data with both integers and labels

Speeding up scalar selection

Slicing rows lazily

Slicing lexicographically

Boolean Indexing

Introduction

Calculating boolean statistics

Constructing multiple boolean conditions

Filtering with boolean indexing

Replicating boolean indexing with index selection

Selecting with unique and sorted indexes

Gaining perspective on stock prices

Translating SQL WHERE clauses

Determining the normality of stock market returns

Improving readability of boolean indexing with the query method

Preserving Series with the where method

Masking DataFrame rows

Selecting with booleans, integer location, and labels

Index Alignment

Introduction

Examining the Index object

Producing Cartesian products

Exploding indexes

Filling values with unequal indexes

Appending columns from different DataFrames

Highlighting the maximum value from each column

Replicating idxmax with method chaining

Finding the most common maximum

Grouping for Aggregation, Filtration, and Transformation

Introduction

Defining an aggregation

Grouping and aggregating with multiple columns and functions

Removing the MultiIndex after grouping

Customizing an aggregation function

Customizing aggregating functions with *args and **kwargs

Examining the groupby object

Filtering for states with a minority majority

Transforming through a weight loss bet

Calculating weighted mean SAT scores per state with apply

Grouping by continuous variables

Counting the total number of flights between cities

Finding the longest streak of on-time flights

Restructuring Data into a Tidy Form

Introduction

Tidying variable values as column names with stack

Tidying variable values as column names with melt

Stacking multiple groups of variables simultaneously

Inverting stacked data

Unstacking after a groupby aggregation

Replicating pivot_table with a groupby aggregation

Renaming axis levels for easy reshaping

Tidying when multiple variables are stored as column names

Tidying when multiple variables are stored as column values

Tidying when two or more values are stored in the same cell

Tidying when variables are stored in column names and values

Tidying when multiple observational units are stored in the same table

Combining Pandas Objects

Introduction

Appending new rows to DataFrames

Concatenating multiple DataFrames together

Comparing President Trump's and Obama's approval ratings

Understanding the differences between concat, join, and merge

Connecting to SQL databases

Time Series Analysis

Introduction

Understanding the difference between Python and pandas date tools

Slicing time series intelligently

Using methods that only work with a DatetimeIndex

Counting the number of weekly crimes

Aggregating weekly crime and traffic accidents separately

Measuring crime by weekday and year

Grouping with anonymous functions with a DatetimeIndex

Grouping by a Timestamp and another column

Finding the last time crime was 20% lower with merge_asof

Visualization with Matplotlib, Pandas, and Seaborn

Introduction

Getting started with matplotlib

Visualizing data with matplotlib

Plotting basics with pandas

Visualizing the flights dataset

Stacking area charts to discover emerging trends

Understanding the differences between seaborn and pandas

Doing multivariate analysis with seaborn Grids

Uncovering Simpson's paradox in the diamonds dataset with seaborn

Creating and deleting columns

During a data analysis, it is extremely likely that you will need to create new columns to represent new variables. Commonly, these new columns will be created from previous columns already in the dataset. Pandas has a few different ways to add new columns to a DataFrame.

Getting ready

In this recipe, we create new columns in the movie dataset by using assignment and then delete columns with the drop method.

How to do it...

The simplest way to create a new column is to assign it a scalar value. Place the name of the new column as a string into the indexing operator. Let's create the has_seen column in the movie dataset to indicate whether or not we have seen the movie. We will assign zero for every value. By default, new columns are appended to the end:

>>> import pandas as pd
>>> movie = pd.read_csv('data/movie.csv')
>>> movie['has_seen'] = 0

There are several columns that contain data on the number of Facebook likes. Let's add up all the actor and director Facebook likes and assign them to the actor_director_facebook_likes column:

>>> movie['actor_director_facebook_likes'] =  \
        (movie['actor_1_facebook_likes'] + 
         movie['actor_2_facebook_likes'] + 
         movie['actor_3_facebook_likes'] + 
         movie['director_facebook_likes'])

From the Calling Series methods recipe in this chapter, we know that this dataset contains missing values. When numeric columns are added to one another with the plus operator, as in the preceding step, a missing value in any of the columns makes the total for that row missing as well. Let's check if there are missing values in our new column and fill them with 0:

>>> movie['actor_director_facebook_likes'].isnull().sum()
122
>>> movie['actor_director_facebook_likes'] = \
    movie['actor_director_facebook_likes'].fillna(0)
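As a small illustration of how missing values behave under addition, the following sketch uses a hypothetical three-row DataFrame (the column names and values are made up, not taken from the movie dataset). It compares the plus operator with the add method and its fill_value parameter, and with the sum method. The sum result for an all-missing row assumes a recent pandas version, where such a sum is 0.

```python
import numpy as np
import pandas as pd

# Hypothetical like counts; NaN marks a missing value
likes = pd.DataFrame({'actor_1': [10.0, np.nan, np.nan],
                      'actor_2': [5.0, 7.0, np.nan]})

# The plus operator propagates a missing value from either side
plus_total = likes['actor_1'] + likes['actor_2']

# add with fill_value=0 substitutes 0 when only one side is missing;
# a row missing on both sides stays missing
add_total = likes['actor_1'].add(likes['actor_2'], fill_value=0)

# sum skips missing values by default (recent pandas: all-missing row gives 0)
sum_total = likes.sum(axis='columns')

print(plus_total.isna().tolist())
print(add_total.isna().tolist())
print(sum_total.tolist())
```

Which variant is appropriate depends on whether a partially missing row should count as a known total or as unknown.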

There is another column in the dataset named cast_total_facebook_likes. It would be interesting to see what percentage of this column comes from our newly created column, actor_director_facebook_likes. Before we create our percentage column, let's do some basic data validation. Let's ensure that cast_total_facebook_likes is greater than or equal to actor_director_facebook_likes:

>>> movie['is_cast_likes_more'] = \
         (movie['cast_total_facebook_likes'] >=             
          movie['actor_director_facebook_likes'])

is_cast_likes_more is now a column of boolean values. We can check whether all the values of this column are True with the all Series method:

>>> movie['is_cast_likes_more'].all()
False
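When all returns False, it can help to see which rows fail the check. A minimal sketch, assuming a hypothetical toy DataFrame with shortened column names, inverts the boolean Series with the ~ operator to count and select the offending rows:

```python
import pandas as pd

# Hypothetical stand-in for the two like-count columns
toy = pd.DataFrame({'cast_total': [100, 50, 30],
                    'actor_director': [90, 60, 30]},
                   index=['A', 'B', 'C'])

is_more = toy['cast_total'] >= toy['actor_director']

# all is True only when every value is True
every_row_ok = is_more.all()

# ~ flips True and False; summing booleans counts the True values
num_offenders = (~is_more).sum()

# A boolean Series inside the indexing operator selects matching rows
offenders = toy[~is_more]

print(every_row_ok, num_offenders)
print(offenders.index.tolist())
```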

It turns out that there is at least one movie with more actor_director_facebook_likes than cast_total_facebook_likes. It could be that director Facebook likes are not part of the cast total likes. Let's backtrack and delete the actor_director_facebook_likes column:

>>> movie = movie.drop('actor_director_facebook_likes',
                       axis='columns')

Let's recreate a column of just the total actor likes:

>>> movie['actor_total_facebook_likes'] = \
         (movie['actor_1_facebook_likes'] + 
          movie['actor_2_facebook_likes'] + 
          movie['actor_3_facebook_likes'])

>>> movie['actor_total_facebook_likes'] = \
         movie['actor_total_facebook_likes'].fillna(0)

Check again whether all the values in cast_total_facebook_likes are greater than or equal to those in actor_total_facebook_likes:

>>> movie['is_cast_likes_more'] = \
         (movie['cast_total_facebook_likes'] >= 
          movie['actor_total_facebook_likes'])
    
>>> movie['is_cast_likes_more'].all()
True

Finally, let's calculate the percentage of cast_total_facebook_likes that comes from actor_total_facebook_likes:

>>> movie['pct_actor_cast_like'] = \
         (movie['actor_total_facebook_likes'] / 
          movie['cast_total_facebook_likes'])

Let's validate that the min and max of this column fall between 0 and 1:

>>> (movie['pct_actor_cast_like'].min(), 
     movie['pct_actor_cast_like'].max())
(0.0, 1.0)
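One caveat with this kind of validation is that min and max skip missing values by default, so they say nothing about rows where the division could not be computed. The sketch below uses hypothetical Series (not the movie data) to show that dividing zero by zero produces a missing value rather than an error, and that it can be counted separately:

```python
import pandas as pd

# Hypothetical like counts; the second row has zero likes on both sides
actor = pd.Series([50.0, 0.0, 20.0])
cast = pd.Series([100.0, 0.0, 20.0])

# 0 / 0 yields NaN instead of raising an exception
pct = actor / cast

# min and max ignore NaN by default
lowest, highest = pct.min(), pct.max()

# Count the rows whose percentage is undefined
num_undefined = pct.isna().sum()

# Check the range on the defined values only
in_range = pct.dropna().between(0, 1).all()

print(lowest, highest, num_undefined, in_range)
```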

We can then output this column as a Series. First, we need to set the index to the movie title so we can properly identify each value:

>>> movie.set_index('movie_title')['pct_actor_cast_like'].head()
movie_title
Avatar                                        0.577369
Pirates of the Caribbean: At World's End      0.951396
Spectre                                       0.987521
The Dark Knight Rises                         0.683783
Star Wars: Episode VII - The Force Awakens    0.000000
Name: pct_actor_cast_like, dtype: float64

How it works...

Many pandas operations are flexible, and column creation is one of them. This recipe assigns both a scalar value, as seen in step 1, and a Series, as seen in step 2, to create a new column.

Step 2 adds four different Series together with the plus operator. Step 3 uses method chaining to find and fill missing values. Step 4 uses the greater than or equal to comparison operator to return a boolean Series, which is then evaluated with the all method in step 5 to check whether every single value is True.

The drop method accepts the name of the row or column to delete. It defaults to dropping rows by their index labels. To drop columns, you must set the axis parameter to either 1 or the string 'columns'. The default value for axis is 0, which is equivalent to the string 'index'.
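The behavior of the axis parameter can be sketched on a hypothetical two-by-two DataFrame. The columns keyword shown last is an assumption about the installed version: it was added in pandas 0.21, after the 0.20 release this book targets.

```python
import pandas as pd

# Hypothetical DataFrame with labeled rows and columns
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])

# Default axis=0 ('index'): drops the row labeled 'x'
no_row = toy.drop('x')

# axis='columns' (or axis=1): drops the column labeled 'b'
no_col = toy.drop('b', axis='columns')

# Keyword form, assuming pandas 0.21 or later
no_col_kw = toy.drop(columns='b')

# drop returns a new DataFrame and leaves the original unchanged
print(no_row.index.tolist(), no_col.columns.tolist(), toy.shape)
```

Because drop returns a new object, the result must be assigned back to a name, as the recipe does with movie.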

Steps 7 and 8 redo the work of steps 2 through 5 without the director_facebook_likes column. Step 9 finally calculates the column we have wanted since step 4. Step 10 validates that the percentages are between 0 and 1.

There's more...

With the insert method, it is possible to place a new column at a specific position in a DataFrame rather than at the end. The insert method takes the integer position of the new column as its first argument, the name of the new column as its second, and the values as its third. You will need to use the get_loc Index method to find the integer location of the column name.

The insert method modifies the calling DataFrame in-place, so there won't be an assignment statement. The profit of each movie may be calculated by subtracting budget from gross and inserting it directly after gross with the following:

>>> profit_index = movie.columns.get_loc('gross') + 1
>>> profit_index
9

>>> movie.insert(loc=profit_index,
                 column='profit',
                 value=movie['gross'] - movie['budget'])
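The same pattern can be checked end to end on a small DataFrame. This is a minimal sketch with hypothetical titles and amounts, confirming where the new column lands and that insert refuses to add a column name that already exists:

```python
import pandas as pd

# Hypothetical three-column stand-in for the movie dataset
toy = pd.DataFrame({'title': ['A', 'B'],
                    'gross': [100, 80],
                    'budget': [60, 90]})

# Integer position immediately after the gross column
profit_index = toy.columns.get_loc('gross') + 1

# insert modifies toy in-place and returns None
toy.insert(loc=profit_index,
           column='profit',
           value=toy['gross'] - toy['budget'])

print(toy.columns.tolist())
print(toy['profit'].tolist())

# Inserting the same name again raises ValueError by default
try:
    toy.insert(loc=0, column='profit', value=0)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```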

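An alternative to deleting columns with the drop method is to use the del statement, which removes the column from the DataFrame in-place. Had we not already dropped actor_director_facebook_likes earlier in the recipe, we could delete it with: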

>>> del movie['actor_director_facebook_likes']
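The difference between in-place deletion and drop can be sketched on a hypothetical DataFrame. The pop method is included here as a further in-place option that is not covered in the recipe itself; it removes a column and returns it as a Series.

```python
import pandas as pd

# Hypothetical DataFrame with three columns
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# drop returns a new DataFrame; toy still has all three columns
dropped = toy.drop('a', axis='columns')
columns_after_drop = toy.columns.tolist()

# del removes the column from toy itself
del toy['a']

# pop also removes in-place, and hands back the removed column
popped = toy.pop('b')

print(columns_after_drop)
print(toy.columns.tolist())
print(popped.tolist())
```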
