Calling Series methods

Utilizing the single-dimensional Series is an integral part of all data analysis with pandas. A typical workflow will have you going back and forth between executing statements on Series and DataFrames. Calling Series methods is the primary way to use the abilities that the Series offers.

Getting ready

Both Series and DataFrames have a tremendous amount of power. We can use the dir function to uncover all the attributes and methods of a Series. Additionally, we can find the number of attributes and methods common to both Series and DataFrames. Both of these objects share the vast majority of attribute and method names:

>>> s_attr_methods = set(dir(pd.Series))
>>> len(s_attr_methods)
442

>>> df_attr_methods = set(dir(pd.DataFrame))
>>> len(df_attr_methods)
445

>>> len(s_attr_methods & df_attr_methods)
376
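
If you are curious which names belong to only one of the two objects, the same sets can be differenced. The following is a quick sketch; the exact counts depend on your pandas version, but they follow from the numbers above (442 - 376 and 445 - 376):

>>> len(s_attr_methods - df_attr_methods)   # names unique to Series
66
>>> len(df_attr_methods - s_attr_methods)   # names unique to DataFrames
69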

This recipe covers the most common and powerful Series methods. Many of the methods are nearly equivalent for DataFrames.

How to do it...

  1. After reading in the movies dataset, let's select two Series with different data types. The director_name column contains strings, formally an object data type, and the column actor_1_facebook_likes contains numerical data, formally float64:
>>> movie = pd.read_csv('data/movie.csv')
>>> director = movie['director_name']
>>> actor_1_fb_likes = movie['actor_1_facebook_likes']
  2. Inspect the head of each Series:
>>> director.head()
0        James Cameron
1       Gore Verbinski
2           Sam Mendes
3    Christopher Nolan
4          Doug Walker
Name: director_name, dtype: object

>>> actor_1_fb_likes.head()
0      1000.0
1     40000.0
2     11000.0
3     27000.0
4       131.0
Name: actor_1_facebook_likes, dtype: float64
  3. The data type of the Series usually determines which of the methods will be the most useful. For instance, one of the most useful methods for the object data type Series is value_counts, which counts all the occurrences of each unique value:
>>> director.value_counts()
Steven Spielberg        26
Woody Allen             22
Martin Scorsese         20
Clint Eastwood          20
                        ..
Fatih Akin               1
Analeine Cal y Mayor     1
Andrew Douglas           1
Scott Speer              1
Name: director_name, Length: 2397, dtype: int64
  4. The value_counts method is typically more useful for Series with object data types but can occasionally provide insight into numeric Series as well. Used with actor_1_fb_likes, it appears that higher numbers have been rounded to the nearest thousand, as it is unlikely that so many movies received exactly 1,000 likes:
>>> actor_1_fb_likes.value_counts()
1000.0     436
11000.0    206
2000.0     189
3000.0     150
           ...
216.0        1
859.0        1
225.0        1
334.0        1
Name: actor_1_facebook_likes, Length: 877, dtype: int64

  5. Counting the number of elements in the Series may be done with the size or shape attribute or the len function:
>>> director.size
4916
>>> director.shape
(4916,)
>>> len(director)
4916
  6. Additionally, there is the useful but confusing count method that returns the number of non-missing values:
>>> director.count()
4814
>>> actor_1_fb_likes.count()
4909
  7. Basic summary statistics may be calculated with the min, max, mean, median, std, and sum methods:
>>> actor_1_fb_likes.min(), actor_1_fb_likes.max(), \
actor_1_fb_likes.mean(), actor_1_fb_likes.median(), \
actor_1_fb_likes.std(), actor_1_fb_likes.sum()
(0.0, 640000.0, 6494.488490527602, 982.0, 15106.98, 31881444.0)
  8. To simplify step 7, you may use the describe method to return both the summary statistics and a few of the quantiles at once. When describe is used with an object data type column, a completely different output is returned:
>>> actor_1_fb_likes.describe()
count      4909.000000
mean       6494.488491
std       15106.986884
min           0.000000
25%         607.000000
50%         982.000000
75%       11000.000000
max      640000.000000
Name: actor_1_facebook_likes, dtype: float64

>>> director.describe()
count 4814
unique 2397
top Steven Spielberg
freq 26
Name: director_name, dtype: object

  9. The quantile method exists to calculate an exact quantile of numeric data:
>>> actor_1_fb_likes.quantile(.2)
510.0

>>> actor_1_fb_likes.quantile([.1, .2, .3, .4, .5,
.6, .7, .8, .9])
0.1      240.0
0.2      510.0
0.3      694.0
0.4      854.0
         ...
0.6     1000.0
0.7     8000.0
0.8    13000.0
0.9    18000.0
Name: actor_1_facebook_likes, Length: 9, dtype: float64
  10. Since the count method in step 6 returned a value less than the total number of Series elements found in step 5, we know that there are missing values in each Series. The isnull method may be used to determine whether each individual value is missing or not. The result will be a Series of booleans the same length as the original Series:
>>> director.isnull()
0       False
1       False
2       False
3       False
        ...
4912     True
4913    False
4914    False
4915    False
Name: director_name, Length: 4916, dtype: bool
  11. It is possible to replace all missing values within a Series with the fillna method:
>>> actor_1_fb_likes_filled = actor_1_fb_likes.fillna(0)
>>> actor_1_fb_likes_filled.count()
4916
  12. To remove the Series elements with missing values, use dropna:
>>> actor_1_fb_likes_dropped = actor_1_fb_likes.dropna()
>>> actor_1_fb_likes_dropped.size
4909

How it works...

Passing a string to the indexing operator of a DataFrame selects a single column as a Series. The methods used in this recipe were chosen because of how frequently they are used in data analysis.

The steps in this recipe should be straightforward with easily interpretable output. Even though the output is easy to read, you might lose track of the returned object. Is it a scalar value, a tuple, another Series, or some other Python object? Take a moment, and look at the output returned after each step. Can you name the returned object?
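
If you are ever unsure, a quick sanity check is to wrap the expression in Python's built-in type function. The following minimal sketch uses the Series created earlier; the exact class names shown are what current pandas and NumPy report and may vary slightly between versions:

>>> type(director.head())
<class 'pandas.core.series.Series'>
>>> type(director.shape)
<class 'tuple'>
>>> type(actor_1_fb_likes.sum())
<class 'numpy.float64'>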

The result from the head method in step 2 is another Series. The value_counts method also produces a Series but has the unique values from the original Series as the index and the count as its values. In steps 5 and 6, size, len, and count return scalar values, but shape returns a one-item tuple.

It seems odd that the shape attribute returns a one-item tuple, but this is a convention borrowed from NumPy, which allows for arrays of any number of dimensions.

In step 7, each individual method returns a scalar value, and the results are output together as a tuple. This is because Python treats an expression composed of only comma-separated values without parentheses as a tuple.
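
This is plain Python behavior rather than anything specific to pandas, as a short experiment confirms:

>>> result = actor_1_fb_likes.min(), actor_1_fb_likes.max()
>>> type(result)
<class 'tuple'>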

In step 8, describe returns a Series with all the summary statistic names as the index and the actual statistic as the values.

In step 9, quantile is flexible and returns a scalar value when passed a single value but returns a Series when given a list.

From steps 10, 11, and 12, isnull, fillna, and dropna all return a Series.

There's more...

The value_counts method is one of the most informative Series methods and heavily used during exploratory analysis, especially with categorical columns. It defaults to returning the counts, but by setting the normalize parameter to True, the relative frequencies are returned instead, which provides another view of the distribution:

>>> director.value_counts(normalize=True)
Steven Spielberg        0.005401
Woody Allen             0.004570
Martin Scorsese         0.004155
Clint Eastwood          0.004155
                          ...
Fatih Akin              0.000208
Analeine Cal y Mayor    0.000208
Andrew Douglas          0.000208
Scott Speer             0.000208
Name: director_name, Length: 2397, dtype: float64

In this recipe, we determined that there were missing values in the Series by observing that the result from the count method did not match the size attribute. A more direct approach is to use the hasnans attribute:

>>> director.hasnans
True
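
If you prefer an explicit computation over this cached attribute, chaining isnull with the any method should give the same answer, at the cost of scanning the values:

>>> director.isnull().any()
True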

There exists a complement of isnull: the notnull method, which returns True for all the non-missing values:

>>> director.notnull()
0        True
1        True
2        True
3        True
        ...
4912    False
4913     True
4914     True
4915     True
Name: director_name, Length: 4916, dtype: bool
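
As a side note, selecting with a notnull boolean Series is one way to reproduce what dropna did in step 12; the resulting length matches the count from step 6:

>>> director[director.notnull()].size
4814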

See also

  • To call many Series methods in succession, refer to the Chaining Series methods together recipe in this chapter