Pandas 1.x Cookbook - Second Edition

By: Matt Harrison, Theodore Petrou

Overview of this book

The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands as one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through situations that you are highly likely to encounter. This new updated and revised edition provides you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas. Some recipes focus on achieving a deeper understanding of basic principles, or comparing and contrasting two similar operations. Other recipes will dive deep into a particular dataset, uncovering new and unexpected insights along the way. Many advanced recipes combine several different features across the pandas library to generate results.

Series operations

There exist a vast number of operators in Python for manipulating objects. For instance, when the plus operator is placed between two integers, Python will add them together:

>>> 5 + 9  # plus operator example. Adds 5 and 9
14

Series and DataFrames support many of the Python operators. Typically, a new Series or DataFrame is returned when using an operator.

In this recipe, a variety of operators will be applied to different Series objects to produce a new Series with completely different values.

How to do it…

Select the imdb_score column as a Series:

>>> import pandas as pd
>>> movies = pd.read_csv("data/movie.csv")
>>> imdb_score = movies["imdb_score"]
>>> imdb_score
0       7.9
1       7.1
2       6.8
3       8.5
4       7.1
       ... 
4911    7.7
4912    7.5
4913    6.3
4914    6.3
4915    6.6
Name: imdb_score, Length: 4916, dtype: float64

Use the plus operator to add one to each Series element:

>>> imdb_score + 1
0       8.9
1       8.1
2       7.8
3       9.5
4       8.1
       ... 
4911    8.7
4912    8.5
4913    7.3
4914    7.3
4915    7.6
Name: imdb_score, Length: 4916, dtype: float64

The other basic arithmetic operators, subtraction (-), multiplication (*), division (/), and exponentiation (**), work similarly with scalar values. In this step, we will multiply the Series by 2.5:

>>> imdb_score * 2.5
0       19.75
1       17.75
2       17.00
3       21.25
4       17.75
        ...  
4911    19.25
4912    18.75
4913    15.75
4914    15.75
4915    16.50
Name: imdb_score, Length: 4916, dtype: float64

Python uses a double slash (//) for floor division. The floor division operator rounds the quotient down to the nearest whole number, that is, toward negative infinity, so -7 // 2 is -4 rather than -3. The percent sign (%) is the modulus operator, which returns the remainder after a division. Series instances also support these operations:

>>> imdb_score // 7
0       1.0
1       1.0
2       0.0
3       1.0
4       1.0
       ... 
4911    1.0
4912    1.0
4913    0.0
4914    0.0
4915    0.0
Name: imdb_score, Length: 4916, dtype: float64
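
The difference between flooring and truncating only shows up with negative values, which the imdb_score column does not contain. The following is a minimal sketch using a small hypothetical Series (the values are made up for illustration and are not from the movie dataset):

```python
import pandas as pd

# Hypothetical values, chosen to include a negative number.
s = pd.Series([7.9, 6.8, -6.8])

floored = s // 7     # floor division rounds toward negative infinity
remainder = s % 7    # the remainder takes the sign of the divisor

# Truncating the plain quotient instead would round toward zero.
truncated = int(-6.8 / 7)

# Floor division and modulus fit together: s == (s // 7) * 7 + (s % 7),
# up to floating-point rounding.
reconstructed = floored * 7 + remainder
```

Here floored holds 1.0, 0.0, and -1.0, whereas truncating the last quotient gives 0.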

There are six comparison operators: greater than (>), less than (<), greater than or equal to (>=), less than or equal to (<=), equal to (==), and not equal to (!=). Each comparison operator turns each value in the Series into True or False based on the outcome of the condition. The result is a Boolean Series, which we will see is very useful for filtering in later recipes:

>>> imdb_score > 7
0        True
1        True
2       False
3        True
4        True
        ...  
4911     True
4912     True
4913    False
4914    False
4915    False
Name: imdb_score, Length: 4916, dtype: bool
>>> director = movies["director_name"]
>>> director == "James Cameron"
0        True
1       False
2       False
3       False
4       False
        ...  
4911    False
4912    False
4913    False
4914    False
4915    False
Name: director_name, Length: 4916, dtype: bool
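
As a brief preview of why a Boolean Series is useful, the sketch below counts and filters values. It assumes a small stand-in Series built from the first five scores shown earlier, rather than the full movie dataset:

```python
import pandas as pd

# Stand-in for the first five imdb_score values shown above.
scores = pd.Series([7.9, 7.1, 6.8, 8.5, 7.1], name="imdb_score")

above_seven = scores > 7           # Boolean Series, same index as scores

# True counts as 1 and False as 0 in arithmetic.
count_above = above_seven.sum()    # how many scores exceed 7
share_above = above_seven.mean()   # fraction of scores that exceed 7

# Passing the Boolean Series to the indexing operator keeps the True rows.
high_scores = scores[above_seven]
```

Later recipes cover this kind of filtering in detail; this is only meant to show where the comparison result ends up being used.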

How it works…

All the operators used in this recipe apply the same operation to each element in the Series. In native Python, this would require a for loop to iterate through each of the items in the sequence before applying the operation. pandas relies heavily on the NumPy library, which allows for vectorized computations, or the ability to operate on entire sequences of data without the explicit writing of for loops. Each operation returns a new Series with the same index, but with the new values.
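
To make the contrast concrete, here is a minimal sketch that adds one to each value twice: once with an explicit for loop and once with the vectorized operator. The five values are a stand-in for the imdb_score column:

```python
import pandas as pd

scores = pd.Series([7.9, 7.1, 6.8, 8.5, 7.1])

# Native Python style: visit each item and apply the operation.
looped = []
for value in scores:
    looped.append(value + 1)

# Vectorized style: one expression, no explicit loop.
vectorized = scores + 1

# Both approaches give the same values, and the new Series keeps the index.
same_values = vectorized.tolist() == looped
same_index = vectorized.index.equals(scores.index)
```

Note that the loop produces a plain list, so the index is lost unless we rebuild a Series ourselves.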

There's more…

All of the operators used in this recipe have method equivalents that produce the exact same result. For instance, the imdb_score + 1 expression from the second step can be reproduced with the .add method.

Using the method rather than the operator can be useful when we chain methods together.

Here are a few examples:

>>> imdb_score.add(1)  # imdb_score + 1
0       8.9
1       8.1
2       7.8
3       9.5
4       8.1
       ... 
4911    8.7
4912    8.5
4913    7.3
4914    7.3
4915    7.6
Name: imdb_score, Length: 4916, dtype: float64
>>> imdb_score.gt(7)  # imdb_score > 7
0        True
1        True
2       False
3        True
4        True
        ...  
4911     True
4912     True
4913    False
4914    False
4915    False
Name: imdb_score, Length: 4916, dtype: bool
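
The chaining benefit mentioned above can be sketched as follows. The threshold of 21 and the five-value Series are arbitrary choices for illustration:

```python
import pandas as pd

scores = pd.Series([7.9, 7.1, 6.8, 8.5, 7.1])

# With methods, each step reads top to bottom.
chained = (
    scores
    .add(1)      # scores + 1
    .mul(2.5)    # ... * 2.5
    .gt(21)      # ... > 21
)

# The operator version needs nested parentheses to get the same order.
with_operators = ((scores + 1) * 2.5) > 21
```

Both expressions produce the same Boolean Series; which one reads better is largely a matter of taste and of how long the chain becomes.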

Why does pandas offer a method equivalent to these operators? By its nature, an operator only operates in exactly one manner. Methods, on the other hand, can have parameters that allow you to alter their default functionality.

Other recipes will dive into this further, but here is a small example. The .sub method performs subtraction on a Series. When you subtract with the - operator, missing values propagate: a missing operand produces a missing result. The .sub method, however, allows you to specify a fill_value parameter to use in place of missing values:

>>> money = pd.Series([100, 20, None])
>>> money - 15
0    85.0
1     5.0
2     NaN
dtype: float64
>>> money.sub(15, fill_value=0)
0    85.0
1     5.0
2   -15.0
dtype: float64
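
The fill_value parameter also applies when subtracting one Series from another whose index does not match. The following sketch assumes two small hypothetical Series (income and costs are made-up names and values):

```python
import pandas as pd

# Hypothetical data: "c" appears in income but not in costs.
income = pd.Series([100, 20, 50], index=["a", "b", "c"])
costs = pd.Series([15, 5], index=["a", "b"])

# The operator aligns on the index; "c" has no partner, so the result is NaN.
with_operator = income - costs

# The method substitutes 0 for the missing partner before subtracting.
with_method = income.sub(costs, fill_value=0)
```

With the operator, the "c" entry is missing; with the method, it is 50.0.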

Following is a table of operators and the corresponding methods:

Operator group	Operator	Series method name
Arithmetic	`+`, `-`, `*`, `/`, `//`, `%`, `**`	`.add`, `.sub`, `.mul`, `.div`, `.floordiv`, `.mod`, `.pow`
Comparison	`<`,`>`,`<=`,`>=`,`==`,`!=`	`.lt`, `.gt`, `.le`, `.ge`, `.eq`, `.ne`

You may be curious as to how a pandas Series object, or any object for that matter, knows what to do when it encounters an operator. For example, how does the expression imdb_score * 2.5 know to multiply each element in the Series by 2.5? Python has a built-in, standardized way for objects to communicate with operators using special methods.

Special methods are what objects call internally whenever they encounter an operator. Special methods always begin and end with two underscores. Because of this, they are also called dunder methods as the method that implements the operator is surrounded by double underscores (dunder being a lazy geeky programmer way of saying "double underscores"). For instance, the special method .__mul__ is called whenever the multiplication operator is used. Python interprets the imdb_score * 2.5 expression as imdb_score.__mul__(2.5).

There is no difference between using the special method and using an operator, as they do the exact same thing. The operator is just syntactic sugar for the special method. However, calling the .mul method is different from calling the .__mul__ method: .mul accepts additional parameters, such as fill_value, that the operator and its special method do not.
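
To see the special-method mechanism in isolation, the sketch below defines a toy class with its own .__mul__ method and then compares the three spellings on a real Series. The Scores class is hypothetical and exists only for this illustration; it is not part of pandas:

```python
import pandas as pd


class Scores:
    """Hypothetical toy container used only to illustrate __mul__."""

    def __init__(self, values):
        self.values = list(values)

    def __mul__(self, other):
        # Python calls this method when it evaluates `Scores(...) * other`.
        return Scores(v * other for v in self.values)


toy = Scores([7.9, 7.1])
via_operator = toy * 2.5          # syntactic sugar for the line below
via_dunder = toy.__mul__(2.5)

# The same equivalence holds for a pandas Series.
money = pd.Series([100, 20, None])
operator_result = money * 2
dunder_result = money.__mul__(2)

# The .mul method is different: it takes extra parameters such as fill_value.
method_result = money.mul(2, fill_value=1)
```

The operator and the special method leave the missing value missing, while .mul with fill_value=1 turns it into 2.0.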
