Pandas 1.x Cookbook - Second Edition

By : Matt Harrison, Theodore Petrou

Pandas 1.x Cookbook - Second Edition

By: Matt Harrison, Theodore Petrou

Overview of this book

The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands as one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through situations that you are highly likely to encounter. This new updated and revised edition provides you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas. Some recipes focus on achieving a deeper understanding of basic principles, or comparing and contrasting two similar operations. Other recipes will dive deep into a particular dataset, uncovering new and unexpected insights along the way. Many advanced recipes combine several different features across the pandas library to generate results.

Preface

Who this book is for

What this book covers

To get the most out of this book

Pandas Foundations

Understanding data types

Selecting a column

Calling Series methods

Series operations

Chaining Series methods

Renaming column names

Creating and deleting columns

Free Chapter

Essential DataFrame Operations

Introduction

Selecting multiple DataFrame columns

Selecting columns with methods

Ordering column names

Summarizing a DataFrame

Chaining DataFrame methods

DataFrame operations

Comparing missing values

Transposing the direction of a DataFrame operation

Determining college campus diversity

Creating and Persisting DataFrames

Introduction

Creating DataFrames from scratch

Writing CSV

Reading large CSV files

Using Excel files

Working with ZIP files

Working with databases

Reading JSON

Reading HTML tables

Beginning Data Analysis

Introduction

Developing a data analysis routine

Data dictionaries

Reducing memory by changing data types

Selecting the smallest of the largest

Selecting the largest of each group by sorting

Replicating nlargest with sort_values

Calculating a trailing stop order price

Exploratory Data Analysis

Comparing continuous values across categories

Comparing two continuous columns

Comparing categorical values with categorical values

Using the pandas profiling library

Selecting Subsets of Data

Introduction

Selecting Series data

Selecting DataFrame rows

Selecting DataFrame rows and columns simultaneously

Selecting data with both integers and labels

Slicing lexicographically

Filtering Rows

Introduction

Calculating Boolean statistics

Constructing multiple Boolean conditions

Filtering with Boolean arrays

Comparing row filtering and index filtering

Selecting with unique and sorted indexes

Translating SQL WHERE clauses

Improving the readability of Boolean indexing with the query method

Preserving Series size with the .where method

Masking DataFrame rows

Selecting with Booleans, integer location, and labels

Index Alignment

Introduction

Examining the Index object

Producing Cartesian products

Exploding indexes

Filling values with unequal indexes

Adding columns from different DataFrames

Highlighting the maximum value from each column

Replicating idxmax with method chaining

Finding the most common maximum of columns

Grouping for Aggregation, Filtration, and Transformation

Introduction

Defining an aggregation

Grouping and aggregating with multiple columns and functions

Removing the MultiIndex after grouping

Grouping with a custom aggregation function

Customizing aggregating functions with *args and **kwargs

Examining the groupby object

Filtering for states with a minority majority

Transforming through a weight loss bet

Calculating weighted mean SAT scores per state with apply

Grouping by continuous variables

Counting the total number of flights between cities

Finding the longest streak of on-time flights

Restructuring Data into a Tidy Form

Introduction

Tidying variable values as column names with stack

Tidying variable values as column names with melt

Stacking multiple groups of variables simultaneously

Inverting stacked data

Unstacking after a groupby aggregation

Replicating pivot_table with a groupby aggregation

Renaming axis levels for easy reshaping

Tidying when multiple variables are stored as column names

Tidying when multiple variables are stored as a single column

Tidying when two or more values are stored in the same cell

Tidying when variables are stored in column names and values

Combining Pandas Objects

Introduction

Appending new rows to DataFrames

Concatenating multiple DataFrames together

Understanding the differences between concat, join, and merge

Connecting to SQL databases

Time Series Analysis

Introduction

Understanding the difference between Python and pandas date tools

Slicing time series intelligently

Filtering columns with time data

Using methods that only work with a DatetimeIndex

Counting the number of weekly crimes

Aggregating weekly crime and traffic accidents separately

Measuring crime by weekday and year

Grouping with anonymous functions with a DatetimeIndex

Grouping by a Timestamp and another column

Visualization with Matplotlib, Pandas, and Seaborn

Introduction

Getting started with matplotlib

Object-oriented guide to matplotlib

Visualizing data with matplotlib

Plotting basics with pandas

Visualizing the flights dataset

Stacking area charts to discover emerging trends

Understanding the differences between seaborn and pandas

Multivariate analysis with seaborn Grids

Uncovering Simpson's Paradox in the diamonds dataset with seaborn

Debugging and Testing Pandas

Code to transform data

Apply performance

Improving apply performance with Dask, Pandarell, Swifter, and more

Inspecting code

Debugging in Jupyter

Managing data integrity with Great Expectations

Using pytest with pandas

Generating tests with Hypothesis

Other Books You May Enjoy

Index

Customer Reviews

5 star

4 star

3 star

2 star

1 star

Selecting columns with methods

Although column selection is usually done with the indexing operator, there are some DataFrame methods that facilitate their selection in an alternative manner. The .select_dtypes and .filter methods are two useful methods to do this.

If you want to select by type, you need to be familiar with pandas data types. The Understanding data types recipe in Chapter 1, Pandas Foundations, explains the types.

How to do it...

Read in the movie dataset. Shorten the column names for display. Use the .get_dtype_counts method to output the number of columns with each specific data type:

>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
...     return (
...         str(col)
...         .replace("facebook_likes", "fb")
...         .replace("_for_reviews", "")
...     )
>>> movies = movies.rename(columns=shorten)
>>> movies.dtypes.value_counts()
float64    13
int64       3
object     12
dtype: int64

Use the .select_dtypes method to select only the integer columns:

>>> movies.select_dtypes(include="int").head()
   num_voted_users  cast_total_fb  movie_fb
0           886204           4834     33000
1           471220          48350         0
2           275868          11700     85000
3          1144337         106759    164000
4                8            143         0

If you would like to select all the numeric columns, you may pass the string number to the include parameter:

>>> movies.select_dtypes(include="number").head()
   num_critics  duration  ...  aspect_ratio  movie_fb
0        723.0     178.0  ...         1.78      33000
1        302.0     169.0  ...         2.35          0
2        602.0     148.0  ...         2.35      85000
3        813.0     164.0  ...         2.35     164000
4          NaN       NaN  ...          NaN          0

If we wanted integer and string columns we could do the following:

>>> movies.select_dtypes(include=["int", "object"]).head()
   color        direc/_name  ... conte/ating movie_fb
0  Color      James Cameron  ...       PG-13    33000
1  Color     Gore Verbinski  ...       PG-13        0
2  Color         Sam Mendes  ...       PG-13    85000
3  Color  Christopher Nolan  ...       PG-13   164000
4    NaN        Doug Walker  ...         NaN        0

To exclude only floating-point columns, do the following:

>>> movies.select_dtypes(exclude="float").head()
   color director_name  ... content_rating movie_fb
0  Color  James Ca...   ...        PG-13      33000
1  Color  Gore Ver...   ...        PG-13          0
2  Color   Sam Mendes   ...        PG-13      85000
3  Color  Christop...   ...        PG-13     164000
4    NaN  Doug Walker   ...          NaN          0

An alternative method to select columns is with the .filter method. This method is flexible and searches column names (or index labels) based on which parameter is used. Here, we use the like parameter to search for all the Facebook columns or the names that contain the exact string, fb. The like parameter is checking for substrings in column names:

>>> movies.filter(like="fb").head()
   director_fb  actor_3_fb  ...  actor_2_fb  movie_fb
0          0.0       855.0  ...       936.0     33000
1        563.0      1000.0  ...      5000.0         0
2          0.0       161.0  ...       393.0     85000
3      22000.0     23000.0  ...     23000.0    164000
4        131.0         NaN  ...        12.0         0

The .filter method has more tricks (or parameters) up its sleeve. If you use the items parameters, you can pass in a list of column names:

>>> cols = [
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
...     "director_name",
... ]
>>> movies.filter(items=cols).head()
      actor_1_name  ...      director_name
0      CCH Pounder  ...      James Cameron
1      Johnny Depp  ...     Gore Verbinski
2  Christoph Waltz  ...         Sam Mendes
3        Tom Hardy  ...  Christopher Nolan
4      Doug Walker  ...        Doug Walker

The .filter method allows columns to be searched with regular expressions using the regex parameter. Here, we search for all columns that have a digit somewhere in their name:

>>> movies.filter(regex=r"\d").head()
   actor_3_fb actor_2_name  ...  actor_3_name actor_2_fb
0       855.0  Joel Dav...  ...    Wes Studi       936.0
1      1000.0  Orlando ...  ...  Jack Dav...      5000.0
2       161.0  Rory Kin...  ...  Stephani...       393.0
3     23000.0  Christia...  ...  Joseph G...     23000.0
4         NaN   Rob Walker  ...          NaN        12.0

How it works...

Step 1 lists the frequencies of all the different data types. Alternatively, you may use the .dtypes attribute to get the exact data type for each column. The .select_dtypes method accepts either a list or single data type in its include or exclude parameters and returns a DataFrame with columns of just those given data types (or not those types if excluding columns). The list values may be either the string name of the data type or the actual Python object.

The .filter method selects columns by only inspecting the column names and not the actual data values. It has three mutually exclusive parameters: items, like, and regex, only one of which can be used at a time.

The like parameter takes a string and attempts to find all the column names that contain that exact string somewhere in the name. To gain more flexibility, you may use the regex parameter instead to select column names through a regular expression. This particular regular expression, r'\d', represents all digits from zero to nine and matches any string with at least a single digit in it.

The filter method comes with another parameter, items, which takes a list of exact column names. This is nearly an exact duplication of the index operation, except that a KeyError will not be raised if one of the strings does not match a column name. For instance, movies.filter(items=['actor_1_name', 'asdf']) runs without error and returns a single column DataFrame.

There's more...

One confusing aspect of .select_dtypes is its flexibility to take both strings and Python objects. The following list should clarify all the possible ways to select the many different column data types. There is no standard or preferred method of referring to data types in pandas, so it's good to be aware of both ways:

np.number, 'number' – Selects both integers and floats regardless of size
np.float64, np.float_, float, 'float64', 'float_', 'float' – Selects only 64-bit floats
np.float16, np.float32, np.float128, 'float16', 'float32', 'float128' – Respectively selects exactly 16, 32, and 128-bit floats
np.floating, 'floating' – Selects all floats regardless of size
np.int0, np.int64, np.int_, int, 'int0', 'int64', 'int_', 'int' – Selects only 64-bit integers
np.int8, np.int16, np.int32, 'int8', 'int16', 'int32' – Respectively selects exactly 8, 16, and 32-bit integers
np.integer, 'integer' – Selects all integers regardless of size
'Int64' – Selects nullable integer; no NumPy equivalent
np.object, 'object', 'O' – Select all object data types
np.datetime64, 'datetime64', 'datetime' – All datetimes are 64 bits
np.timedelta64, 'timedelta64', 'timedelta' – All timedeltas are 64 bits
pd.Categorical, 'category' – Unique to pandas; no NumPy equivalent

Because all integers and floats default to 64 bits, you may select them by using the string 'int' or 'float' as you can see from the preceding bullet list. If you want to select all integers and floats regardless of their specific size, use the string 'number'.

Pandas 1.x Cookbook - Second Edition

By : Matt Harrison, Theodore Petrou

Pandas 1.x Cookbook - Second Edition

By: Matt Harrison, Theodore Petrou

Overview of this book

Related Content you might be interested in

Current Title:

Pandas 1.x Cookbook - Second Edition

Learning pandas

Python Data Cleaning Cookbook

Mastering Exploratory Analysis with pandas

Selecting columns with methods

How to do it...

How it works...

There's more...