Pandas 1.x Cookbook - Second Edition

By: Matt Harrison, Theodore Petrou

Overview of this book

The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands as one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through situations that you are highly likely to encounter. This new updated and revised edition provides you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas. Some recipes focus on achieving a deeper understanding of basic principles, or comparing and contrasting two similar operations. Other recipes will dive deep into a particular dataset, uncovering new and unexpected insights along the way. Many advanced recipes combine several different features across the pandas library to generate results.


Ordering column names

One of the first tasks to consider after importing a dataset as a DataFrame is to analyze the order of the columns. Many of us are used to reading from left to right, which affects how we interpret the data. It's far easier to find and interpret information when column order is given consideration.

There is no standardized set of rules that dictates how columns should be organized within a dataset. However, it is good practice to develop a set of guidelines that you consistently follow. This is especially true if you work with a group of analysts who share lots of datasets.

The following is a guideline for ordering columns:

Classify each column as either categorical or continuous
Group common columns within the categorical and continuous columns
Place the most important groups of columns first with categorical columns before continuous ones

This recipe shows you how to order the columns with this guideline. There are many possible orderings that are sensible.
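As a starting point for the first guideline, the column dtypes can suggest a rough split. The following is a minimal sketch, not part of the recipe: the small DataFrame and its values are made up for illustration, and treating every numeric column as continuous is an assumption that needs manual review. For example, the recipe places title_year among the categorical columns even though it is stored as a number.

```python
import pandas as pd

# Hypothetical miniature of the movie data; the values are made up.
df = pd.DataFrame(
    {
        "duration": [120.0, 95.0],
        "movie_title": ["A", "B"],
        "budget": [1.0e6, 2.5e6],
        "color": ["Color", "Black and White"],
    }
)

# Numeric dtypes are used as a rough proxy for "continuous";
# everything else is treated as "categorical".
continuous = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()

print(categorical)  # ['movie_title', 'color']
print(continuous)   # ['duration', 'budget']
```

The two lists only seed the manual grouping done later in the recipe; they do not replace it.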

How to do it...

1. Read in the movie dataset and shorten the longer column names so they are easier to scan:

>>> import pandas as pd
>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
...     return col.replace("facebook_likes", "fb").replace(
...         "_for_reviews", ""
...     )
>>> movies = movies.rename(columns=shorten)

2. Output all the column names and scan for similar categorical and continuous columns:

>>> movies.columns
Index(['color', 'director_name', 'num_critic', 'duration', 'director_fb',
       'actor_3_fb', 'actor_2_name', 'actor_1_fb', 'gross', 'genres',
       'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_fb',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user', 'language', 'country', 'content_rating',
       'budget', 'title_year', 'actor_2_fb', 'imdb_score', 'aspect_ratio',
       'movie_fb'],
      dtype='object')

3. The columns don't appear to have any logical ordering. Organize the names sensibly into lists so that the guideline given earlier is followed:

>>> cat_core = [
...     "movie_title",
...     "title_year",
...     "content_rating",
...     "genres",
... ]
>>> cat_people = [
...     "director_name",
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
... ]
>>> cat_other = [
...     "color",
...     "country",
...     "language",
...     "plot_keywords",
...     "movie_imdb_link",
... ]
>>> cont_fb = [
...     "director_fb",
...     "actor_1_fb",
...     "actor_2_fb",
...     "actor_3_fb",
...     "cast_total_fb",
...     "movie_fb",
... ]
>>> cont_finance = ["budget", "gross"]
>>> cont_num_reviews = [
...     "num_voted_users",
...     "num_user",
...     "num_critic",
... ]
>>> cont_other = [
...     "imdb_score",
...     "duration",
...     "aspect_ratio",
...     "facenumber_in_poster",
... ]

4. Concatenate all the lists together to get the final column order. Also, ensure that this list contains all the columns from the original:

>>> new_col_order = (
...     cat_core
...     + cat_people
...     + cat_other
...     + cont_fb
...     + cont_finance
...     + cont_num_reviews
...     + cont_other
... )
>>> set(movies.columns) == set(new_col_order)
True

5. Pass the list with the new column order to the indexing operator of the DataFrame to reorder the columns:

>>> movies[new_col_order].head()
   movie_title  title_year  ... aspect_ratio facenumber_in_poster
0       Avatar      2009.0  ...         1.78                  0.0
1  Pirates ...      2007.0  ...         2.35                  0.0
2      Spectre      2015.0  ...         2.35                  1.0
3  The Dark...      2012.0  ...         2.35                  0.0
4  Star War...         NaN  ...          NaN                  0.0

How it works...

You can select a subset of columns from a DataFrame with a list of specific column names. For instance, movies[['movie_title', 'director_name']] creates a new DataFrame with only the movie_title and director_name columns. Selecting columns by name is the default behavior of the indexing operator for a pandas DataFrame.
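The selection behavior described above can be tried on a toy DataFrame. This is a small sketch with made-up column names (a, b, c, and the missing z), not the movie data: the result follows the order of the list, and asking for a name that does not exist raises a KeyError.

```python
import pandas as pd

# Toy DataFrame with hypothetical column names.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# A list of names returns a new DataFrame whose columns follow the list order.
subset = df[["c", "a"]]
print(type(subset).__name__)  # DataFrame
print(list(subset.columns))   # ['c', 'a']

# A name that is not a column raises a KeyError.
missing_error = None
try:
    df[["a", "z"]]
except KeyError as err:
    missing_error = err
    print("KeyError:", err)
```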

Step 3 neatly organizes all of the column names into separate lists based on their type (categorical or continuous) and by how similar their data is. The most important columns, such as the title of the movie, are placed first.

Step 4 concatenates all of the lists of column names and validates that the new list contains exactly the same values as the original column names. Python sets are unordered, and the equality check is true only when every member of each set is also a member of the other. Because duplicates collapse in a set, this check would not catch a column that is listed twice. Manually ordering columns as in this recipe is susceptible to human error, as it is easy to forget a column in the new column list.
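When the set comparison fails, set differences show which names are at fault. The sketch below uses a toy DataFrame and hypothetical candidate lists rather than the movie data. It also adds a length check, since set equality alone still passes when a name appears twice.

```python
import pandas as pd

# Toy DataFrame with hypothetical column names.
df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# Case 1: a column was forgotten. The set difference names it.
forgot_one = ["c", "a"]
missing = sorted(set(df.columns) - set(forgot_one))
print(missing)  # ['b']

# Case 2: a column was listed twice. Set equality still holds,
# so the lengths are compared as well.
listed_twice = ["c", "a", "b", "a"]
same_members = set(df.columns) == set(listed_twice)
same_length = len(listed_twice) == len(df.columns)
print(same_members, same_length)  # True False
```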

Step 5 completes the reordering by passing the new column order as a list to the indexing operator. This new order is now much more sensible than the original.
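Note that the indexing operator returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a name if the new order should persist. A minimal sketch with a toy DataFrame and hypothetical names:

```python
import pandas as pd

# Toy DataFrame with hypothetical column names.
df = pd.DataFrame({"b": [1, 2], "a": [3, 4]})
new_order = ["a", "b"]

# Indexing returns a reordered copy; df itself keeps its column order.
reordered = df[new_order]
order_before = list(df.columns)
print(order_before)             # ['b', 'a']
print(list(reordered.columns))  # ['a', 'b']

# Rebind the name to keep working with the reordered columns.
df = df[new_order]
print(list(df.columns))         # ['a', 'b']
```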

There's more...

There are alternative guidelines for ordering columns besides the one suggested earlier. Hadley Wickham's seminal paper on Tidy Data suggests placing the fixed variables first, followed by measured variables. As this data does not come from a controlled experiment, there is some flexibility in determining which variables are fixed and which ones are measured. Good candidates for measured variables are those that we would like to predict, such as gross, budget, or imdb_score. With this kind of ordering, categorical and continuous variables can be mixed. For example, it might make more sense to place the column for the number of Facebook likes directly after the name of that actor. You can, of course, come up with your own guidelines for column order, as the computational parts are unaffected by it.
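One way to build such a mixed ordering is to generate the name and likes pairs in a loop. The following is only a sketch: the one-row DataFrame and its values are made up, the column names follow the shortened names used in this recipe, and placing gross last treats it as a measured variable, which is an assumption.

```python
import pandas as pd

# Hypothetical one-row stand-in for the movie data.
df = pd.DataFrame(
    {
        "actor_1_fb": [1000.0],
        "actor_2_name": ["B"],
        "actor_1_name": ["A"],
        "actor_2_fb": [500.0],
        "gross": [1.0e6],
        "movie_title": ["T"],
    }
)

# Pair each actor's name with that actor's Facebook likes.
people = []
for n in (1, 2):
    people += [f"actor_{n}_name", f"actor_{n}_fb"]

# Fixed variables first, the measured variable (gross) last.
mixed_order = ["movie_title"] + people + ["gross"]

# Same validation as in step 4 of the recipe.
assert set(mixed_order) == set(df.columns)

result = df[mixed_order]
print(list(result.columns))
```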
