Pandas Cookbook

By: Theodore Petrou

Overview of this book

This book will provide you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas 0.20. Some recipes focus on achieving a deeper understanding of basic principles, or on comparing and contrasting two similar operations. Other recipes dive deep into a particular dataset, uncovering new and unexpected insights along the way. The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands as one would during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through practical situations that you are highly likely to encounter. Many advanced recipes combine several different features across the pandas 0.20 library to generate results.

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Assumptions for every recipe

Pandas Foundations

Introduction

Dissecting the anatomy of a DataFrame

Accessing the main DataFrame components

Understanding data types

Selecting a single column of data as a Series

Calling Series methods

Working with operators on a Series

Chaining Series methods together

Making the index meaningful

Renaming row and column names

Creating and deleting columns

Essential DataFrame Operations

Introduction

Selecting multiple DataFrame columns

Selecting columns with methods

Ordering column names sensibly

Operating on the entire DataFrame

Chaining DataFrame methods together

Working with operators on a DataFrame

Comparing missing values

Transposing the direction of a DataFrame operation

Determining college campus diversity

Beginning Data Analysis

Introduction

Developing a data analysis routine

Reducing memory by changing data types

Selecting the smallest of the largest

Selecting the largest of each group by sorting

Replicating nlargest with sort_values

Calculating a trailing stop order price

Selecting Subsets of Data

Introduction

Selecting Series data

Selecting DataFrame rows

Selecting DataFrame rows and columns simultaneously

Selecting data with both integers and labels

Speeding up scalar selection

Slicing rows lazily

Slicing lexicographically

Boolean Indexing

Introduction

Calculating boolean statistics

Constructing multiple boolean conditions

Filtering with boolean indexing

Replicating boolean indexing with index selection

Selecting with unique and sorted indexes

Gaining perspective on stock prices

Translating SQL WHERE clauses

Determining the normality of stock market returns

Improving readability of boolean indexing with the query method

Preserving Series with the where method

Masking DataFrame rows

Selecting with booleans, integer location, and labels

Index Alignment

Introduction

Examining the Index object

Producing Cartesian products

Exploding indexes

Filling values with unequal indexes

Appending columns from different DataFrames

Highlighting the maximum value from each column

Replicating idxmax with method chaining

Finding the most common maximum

Grouping for Aggregation, Filtration, and Transformation

Introduction

Defining an aggregation

Grouping and aggregating with multiple columns and functions

Removing the MultiIndex after grouping

Customizing an aggregation function

Customizing aggregating functions with *args and **kwargs

Examining the groupby object

Filtering for states with a minority majority

Transforming through a weight loss bet

Calculating weighted mean SAT scores per state with apply

Grouping by continuous variables

Counting the total number of flights between cities

Finding the longest streak of on-time flights

Restructuring Data into a Tidy Form

Introduction

Tidying variable values as column names with stack

Tidying variable values as column names with melt

Stacking multiple groups of variables simultaneously

Inverting stacked data

Unstacking after a groupby aggregation

Replicating pivot_table with a groupby aggregation

Renaming axis levels for easy reshaping

Tidying when multiple variables are stored as column names

Tidying when multiple variables are stored as column values

Tidying when two or more values are stored in the same cell

Tidying when variables are stored in column names and values

Tidying when multiple observational units are stored in the same table

Combining Pandas Objects

Introduction

Appending new rows to DataFrames

Concatenating multiple DataFrames together

Comparing President Trump's and Obama's approval ratings

Understanding the differences between concat, join, and merge

Connecting to SQL databases

Time Series Analysis

Introduction

Understanding the difference between Python and pandas date tools

Slicing time series intelligently

Using methods that only work with a DatetimeIndex

Counting the number of weekly crimes

Aggregating weekly crime and traffic accidents separately

Measuring crime by weekday and year

Grouping with anonymous functions with a DatetimeIndex

Grouping by a Timestamp and another column

Finding the last time crime was 20% lower with merge_asof

Visualization with Matplotlib, Pandas, and Seaborn

Introduction

Getting started with matplotlib

Visualizing data with matplotlib

Plotting basics with pandas

Visualizing the flights dataset

Stacking area charts to discover emerging trends

Understanding the differences between seaborn and pandas

Doing multivariate analysis with seaborn Grids

Uncovering Simpson's paradox in the diamonds dataset with seaborn

Preface

The popularity of data science has skyrocketed since it was called The Sexiest Job of the 21st Century by the Harvard Business Review in 2012. It was ranked as the number one job by Glassdoor in both 2016 and 2017. Fueling this skyrocketing popularity of data science is the demand from industry. Several applications have made big splashes in the news, such as Netflix making better movie recommendations, IBM Watson defeating humans at Jeopardy, Tesla building self-driving cars, Major League Baseball teams finding undervalued prospects, and Google learning to identify cats on the internet.

Nearly every industry is finding ways to use data science to build new technology or provide deeper insights. Due to such noteworthy successes, an ever-present aura of hype seems to encapsulate data science. Most of the scientific progress backing this hype stems from the field of machine learning, which produces the algorithms that make the predictions responsible for artificial intelligence.

The fundamental building block for all machine learning algorithms is, of course, data. Companies have realized this, and there is no shortage of data. The business intelligence company Domo estimates that 90% of the world's data has been created in just the last two years. Although machine learning gets all the attention, it is completely reliant on the quality of the data that it is fed. Before data ever reaches the input layers of a machine learning algorithm, it must be prepared, and for data to be prepared properly, it needs to be explored thoroughly for basic understanding and to identify inaccuracies. Before data can be explored, it needs to be captured.

To summarize, we can cast the data science pipeline into three stages: data capturing, data exploration, and machine learning. There is a vast array of tools available to complete each stage of the pipeline. Pandas is the dominant tool in the scientific Python ecosystem for data exploration and analysis. It is tremendously capable of inspecting, cleaning, tidying, filtering, transforming, aggregating, and even visualizing (with some help) all types of data. It is not a tool for initially capturing the data, nor is it a tool to build machine learning models.
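
As a rough illustration of the exploration tasks just listed, the sketch below inspects, cleans, filters, and aggregates a tiny table. The data and the column names (city, year, sales) are made up for this example and do not come from the book's datasets.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration
df = pd.DataFrame({
    "city": ["Austin", "Austin", "Boston", "Boston", "Denver"],
    "year": [2016, 2017, 2016, 2017, 2017],
    "sales": [100.0, None, 80.0, 95.0, 60.0],
})

# Inspect: column data types and the number of missing sales values
dtypes = df.dtypes
n_missing = df["sales"].isna().sum()

# Clean: drop rows that have no sales figure
clean = df.dropna(subset=["sales"])

# Filter: keep only the rows from 2017
recent = clean[clean["year"] == 2017]

# Aggregate: total sales per city
totals = clean.groupby("city")["sales"].sum()
```

Each step returns a new object rather than modifying df, which is what makes it easy to chain such steps together in a longer analysis.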

For many data analysts and scientists who use Python, the vast majority of their work will be done using pandas. This is likely because the initial data exploration and preparation tend to take the most time. Some entire projects consist only of data exploration and have no machine learning component. Data scientists spend so much time on this stage that a timeless lore has arisen: "Data scientists spend 80% of their time cleaning the data and the other 20% complaining about cleaning the data."

Although there is an abundance of open source and free programming languages available to do data exploration, the field is currently dominated by just two players, Python and R. The two languages have vastly different syntax but are both very capable of doing data analysis and machine learning. One measure of popularity is the number of questions asked on the popular Q&A site, Stack Overflow (https://insights.stackoverflow.com/trends):

While this is not a true measure of usage, it is clear that both Python and R have become increasingly popular, likely due to their data science capabilities. It is interesting to note that the percentage of Python questions remained constant until the year 2012, when data science took off. What is probably most astonishing about this graph is that pandas questions now make up a whopping one percent of all the newest questions on Stack Overflow.

One of the reasons why Python has become a language of choice for data science is that it is a fairly easy language to learn and develop in, so it has a low barrier to entry. It is also free and open source, able to run on a variety of hardware and software, and a breeze to get up and running. It has a very large and active community with a substantial amount of free resources online. In my opinion, Python is one of the most fun languages to develop programs with. The syntax is clear, concise, and intuitive but, as with all languages, takes quite a long time to master.

As Python was not built for data analysis like R, the pandas syntax may not come as naturally as it does for some other Python libraries. This might actually be part of the reason why there are so many Stack Overflow questions on pandas. Despite its tremendous capabilities, pandas code can often be poorly written. One of the main aims of this book is to show performant and idiomatic pandas code.
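
To make the contrast concrete, here is a small sketch (not taken from the book's recipes) of the same task written two ways: an element-by-element Python loop, and a single vectorized expression. The values and the doubling rule are hypothetical.

```python
import pandas as pd

# Hypothetical prices, invented for illustration
prices = pd.Series([10.0, 12.5, 9.0, 11.0], name="price")

# Less idiomatic: loop over the values one at a time in Python
looped = []
for p in prices:
    looped.append(p * 2 if p > 10 else p)
looped = pd.Series(looped, name="price")

# More idiomatic: one vectorized expression.
# where() keeps values where the condition is True and
# substitutes the second argument everywhere else.
vectorized = prices.where(prices <= 10, prices * 2)
```

Both versions produce the same values; the vectorized form is shorter and pushes the work down into pandas instead of the Python interpreter, which generally matters more as the data grows.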

For all its greatness, Stack Overflow unfortunately perpetuates misinformation and is a source of lots of poorly written pandas code. This is actually not the fault of Stack Overflow or its community. Pandas is an open source project and has had numerous major changes, even recently, as it approaches its tenth year of existence in 2018. The upside of open source, though, is that new features get added to it all the time.

The recipes in this book were formulated through my experience working as a data scientist, building and hosting several week-long data exploration bootcamps, answering several hundred questions on Stack Overflow, and building tutorials for my local meetup group. The recipes not only offer idiomatic solutions to common data problems, but also take you on journeys through many real-world datasets, where surprising insights are often discovered. These recipes will also help you master the pandas library, which will give you a gigantic boost in productivity. There is a huge difference between those who have only cursory knowledge of pandas and those who have it mastered. There are so many interesting and fun tricks to solve your data problems that only become apparent if you truly know the library inside and out. Personally, I find pandas to be a delightful and fun tool to analyze data with, and I hope you enjoy your journey along with me. If you have questions, please feel free to reach out to me on Twitter: @TedPetrou.
