Learning pandas - Second Edition

By: Michael Heydt

Overview of this book

You will learn how to use pandas to perform data analysis in Python. You will start with an overview of data analysis and iteratively progress from modeling data, to accessing data from remote sources, performing numeric and statistical analysis, through indexing and performing aggregate analysis, and finally to visualizing statistical data and applying pandas to finance. With the knowledge you gain from this book, you will quickly learn pandas and how it can empower you in the exciting world of data manipulation, analysis and science.

Preface

What this book covers

What you need for this book

pandas and Data Analysis

Introducing pandas

Data manipulation, analysis, science, and pandas

The process of data analysis

Relating the book to the process

Concepts of data and analysis in our tour of pandas

Other Python libraries of value with pandas

Summary

Up and Running with pandas

Installation of Anaconda

IPython and Jupyter Notebook

Introducing the pandas Series and DataFrame

Visualization

Summary

Representing Univariate Data with the Series

Configuring pandas

Creating a Series

The .index and .values properties

The size and shape of a Series

Specifying an index at creation

Heads, tails, and takes

Retrieving values in a Series by label or position

Slicing a Series into subsets

Alignment via index labels

Performing Boolean selection

Re-indexing a Series

Modifying a Series in-place

Summary

Representing Tabular and Multivariate Data with the DataFrame

Configuring pandas

Creating DataFrame objects

Accessing data within a DataFrame

Selecting rows using Boolean selection

Selecting across both rows and columns

Summary

Manipulating DataFrame Structure

Configuring pandas

Renaming columns

Adding new columns with [] and .insert()

Adding columns through enlargement

Adding columns using concatenation

Reordering columns

Replacing the contents of a column

Deleting columns

Appending new rows

Concatenating rows

Adding and replacing rows via enlargement

Removing rows using .drop()

Removing rows using Boolean selection

Removing rows using a slice

Summary

Indexing Data

Configuring pandas

The importance of indexes

The pandas index types

Working with Indexes

Hierarchical indexing

Summary

Categorical Data

Configuring pandas

Creating Categoricals

Renaming categories

Appending new categories

Removing categories

Removing unused categories

Setting categories

Descriptive information of a Categorical

Munging school grades

Summary

Numerical and Statistical Methods

Configuring pandas

Performing numerical methods on pandas objects

Performing statistical processes on pandas objects

Summary

Accessing Data

Configuring pandas

Working with CSV and text/tabular format data

Reading and writing data in Excel format

Reading and writing JSON files

Reading HTML data from the web

Reading and writing HDF5 format files

Accessing CSV data on the web

Reading and writing from/to SQL databases

Reading data from remote data services

Summary

Tidying Up Your Data

Configuring pandas

What is tidying your data?

How to work with missing data

Handling duplicate data

Transforming data

Summary

Combining, Relating, and Reshaping Data

Configuring pandas

Concatenating data in multiple objects

Merging and joining data

Pivoting data to and from value and indexes

Stacking and unstacking

Performance benefits of stacked data

Summary

Data Aggregation

Configuring pandas

The split, apply, and combine (SAC) pattern

Data for the examples

Splitting data

Applying aggregate functions, transforms, and filters

Transforming groups of data

Filtering groups from aggregation

Summary

Time-Series Modelling

Setting up the IPython notebook

Representation of dates, time, and intervals

Introducing time-series data

Calculating new dates using offsets

Representing durations of time using Period

Handling holidays using calendars

Normalizing timestamps using time zones

Manipulating time-series data

Time-series moving-window operations

Summary

Visualization

Configuring pandas

Plotting basics with pandas

Creating time-series charts

Common plots used in statistical analyses

Manually rendering multiple plots in a single chart

Summary

Historical Stock Price Analysis

Setting up the IPython notebook

Obtaining and organizing stock data from Google

Plotting time-series prices

Plotting volume-series data

Calculating the simple daily percentage change in closing price

Calculating simple daily cumulative returns of a stock

Resampling data from daily to monthly returns

Analyzing distribution of returns

Performing a moving-average calculation

Comparison of average daily returns across stocks

Correlation of stocks based on the daily percentage change of the closing price

Calculating the volatility of stocks

Determining risk relative to expected returns

Summary

The process of data analysis

The primary goal of this book is to thoroughly teach you how to use pandas to manipulate data. But there is a secondary, and perhaps no less important, goal of showing how pandas fits into the processes that a data analyst/scientist performs in everyday life.

One description of the steps involved in the process of data analysis is given on the pandas website:

Munging and cleaning data
Analyzing/modeling
Organization into a form suitable for communication

This small list is a good initial definition, but it fails to cover the overall scope of the process and why many features implemented in pandas were created. The following expands upon this process and sets the framework for what is to come throughout this journey.

The process

The proposed process is one that will be referred to as The Data Process. Its stages, each described below, are ideation, retrieval, preparation, exploration, modeling, presentation, and reproduction.

This process sets up a framework for defining logical steps that are taken in working with data. For now, let's take a quick look at each of these steps in the process and some of the tasks that you as a data analyst using pandas will perform.

It is important to understand that this is not purely a linear process. It is best done in a highly interactive and agile/iterative manner.

Ideation

The first step in any data problem is to identify what it is you want to figure out. This is referred to as ideation: coming up with an idea of what we want to do and prove. Ideation generally relates to hypothesizing about patterns in data that can be used to make intelligent decisions.

These decisions are often within the context of a business, but also within other disciplines such as the sciences and research. The in-vogue thing right now is understanding the operations of businesses, as there are often copious amounts of money to be made in understanding data.

But what kinds of decisions are we typically looking to make? The following are several questions for which answers are commonly sought:

Why did something happen?
Can we predict the future using historical data?
How can I optimize operations in the future?

This list is by no means exhaustive, but it does cover a sizable percentage of the reasons why anyone undertakes these endeavors. To get answers to these questions, one must be involved with collecting and understanding data relative to the problem. This involves defining what data is going to be researched, what the benefit is of the research, how the data is going to be obtained, what the success criteria are, and how the information is going to be eventually communicated.

pandas itself does not provide tools to assist in ideation. But once you have gained understanding and skill in using pandas, you will naturally realize how pandas helps you formulate ideas. This is because you will be armed with a powerful tool you can use to frame many complicated hypotheses.

Retrieval

Once you have an idea, you must then find data to try to support your hypothesis. This data can come from within your organization or from external data providers. It is normally provided as archived data, but it can also be provided in real time (although pandas is not well known for being a real-time data processing tool).

Data is often very raw, even if obtained from data sources that you have created or from within your organization. Being raw means that the data can be disorganized, may be in various formats, and may contain errors; relative to supporting your analysis, it may also be incomplete and need manual augmentation.

There is a lot of free data in the world, but much data is not free and actually costs significant amounts of money to obtain. Some is freely available through public APIs, and some is available only by subscription. Data you pay for is often cleaner, but this is not always the case.

In either case, pandas provides a robust and easy-to-use set of tools for retrieving data from various sources and in many different formats. pandas also gives us the ability not only to retrieve data, but also to give it an initial structure via pandas data structures, without the complex manual coding that may be required in other tools or programming languages.
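As a minimal sketch of retrieval with initial structuring, the following reads a small block of CSV text into a DataFrame. The data, the column names, and the variable names are made up for illustration; in practice the source would more likely be a file path or a URL.

```python
import io

import pandas as pd

# Hypothetical raw data standing in for a file or remote source.
raw_csv = io.StringIO(
    "date,symbol,close\n"
    "2017-01-03,AAA,10.5\n"
    "2017-01-04,AAA,10.8\n"
    "2017-01-03,BBB,20.1\n"
)

# read_csv parses the text into a DataFrame and, because of parse_dates,
# converts the date column to datetime values while loading.
prices = pd.read_csv(raw_csv, parse_dates=["date"])

print(prices.dtypes)
print(prices)
```

A single call produces a typed, labeled table, which is the initial structuring described above.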

Preparation

During preparation, raw data is made ready for exploration. This preparation is often a very interesting process. It is very frequently the case that raw data is fraught with all kinds of issues related to quality. You will likely spend a lot of time dealing with these quality issues, and often this is a far from trivial amount of time.

Why? Well, there are a number of reasons:

The data is simply incorrect
Parts of the dataset are missing
Data is not represented using measurements appropriate for your analysis
The data is in formats not convenient for your analysis
Data is at a level of detail not appropriate for your analysis
Not all the fields you need are available from a single source
The representation of data differs depending upon the provider

The preparation process focuses on solving these issues. pandas provides many great facilities for preparing data, often referred to as tidying up data. These facilities include intelligent means of handling missing data, converting data types, using format conversion, changing frequencies of measurements, joining data from multiple sets of data, mapping/converting symbols into shared representations, and grouping data, among many others. We will cover all of these in depth.
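A few of these facilities can be sketched together in a short example. Everything here is hypothetical (the sensor readings, the column names, and the choice to fill a gap with the per-sensor mean); it is one possible approach under those assumptions, not a prescribed recipe.

```python
import pandas as pd

# Hypothetical readings: numbers stored as text, with one value missing.
readings = pd.DataFrame({
    "sensor": ["a", "a", "b", "b"],
    "temp_f": ["68.0", None, "77.0", "86.0"],
})

# Convert the text column to numbers; the missing entry becomes NaN.
readings["temp_f"] = pd.to_numeric(readings["temp_f"])

# Handle missing data: fill each gap with the mean of the same sensor.
readings["temp_f"] = readings.groupby("sensor")["temp_f"].transform(
    lambda s: s.fillna(s.mean())
)

# Use a unit of measurement appropriate for the analysis.
readings["temp_c"] = (readings["temp_f"] - 32) * 5 / 9

# Join in a field that is only available from a second source.
locations = pd.DataFrame({"sensor": ["a", "b"], "site": ["north", "south"]})
tidy = readings.merge(locations, on="sensor", how="left")

print(tidy)
```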

Exploration

Exploration involves being able to interactively slice and dice your data to try and make quick discoveries. Exploration can include various tasks such as:

Examining how variables relate to each other
Determining how the data is distributed
Finding and excluding outliers
Creating quick visualizations
Quickly creating new data representations or models to feed into more permanent and detailed modeling processes

Exploration is one of the great strengths of pandas. While exploration can be performed in most programming languages, each has its own level of ceremony—how much non-exploratory effort must be performed—before actually getting to discoveries.

When used with the read-eval-print loop (REPL) nature of IPython and/or Jupyter notebooks, pandas creates an exploratory environment that is almost free of ceremony. The expressiveness of the syntax of pandas lets you describe complex data manipulation constructs succinctly, and the result of every action you take upon your data is immediately presented for your inspection. This allows you to quickly determine the validity of the action you just took without having to recompile and completely rerun your programs.
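To give a feel for how little ceremony is involved, here is a rough sketch of three of the exploration tasks listed earlier. The measurements are invented for illustration, and the cutoff used to exclude the outlier is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical measurements, made up purely for illustration.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [50, 58, 69, 80, 300],  # the last value looks suspicious
})

# How is the data distributed?
summary = df.describe()

# Find and exclude outliers using Boolean selection (arbitrary cutoff).
trimmed = df[df["weight_kg"] < 200]

# How do the variables relate to each other?
correlation = trimmed["height_cm"].corr(trimmed["weight_kg"])

print(summary)
print(trimmed)
print(correlation)
```

In a notebook, each of these statements would typically be run in its own cell, with the result inspected before moving on to the next.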

Modeling

In the modeling stage, you formalize your discoveries found during exploration into an explicit explanation of the steps and data structures required to get to the desired meaning contained within your data. This is the model: a combination of both data structures and the steps in code that take you from the raw data to your information and conclusions.

The modeling process is iterative: through an exploration of the data, you select the variables required to support your analysis, organize the variables for input to analytical processes, execute the model, and determine how well the model supports your original assumptions. It can include a formal modeling of the structure of the data, but can also combine techniques from various analytic domains such as (but not limited to) statistics, machine learning, and operations research.

To facilitate this, pandas provides extensive data modeling facilities. It is in this step that you will move from exploring your data to formalizing the data model in DataFrame objects and ensuring that the processes to create these models are succinct. Additionally, because pandas is based in Python, you get to use the full power of Python to create programs that automate the process from beginning to end. The models you create are executable.

From an analytic perspective, pandas provides several capabilities, most notably integrated support for descriptive statistics, which can get you to your goal for many types of problems. And because pandas is Python-based, if you need more advanced analytic capabilities, it is very easy to integrate with other parts of the extensive Python scientific environment.
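One way to picture an executable model is as a small function that takes raw data and returns descriptive statistics. The sketch below assumes a hypothetical function name, `summarize_returns`, and invented price data; it only illustrates the idea and is not taken from the book.

```python
import pandas as pd


def summarize_returns(prices: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical model: period-over-period change, then summary stats."""
    # The first row has no previous value to compare with, so drop it.
    returns = prices.pct_change().dropna()
    # Descriptive statistics computed for every column at once.
    return returns.agg(["mean", "std"])


# Invented prices for two made-up symbols.
prices = pd.DataFrame({
    "AAA": [100.0, 110.0, 121.0],
    "BBB": [50.0, 50.0, 50.0],
})

stats = summarize_returns(prices)
print(stats)
```

Because the steps live in a function, the same model can be rerun unchanged whenever new data arrives.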

Presentation

The penultimate step of the process is presenting your findings to others, typically in the form of a report or presentation. You will want to create a persuasive and thorough explanation of your solution. This can often be done using various plotting tools in Python and manually creating a presentation.

Jupyter notebooks are a powerful tool for creating presentations of your analyses with pandas. These notebooks provide a means of both executing code and providing rich markdown capabilities to annotate and describe the execution at multiple points in the application. These can be used to create very effective, executable presentations that are visually rich with pieces of code, stylized text, and graphics.

We will explore Jupyter notebooks briefly in Chapter 2, Up and Running with pandas.

Reproduction

An important part of research is sharing it and making it reproducible. It is often said that if other researchers cannot reproduce your experiment and results, then you didn't prove a thing.

Fortunately for you, having used pandas and Python, you will be able to easily make your analysis reproducible. This can be done by sharing the Python code that drives your pandas analysis, as well as the data.

Jupyter notebooks also provide a convenient means of packaging both the code and the application in a form that can be easily shared with anyone else who has a Jupyter Notebook installation. And there are many free and secure sharing sites on the internet that allow you to either create or deploy your Jupyter notebooks for sharing.

A note on being iterative and agile

Something very important to understand about data manipulation, analysis, and science is that it is an iterative process. Although there is a natural forward flow along the stages previously discussed, you will end up going forwards and backwards in the process. For instance, while in the exploration phase you may identify anomalies in the data that relate to data purity issues from the preparation stage, and need to go back and rectify those issues.

This is part of the fun of the process. You are on an adventure to solve your initial problem, all the while gaining incremental insights about the data you are working with. These insights may lead you to ask new questions, to more exact questions, or to a realization that your initial questions were not the actual questions that needed to be asked. The process is truly a journey and not necessarily the destination.
