Learning pandas - Second Edition

By: Michael Heydt

Overview of this book

You will learn how to use pandas to perform data analysis in Python. You will start with an overview of data analysis and iteratively progress from modeling data, to accessing data from remote sources, performing numeric and statistical analysis, through indexing and performing aggregate analysis, and finally to visualizing statistical data and applying pandas to finance. With the knowledge you gain from this book, you will quickly learn pandas and how it can empower you in the exciting world of data manipulation, analysis and science.

Handling duplicate data

The data in your sample can often contain duplicate rows. This is a reality of dealing with data that is collected automatically, and it can also arise when data is collected manually. In these situations, it is often considered best to err on the side of having duplicates rather than missing data, especially if the data can be considered idempotent, that is, if processing the same row more than once does not change the result. However, duplicates increase the size of the dataset, and if the data is not idempotent, it is not appropriate to process the duplicates.

pandas provides the .duplicated() method to facilitate finding duplicate data. This method returns a Boolean Series, where each entry indicates whether or not the row is a duplicate. A True value indicates that the row has appeared earlier in the DataFrame object, with all the column values identical.

The following demonstrates this...
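As a minimal sketch of the behavior described above, assume a small hypothetical DataFrame (the column names `a` and `b` and their values are illustrative, not taken from the book) in which two rows repeat an earlier row. Calling .duplicated() produces the Boolean Series, and inverting it with Boolean selection is one way to keep only the first occurrence of each row:

```python
import pandas as pd

# Hypothetical sample data: the rows at positions 1 and 3 repeat the row at position 0
df = pd.DataFrame({"a": ["x", "x", "y", "x"],
                   "b": [1, 1, 2, 1]})

# True marks a row whose values in every column match an earlier row
dup_mask = df.duplicated()
print(dup_mask)

# Inverting the mask keeps the first occurrence of each distinct row
deduped = df[~dup_mask]
print(deduped)
```

Note that the first occurrence of a repeated row is not flagged; only its later copies are.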