Learning Pandas

By: Michael Heydt

Discretization and Binning

Although discretization does not use the grouping constructs directly, it fits naturally in a chapter on grouping. Discretization slices continuous data into a set of "bins", where each bin represents a range of the continuous sample, and each item is placed into the bin whose range contains it, hence the term "binning". Discretization in pandas is performed using the pd.cut() and pd.qcut() functions.
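As a minimal sketch of how the two functions differ, the following bins a small, made-up array (the values are an assumption for illustration, not data from the book). pd.cut() splits the range of the data into bins of equal width, whereas pd.qcut() chooses bin edges from quantiles so that each bin holds roughly the same number of items:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen so that one outlier stretches the range
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# pd.cut: two bins of equal width across the range of the data
equal_width = pd.cut(values, 2)

# pd.qcut: two bins whose edges are taken from the quantiles of the data
equal_count = pd.qcut(values, 2)

# Count how many items landed in each bin
width_counts = pd.Series(equal_width).value_counts(sort=False)
count_counts = pd.Series(equal_count).value_counts(sort=False)

print(width_counts)
print(count_counts)
```

With this data, the equal-width bins are very unevenly filled because of the single large value, while the quantile-based bins split the items evenly.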

We will look at discretization by generating a large set of normally distributed random numbers, cutting them into bins, and analyzing the contents of those bins. The following generates 10,000 numbers and reports their mean and standard deviation, which we expect to approach 0 and 1 as the sample size grows:

In [48]:
   # generate 10000 normally distributed random numbers
   np.random.seed(123456)
   dist = np.random.normal(size=10000)

   # show the mean and std
   "{0} {1}".format(dist.mean(), dist.std())

Out[48...
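To illustrate the expectation that the mean and standard deviation approach 0 and 1 as the sample grows, here is a small sketch that repeats the calculation for several sample sizes. The seed and the sample sizes are arbitrary choices made for this example and do not come from the book:

```python
import numpy as np

# Seed and sample sizes are arbitrary assumptions for this sketch
rng = np.random.RandomState(0)

stats = {}
for n in (100, 10000, 1000000):
    # draw n values from the standard normal distribution
    sample = rng.normal(size=n)
    stats[n] = (sample.mean(), sample.std())
    print("{0:>8}: mean={1:.4f} std={2:.4f}".format(n, *stats[n]))
```

The estimates fluctuate from one sample to the next, so they do not necessarily improve at every step, but for the largest sample they should sit very close to 0 and 1.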