Learning Pandas

By: Michael Heydt

Performance benefits of stacked data

Finally, we will examine one reason to stack data like this. A scalar lookup in a stacked Series, using a tuple key against its hierarchical index, can be shown to be more efficient than a lookup through a single-level index followed by a column lookup, and even more efficient than an .iloc lookup that specifies the row and column by position. The following demonstrates this:

In [53]:
   # scalar access on the stacked Series can be a lot faster than
   # row-then-column access on the DataFrame

   # time the different methods
   import timeit
   # r1: tuple lookup in the stacked Series (hierarchical index)
   r1 = timeit.timeit(lambda: stacked1.loc[('one', 'a')], 
                      number=10000)
   # r2: row lookup by label, then column lookup
   r2 = timeit.timeit(lambda: df.loc['one']['a'], 
                      number=10000)
   # r3: lookup by row and column position
   r3 = timeit.timeit(lambda: df.iloc[1, 0], 
                      number=10000)

   # and the results are...  Yes, the stacked lookup (r1) is the fastest of the three
   r1, r2, r3

Out[53]:
   (0.5598540306091309, 1.0486528873443604...
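
To try the comparison outside the notebook, here is a minimal self-contained sketch. The contents of df below are an assumption made for illustration (the listing above relies on df and stacked1 defined earlier in the chapter), and the labels in the timings dictionary are hypothetical names. Absolute and relative timings depend on the pandas version and the machine, so the stacked lookup may not come out fastest on every setup.

```python
import timeit

import pandas as pd

# Hypothetical stand-ins for the chapter's `df` and `stacked1` (assumed data).
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['two', 'one'])
stacked1 = df.stack()  # Series indexed by (row label, column label)

# The three expressions being timed all address the same cell.
v_stacked = stacked1.loc[('one', 'a')]  # tuple lookup in the stacked Series
v_chained = df.loc['one']['a']          # row by label, then column
v_iloc = df.iloc[1, 0]                  # row and column by position

# Time 10,000 repetitions of each lookup, as in the listing above.
timings = {
    'stacked .loc': timeit.timeit(lambda: stacked1.loc[('one', 'a')],
                                  number=10000),
    'row then column': timeit.timeit(lambda: df.loc['one']['a'],
                                     number=10000),
    '.iloc': timeit.timeit(lambda: df.iloc[1, 0],
                           number=10000),
}

# Print the timings from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:>16}: {seconds:.4f} s")
```

Checking that all three expressions return the same value before timing them guards against comparing lookups that address different cells.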