Learning pandas - Second Edition

By: Michael Heydt

Overview of this book

You will learn how to use pandas to perform data analysis in Python. You will start with an overview of data analysis and iteratively progress from modeling data, to accessing data from remote sources, performing numeric and statistical analysis, through indexing and performing aggregate analysis, and finally to visualizing statistical data and applying pandas to finance. With the knowledge you gain from this book, you will quickly learn pandas and how it can empower you in the exciting world of data manipulation, analysis and science.

Preface

What this book covers

What you need for this book

pandas and Data Analysis

Introducing pandas

Data manipulation, analysis, science, and pandas

The process of data analysis

Relating the book to the process

Concepts of data and analysis in our tour of pandas

Other Python libraries of value with pandas

Summary

Up and Running with pandas

Installation of Anaconda

IPython and Jupyter Notebook

Introducing the pandas Series and DataFrame

Visualization

Summary

Representing Univariate Data with the Series

Configuring pandas

Creating a Series

The .index and .values properties

The size and shape of a Series

Specifying an index at creation

Heads, tails, and takes

Retrieving values in a Series by label or position

Slicing a Series into subsets

Alignment via index labels

Performing Boolean selection

Re-indexing a Series

Modifying a Series in-place

Summary

Representing Tabular and Multivariate Data with the DataFrame

Configuring pandas

Creating DataFrame objects

Accessing data within a DataFrame

Selecting rows using Boolean selection

Selecting across both rows and columns

Summary

Manipulating DataFrame Structure

Configuring pandas

Renaming columns

Adding new columns with [] and .insert()

Adding columns through enlargement

Adding columns using concatenation

Reordering columns

Replacing the contents of a column

Deleting columns

Appending new rows

Concatenating rows

Adding and replacing rows via enlargement

Removing rows using .drop()

Removing rows using Boolean selection

Removing rows using a slice

Summary

Indexing Data

Configuring pandas

The importance of indexes

The pandas index types

Working with Indexes

Hierarchical indexing

Summary

Categorical Data

Configuring pandas

Creating Categoricals

Renaming categories

Appending new categories

Removing categories

Removing unused categories

Setting categories

Descriptive information of a Categorical

Munging school grades

Summary

Numerical and Statistical Methods

Configuring pandas

Performing numerical methods on pandas objects

Performing statistical processes on pandas objects

Summary

Accessing Data

Configuring pandas

Working with CSV and text/tabular format data

Reading and writing data in Excel format

Reading and writing JSON files

Reading HTML data from the web

Reading and writing HDF5 format files

Accessing CSV data on the web

Reading and writing from/to SQL databases

Reading data from remote data services

Summary

Tidying Up Your Data

Configuring pandas

What is tidying your data?

How to work with missing data

Handling duplicate data

Transforming data

Summary

Combining, Relating, and Reshaping Data

Configuring pandas

Concatenating data in multiple objects

Merging and joining data

Pivoting data to and from value and indexes

Stacking and unstacking

Performance benefits of stacked data

Summary

Data Aggregation

Configuring pandas

The split, apply, and combine (SAC) pattern

Data for the examples

Splitting data

Applying aggregate functions, transforms, and filters

Transforming groups of data

Filtering groups from aggregation

Summary

Time-Series Modelling

Setting up the IPython notebook

Representation of dates, time, and intervals

Introducing time-series data

Calculating new dates using offsets

Representing durations of time using Period

Handling holidays using calendars

Normalizing timestamps using time zones

Manipulating time-series data

Time-series moving-window operations

Summary

Visualization

Configuring pandas

Plotting basics with pandas

Creating time-series charts

Common plots used in statistical analyses

Manually rendering multiple plots in a single chart

Summary

Historical Stock Price Analysis

Setting up the IPython notebook

Obtaining and organizing stock data from Google

Plotting time-series prices

Plotting volume-series data

Calculating the simple daily percentage change in closing price

Calculating simple daily cumulative returns of a stock

Resampling data from daily to monthly returns

Analyzing distribution of returns

Performing a moving-average calculation

Comparison of average daily returns across stocks

Correlation of stocks based on the daily percentage change of the closing price

Calculating the volatility of stocks

Determining risk relative to expected returns

Summary

Introducing pandas

pandas is a Python library of high-level data structures and tools created to help Python programmers perform powerful data analysis. Its ultimate purpose is to help you quickly discover information in data, where information is defined as an underlying meaning.

Wes McKinney began developing pandas in 2008, and it was open-sourced in 2009. pandas is currently supported and actively developed by various organizations and contributors.

pandas was initially designed with finance in mind, specifically for manipulating time series data and processing historical stock information. Processing financial information poses many challenges, including the following:

Representing security data, such as a stock's price, as it changes over time
Matching the measurement of multiple streams of data at identical times
Determining the relationship (correlation) of two or more streams of data
Representing times and dates as first-class entities
Converting the period of samples of data, either up or down
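
As a rough illustration of the challenges listed above, here is a minimal sketch of how they might look in pandas, the library this section goes on to introduce. The two price streams, their dates, and the variable names are hypothetical values invented for the example; they are not data from the book.

```python
import pandas as pd

# Two hypothetical daily price streams observed on overlapping dates
dates_a = pd.date_range("2017-01-02", periods=6, freq="D")
dates_b = pd.date_range("2017-01-04", periods=6, freq="D")
a = pd.Series([10.0, 10.5, 10.2, 10.8, 11.0, 11.3], index=dates_a, name="A")
b = pd.Series([20.0, 20.4, 20.1, 20.9, 21.5, 21.2], index=dates_b, name="B")

# Match the measurements of both streams at identical times:
# an inner join keeps only the dates present in both series
both = pd.concat([a, b], axis=1, join="inner")

# Estimate the relationship (correlation) between the two streams,
# here on their daily percentage changes
corr = both["A"].pct_change().corr(both["B"].pct_change())

# Convert the sampling period down (daily values to 2-day means) ...
down = a.resample("2D").mean()

# ... and back up (2-day values to daily, carrying the last value forward)
up = down.resample("D").ffill()

print(both)
print(corr)
print(down)
print(up)
```

The dates act as first-class index values throughout: the join, the correlation, and the resampling all operate on the timestamps rather than on row positions.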

To do this processing, a tool was needed that allows us to retrieve, index, clean and tidy, reshape, combine, slice, and perform various analyses on both single- and multidimensional data, including heterogeneous-typed data that is automatically aligned along a set of common index labels. This is where pandas comes in, having been created with many useful and powerful features such as the following:

Fast and efficient Series and DataFrame objects for data manipulation with integrated indexing
Intelligent data alignment using indexes and labels
Integrated handling of missing data
Facilities for converting messy data into orderly data (tidying)
Built-in tools for reading and writing data between in-memory data structures and files, databases, and web services
The ability to process data stored in many common formats such as CSV, Excel, HDF5, and JSON
Flexible reshaping and pivoting of sets of data
Smart label-based slicing, fancy indexing, and subsetting of large datasets
Size mutability: columns can be inserted into and deleted from data structures
Aggregating or transforming data with a powerful data grouping facility to perform split-apply-combine on datasets
High-performance merging and joining of datasets
Hierarchical indexing facilitating working with high-dimensional data in a lower-dimensional data structure
Extensive features for time series data, including date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting, and lagging
Highly optimized for performance, with critical code paths written in Cython or C
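
A few of the features listed above can be sketched in a handful of lines. The example below is only an illustration under assumed data: the labels, the values, and the column names "sector" and "ret" are hypothetical. It shows label-based alignment, the handling of missing data that results from it, and a simple split-apply-combine aggregation.

```python
import pandas as pd

# Two Series whose index labels only partly overlap
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Alignment: values are matched by label, not by position;
# labels found in only one Series produce NaN (missing data)
total = s1 + s2

# Missing-data handling: treat an absent value as 0 during the addition
filled = s1.add(s2, fill_value=0)

# Split-apply-combine: group rows by a key, then aggregate each group
df = pd.DataFrame({
    "sector": ["tech", "tech", "energy", "energy"],  # hypothetical grouping key
    "ret": [0.02, 0.04, -0.01, 0.03],                # hypothetical returns
})
mean_by_sector = df.groupby("sector")["ret"].mean()

print(total)
print(filled)
print(mean_by_sector)
```

Note that `total` holds NaN for the labels "a" and "d", which appear in only one of the two Series, while `filled` keeps a value for every label.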

This robust feature set, combined with seamless integration with Python and other tools in the Python ecosystem, has given pandas wide adoption in many academic and commercial domains, including finance, neuroscience, economics, statistics, advertising, and web analytics. It has become one of the preferred tools for data scientists to represent data for manipulation and analysis.

Python has long been exceptional for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language such as R. This matters because Python is a more general-purpose programming language than R, which is closer to a statistical package: those familiar with Python gain many of the data representation and manipulation features of R while remaining entirely within the rich Python ecosystem.

Combined with IPython, Jupyter notebooks, and a wide range of other libraries, the environment for performing data analysis in Python excels in performance, productivity, and the ability to collaborate, compared to many other tools. This has led to the widespread adoption of pandas by many users in many industries.
