Learning pandas - Second Edition

By: Michael Heydt

Overview of this book

You will learn how to use pandas to perform data analysis in Python. You will start with an overview of data analysis and iteratively progress from modeling data, to accessing data from remote sources, performing numeric and statistical analysis, through indexing and performing aggregate analysis, and finally to visualizing statistical data and applying pandas to finance. With the knowledge you gain from this book, you will quickly learn pandas and how it can empower you in the exciting world of data manipulation, analysis and science.

Preface

What this book covers

What you need for this book

pandas and Data Analysis

Introducing pandas

Data manipulation, analysis, science, and pandas

The process of data analysis

Relating the book to the process

Concepts of data and analysis in our tour of pandas

Other Python libraries of value with pandas

Summary

Up and Running with pandas

Installation of Anaconda

IPython and Jupyter Notebook

Introducing the pandas Series and DataFrame

Visualization

Summary

Representing Univariate Data with the Series

Configuring pandas

Creating a Series

The .index and .values properties

The size and shape of a Series

Specifying an index at creation

Heads, tails, and takes

Retrieving values in a Series by label or position

Slicing a Series into subsets

Alignment via index labels

Performing Boolean selection

Re-indexing a Series

Modifying a Series in-place

Summary

Representing Tabular and Multivariate Data with the DataFrame

Configuring pandas

Creating DataFrame objects

Accessing data within a DataFrame

Selecting rows using Boolean selection

Selecting across both rows and columns

Summary

Manipulating DataFrame Structure

Configuring pandas

Renaming columns

Adding new columns with [] and .insert()

Adding columns through enlargement

Adding columns using concatenation

Reordering columns

Replacing the contents of a column

Deleting columns

Appending new rows

Concatenating rows

Adding and replacing rows via enlargement

Removing rows using .drop()

Removing rows using Boolean selection

Removing rows using a slice

Summary

Indexing Data

Configuring pandas

The importance of indexes

The pandas index types

Working with Indexes

Hierarchical indexing

Summary

Categorical Data

Configuring pandas

Creating Categoricals

Renaming categories

Appending new categories

Removing categories

Removing unused categories

Setting categories

Descriptive information of a Categorical

Munging school grades

Summary

Numerical and Statistical Methods

Configuring pandas

Performing numerical methods on pandas objects

Performing statistical processes on pandas objects

Summary

Accessing Data

Configuring pandas

Working with CSV and text/tabular format data

Reading and writing data in Excel format

Reading and writing JSON files

Reading HTML data from the web

Reading and writing HDF5 format files

Accessing CSV data on the web

Reading and writing from/to SQL databases

Reading data from remote data services

Summary

Tidying Up Your Data

Configuring pandas

What is tidying your data?

How to work with missing data

Handling duplicate data

Transforming data

Summary

Combining, Relating, and Reshaping Data

Configuring pandas

Concatenating data in multiple objects

Merging and joining data

Pivoting data to and from value and indexes

Stacking and unstacking

Performance benefits of stacked data

Summary

Data Aggregation

Configuring pandas

The split, apply, and combine (SAC) pattern

Data for the examples

Splitting data

Applying aggregate functions, transforms, and filters

Transforming groups of data

Filtering groups from aggregation

Summary

Time-Series Modelling

Setting up the IPython notebook

Representation of dates, time, and intervals

Introducing time-series data

Calculating new dates using offsets

Representing durations of time using Period

Handling holidays using calendars

Normalizing timestamps using time zones

Manipulating time-series data

Time-series moving-window operations

Summary

Visualization

Configuring pandas

Plotting basics with pandas

Creating time-series charts

Common plots used in statistical analyses

Manually rendering multiple plots in a single chart

Summary

Historical Stock Price Analysis

Setting up the IPython notebook

Obtaining and organizing stock data from Google

Plotting time-series prices

Plotting volume-series data

Calculating the simple daily percentage change in closing price

Calculating simple daily cumulative returns of a stock

Resampling data from daily to monthly returns

Analyzing distribution of returns

Performing a moving-average calculation

Comparison of average daily returns across stocks

Correlation of stocks based on the daily percentage change of the closing price

Calculating the volatility of stocks

Determining risk relative to expected returns

Summary

Data manipulation, analysis, science, and pandas

We live in a world in which massive amounts of data are produced and stored every day. This data comes from a plethora of information systems, devices, and sensors. Almost everything you do, and the items you use to do it, produce data that can be, or is, captured.

This has been enabled by the ubiquity of network-connected services and by great increases in data storage capacity; combined with the ever-decreasing cost of storage, this has made capturing and storing even the most trivial data cost-effective.

This has led to massive amounts of data being piled up and ready for access. But this data is spread out all over cyberspace and cannot actually be referred to as information. It tends to be a collection of recorded events, whether financial, of your interactions with social networks, or of your personal health monitor tracking your heartbeat throughout the day. This data is stored in all kinds of formats, is located in scattered places, and beyond its raw nature does not give much insight.

Logically, the overall process can be broken into three major areas of discipline:

Data manipulation
Data analysis
Data science

These three disciplines can and do overlap a great deal. Where each ends and the others begin is open to interpretation. For the purposes of this book, we will define each as in the following sections.

Data manipulation

Data is distributed all over the planet. It is stored in different formats. It has widely varied levels of quality. Because of this, there is a need for tools and processes that pull data together into a form that can be used for decision making. This requires many different tasks and capabilities from a tool that manipulates data in preparation for analysis. The features needed from such a tool include:

Programmability for reuse and sharing
Access to data from external sources
Storing data locally
Indexing data for efficient retrieval
Alignment of data in different sets based upon attributes
Combining data in different sets
Transformation of data into other representations
Cleaning data from cruft
Effective handling of bad data
Grouping data into common baskets
Aggregation of data of like characteristics
Application of functions to calculate meaning or perform transformations
Query and slicing to explore pieces of the whole
Restructuring into other forms
Modeling distinct categories of data such as categorical, continuous, discrete, and time series
Resampling data to different frequencies

There are many data manipulation tools in existence. Each differs in support for the items on this list, how they are deployed, and how they are utilized by their users. These tools include relational databases (SQL Server, Oracle), spreadsheets (Excel), event processing systems (such as Spark), and more generic tools such as R and pandas.

Data analysis

Data analysis is the process of creating meaning from data. Data with quantified meaning is often called information. Analysis creates information from data by building data models and applying mathematics to find patterns. It often overlaps with data manipulation, and the distinction between the two is not always clear. Many data manipulation tools also contain analysis functions, and data analysis tools often provide data manipulation capabilities.

Data science

Data science is the process of using statistics and data analysis processes to create an understanding of phenomena within data. Data science usually starts with information and applies a more complex domain-based analysis to it. These domains span many fields, such as mathematics, statistics, information science, computer science, machine learning, classification, cluster analysis, data mining, databases, and visualization. Data science is multidisciplinary. Its methods of domain analysis often differ widely and are specific to a particular domain.

Where does pandas fit?

pandas first and foremost excels in data manipulation. All of the needs itemized earlier will be covered in this book using pandas. This is the core of pandas and is most of what we will focus on in this book.

It is worth noting that pandas has a specific design goal: emphasizing data

But pandas does provide several features for performing data analysis. These capabilities typically revolve around descriptive statistics and functions required for finance such as correlations.
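
As a rough sketch of these analysis capabilities, the following computes descriptive statistics and a correlation matrix for a small DataFrame. The ticker-like column names and the prices are invented for illustration and do not come from the book.

```python
import pandas as pd

# Hypothetical closing prices for three made-up tickers.
prices = pd.DataFrame({
    "AAA": [10.0, 11.0, 12.0, 13.0, 14.0],
    "BBB": [20.0, 22.0, 24.0, 26.0, 28.0],
    "CCC": [5.0, 4.0, 3.0, 2.0, 1.0],
})

# Descriptive statistics: count, mean, std, min, quartiles, and max per column.
summary = prices.describe()

# Pairwise Pearson correlation between the columns.
corr = prices.corr()

print(summary)
print(corr)
```

In this contrived data, AAA and BBB rise together and so are perfectly positively correlated, while CCC falls as AAA rises and is perfectly negatively correlated with it.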

Therefore, pandas itself is not a data science toolkit. It is more of a manipulation tool with some analysis capabilities. pandas explicitly leaves complex statistical, financial, and other types of analyses to other Python libraries, such as SciPy, NumPy, and scikit-learn, and leans upon graphics libraries such as matplotlib and ggvis for data visualization.

This focus is actually a strength of pandas over languages such as R, because pandas applications are able to leverage an extensive network of robust Python frameworks already built and tested elsewhere by the Python community.
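
One way to picture this division of labor is to hand pandas data to NumPy, sketched below under the assumption that both libraries are installed. The values and labels are invented for illustration.

```python
import numpy as np
import pandas as pd

# A small labeled Series with made-up values.
s = pd.Series([1.0, 10.0, 100.0], index=["a", "b", "c"])

# NumPy universal functions accept a Series directly and return a Series
# that keeps the original index labels.
logged = np.log10(s)

# When a library expects a plain array, the underlying values can be extracted.
raw = s.to_numpy()

print(logged)
print(type(raw))
```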
