Learning pandas - Second Edition

By: Michael Heydt

Overview of this book

You will learn how to use pandas to perform data analysis in Python. You will start with an overview of data analysis and iteratively progress from modeling data, to accessing data from remote sources, performing numeric and statistical analysis, through indexing and performing aggregate analysis, and finally to visualizing statistical data and applying pandas to finance. With the knowledge you gain from this book, you will quickly learn pandas and how it can empower you in the exciting world of data manipulation, analysis and science.
Table of Contents (16 chapters)

Introducing pandas

pandas is a Python library containing high-level data structures and tools that have been created to help Python programmers to perform powerful data analysis. The ultimate purpose of pandas is to help you quickly discover information in data, with information being defined as an underlying meaning.

Wes McKinney began developing pandas in 2008, and it was open sourced in 2009. pandas is currently supported and actively developed by various organizations and contributors.

pandas was initially designed with finance in mind, specifically for manipulating time series data and processing historical stock information. Processing financial information poses many challenges, including the following:

  • Representing security data, such as a stock's price, as it changes over time
  • Matching the measurement of multiple streams of data at identical times
  • Determining the relationship (correlation) of two or more streams of data
  • Representing times and dates as first-class entities
  • Converting the period of samples of data, either up or down
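
The challenges listed above can be illustrated with a minimal sketch. The two securities, their prices, and their dates are made up for illustration, and this is only one of several ways to express each step in pandas:

```python
import pandas as pd

# Hypothetical daily closing prices for two made-up securities, A and B.
# Dates are first-class values: the index is a DatetimeIndex, not strings.
dates_a = pd.date_range("2017-01-02", periods=6, freq="D")
dates_b = pd.date_range("2017-01-04", periods=6, freq="D")
stock_a = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], index=dates_a, name="A")
stock_b = pd.Series([20.0, 22.0, 24.0, 26.0, 28.0, 30.0], index=dates_b, name="B")

# Match the two streams at identical times: rows are aligned on the date
# index, and a date present in only one series gets NaN in the other column.
prices = pd.concat([stock_a, stock_b], axis=1)

# Relationship between the two streams, computed over the dates they share.
corr = stock_a.corr(stock_b)

# Convert the sampling period down, from daily values to two-day means.
two_day = prices.resample("2D").mean()

print(prices)
print(corr)
print(two_day)
```

Because the series overlap on only four dates, the combined frame has gaps at both ends, which shows why alignment and missing-data handling go hand in hand.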

To do this processing, a tool was needed that allows us to retrieve, index, clean and tidy, reshape, combine, slice, and perform various analyses on both single- and multidimensional data, including heterogeneous-typed data that is automatically aligned along a set of common index labels. This is where pandas comes in, having been created with many useful and powerful features such as the following:

  • Fast and efficient Series and DataFrame objects for data manipulation with integrated indexing
  • Intelligent data alignment using indexes and labels
  • Integrated handling of missing data
  • Facilities for converting messy data into orderly data (tidying)
  • Built-in tools for reading and writing data between in-memory data structures and files, databases, and web services
  • The ability to process data stored in many common formats such as CSV, Excel, HDF5, and JSON
  • Flexible reshaping and pivoting of sets of data
  • Smart label-based slicing, fancy indexing, and subsetting of large datasets
  • Size mutability: columns can be inserted into and deleted from data structures
  • Aggregating or transforming data with a powerful data grouping facility to perform split-apply-combine on datasets
  • High-performance merging and joining of datasets
  • Hierarchical indexing facilitating working with high-dimensional data in a lower-dimensional data structure
  • Extensive features for time series data, including date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting, and lagging
  • Highly optimized for performance, with critical code paths written in Cython or C
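
A few of the features listed above can be sketched in a handful of lines. The data, column names, and manager names here are hypothetical, and the example is a brief illustration rather than a tour of the full API:

```python
import pandas as pd

# Intelligent data alignment: values are matched by index label, not position.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
total = s1 + s2  # labels "a" and "d" have no partner, so they become NaN

# Integrated handling of missing data: replace the NaN values with 0.
filled = total.fillna(0)

# Hypothetical sales records held in a DataFrame.
sales = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "units": [5, 3, 7, 1],
})

# Size mutability: insert a derived column, then delete it again.
sales["double_units"] = sales["units"] * 2
del sales["double_units"]

# Split-apply-combine: split by region, sum the units, combine the results.
units_by_region = sales.groupby("region")["units"].sum()

# Merging and joining: attach a (made-up) manager to each sales row.
managers = pd.DataFrame({"region": ["east", "west"], "manager": ["Ann", "Bob"]})
merged = sales.merge(managers, on="region")

print(filled)
print(units_by_region)
print(merged)
```

Each step returns a new Series or DataFrame, so the operations chain together naturally in an interactive session.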

The robust feature set, combined with its seamless integration with Python and other tools within the Python ecosystem, has given pandas wide adoption. It is used in a wide variety of academic and commercial domains, including finance, neuroscience, economics, statistics, advertising, and web analytics. It has become one of the most preferred tools for data scientists to represent data for manipulation and analysis.

Python has long been exceptional for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language such as R. This matters because Python is a more general-purpose programming language than R, which is closer to a statistical package: those familiar with Python gain many of R's data representation and manipulation features while remaining entirely within the rich Python ecosystem.

Combined with IPython, Jupyter notebooks, and a wide range of other libraries, the environment for performing data analysis in Python excels in performance, productivity, and the ability to collaborate, compared to many other tools. This has led to the widespread adoption of pandas by many users in many industries.