Book Image

Learning Pandas

By : Michael Heydt
Book Image

Learning Pandas

By: Michael Heydt

Overview of this book

Table of Contents (19 chapters)
Learning pandas
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

pandas and why it is important


pandas is a library containing high-level data structures and tools that have been created to assist a Python programmer to perform powerful data manipulations, and discover information in that data in a simple and fast way.

The simple and effective data analysis requires the ability to index, retrieve, tidy, reshape, combine, slice, and perform various analyses on both single and multidimensional data, including heterogeneous typed data that is automatically aligned along index labels. To enable these capabilities, pandas provides the following features (and many more not explicitly mentioned here):

  • High performance array and table structures for representation of homogenous and heterogeneous data sets: the Series and DataFrame objects

  • Flexible reshaping of data structure, allowing the ability to insert and delete both rows and columns of tabular data

  • Hierarchical indexing of data along multiple axes (both rows and columns), allowing multiple labels per data item

  • Labeling of series and tabular data to facilitate indexing and automatic alignment of data

  • Ability to easily identify and fix missing data, both in floating point and as non-floating point formats

  • Powerful grouping capabilities and a functionality to perform split-apply-combine operations on series and tabular data

  • Simple conversion from ragged and differently indexed data of both NumPy and Python data structures to pandas objects

  • Smart label-based slicing and subsetting of data sets, including intuitive and flexible merging, and joining of data with SQL-like constructs

  • Extensive I/O facilities to load and save data from multiple formats including CSV, Excel, relational and non-relational databases, HDF5 format, and JSON

  • Explicit support for time series-specific functionality, providing functionality for date range generation, moving window statistics, time shifting, lagging, and so on

  • Built-in support to retrieve and automatically parse data from various web-based data sources such as Yahoo!, Google Finance, the World Bank, and several others

For those desiring to get into data analysis and the emerging field of data science, pandas offers an excellent means for a Python programmer (or just an enthusiast) to learn data manipulation. For those just learning or coming from a statistical language like R, pandas can offer an excellent introduction to Python as a programming language.

pandas itself is not a data science toolkit. It does provide some statistical methods as a matter of convenience, but to draw conclusions from data, it leans upon other packages in the Python ecosystem, such as SciPy, NumPy, scikit-learn, and upon graphics libraries such as matplotlib and ggvis for data visualization. This is actually the strength of pandas over other languages such as R, as pandas applications are able to leverage an extensive network of robust Python frameworks already built and tested elsewhere.

In this book, we will look at how to use pandas for data manipulation, with a specific focus on gathering, cleaning, and manipulation of various forms of data using pandas. Detailed specifics of data science, finance, econometrics, social network analysis, Python, and IPython are left as reference. You can refer to some other excellent books on these topics already available at https://www.packtpub.com/.