Book Image

Learning pandas - Second Edition

By : Michael Heydt
Book Image

Learning pandas - Second Edition

By: Michael Heydt

Overview of this book

You will learn how to use pandas to perform data analysis in Python. You will start with an overview of data analysis and iteratively progress from modeling data, to accessing data from remote sources, performing numeric and statistical analysis, through indexing and performing aggregate analysis, and finally to visualizing statistical data and applying pandas to finance. With the knowledge you gain from this book, you will quickly learn pandas and how it can empower you in the exciting world of data manipulation, analysis and science.
Table of Contents (16 chapters)

Other Python libraries of value with pandas

pandas forms one small, but important, part of the data analysis and data science ecosystem within Python. As a reference, here are a few other important Python libraries worth noting. The list is not exhaustive, but outlines several you will likely come across..

Numeric and scientific computing - NumPy and SciPy

NumPy (http://www.numpy.org/) is the cornerstone toolbox for scientific computing with Python, and is included in most distributions of modern Python. It is actually a foundational toolbox from which pandas was built, and when using pandas you will almost certainly use it frequently. NumPy provides, among other things, support for multidimensional arrays with basic operations on them and useful linear algebra functions.

The use of the array features of NumPy goes hand in hand with pandas, specifically the pandas Series object. Most of our examples will reference NumPy, but the pandas Series functionality is such a tight superset of the NumPy array that we will, except for a few brief situations, not delve into details of NumPy.

SciPy (https://www.scipy.org/) provides a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics, and much more.

Statistical analysis – StatsModels

StatsModels (http://statsmodels.sourceforge.net/) is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator. Researchers across fields may find that Stats Models fully meets their needs for statistical computing and data analysis in Python.

Features include:

  • Linear regression models
  • Generalized linear models
  • Discrete choice models
  • Robust linear models
  • Many models and functions for time series analysis
  • Nonparametric estimators
  • A collection of datasets as examples
  • A wide range of statistical tests
  • Input-output tools for producing tables in a number of formats (text, LaTex, HTML) and for reading Stata files into NumPy and pandas
  • Plotting functions
  • Extensive unit tests to ensure correctness of results

Machine learning – scikit-learn

scikit-learn (http://scikit-learn.org/) is a machine learning library built from NumPy, SciPy, and matplotlib. It offers simple and efficient tools for common tasks in data analysis such as classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

PyMC - stochastic Bayesian modeling

PyMC (https://github.com/pymc-devs/pymc) is a Python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo. Its flexibility and extensibility make it applicable to a large number of problems. Along with core sampling functionality, PyMC includes methods for summarizing output, plotting, goodness of fit, and convergence diagnostics.

Data visualization - matplotlib and seaborn

Python has a rich set of frameworks for data visualization. Two of the most popular are matplotlib and the newer seaborn.

Matplotlib

Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the Jupyter Notebook, web application servers, and four graphical user interface toolkits.

pandas contains very tight integration with matplotlib, including functions as part of Series and DataFrame objects that automatically call matplotlib. This does not mean that pandas is limited to just matplotlib. As we will see, this can be easily changed to others such as ggplot2 and seaborn.

Seaborn

Seaborn (http://seaborn.pydata.org/introduction.html) is a library for making attractive and informative statistical graphics in Python. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for NumPy and pandas data structures and statistical routines from SciPy and StatsModels. It provides additional functionality beyond matplotlib, and also by default demonstrates a richer and more modern visual style than matplotlib.