IPython Interactive Computing and Visualization Cookbook

Digital data now floods in from scientific research, engineering, economics, politics, journalism, business, and many other domains. As a result, analyzing, visualizing, and harnessing data has become the occupation of an increasingly large and diverse set of people. Quantitative skills such as programming, numerical computing, mathematics, statistics, and data mining, which form the core of data science, are increasingly valued across a wide range of fields.

My previous book, Learning IPython for Interactive Computing and Data Visualization (Packt Publishing, 2013), was a beginner-level introduction to data science and numerical computing with Python. This widely used programming language is also one of the most popular platforms for these disciplines.

This book continues that journey by presenting more than 100 advanced recipes for data science and mathematical modeling. These recipes not only cover programming and computing topics such as interactive computing, numerical computing, high-performance computing, parallel computing, and interactive visualization, but also data analysis topics such as statistics, data mining, machine learning, signal processing, and many others.

All of this book's code has been written in the IPython notebook. IPython is at the heart of the Python data analysis platform. Originally created to enhance the default Python console, IPython is now mostly known for its widely acclaimed notebook. This web-based interactive computational environment combines code, rich text, images, mathematical equations, and plots into a single document. It is an ideal gateway to data analysis and high-performance numerical computing in Python.

What this book is

This cookbook contains more than a hundred focused recipes that answer specific questions in numerical computing and data analysis with IPython, such as:

How to explore a public dataset with pandas, PyMC, and SciPy
How to create interactive plots, widgets, and Graphical User Interfaces in the IPython notebook
How to create a configurable IPython extension with custom magic commands
How to distribute asynchronous tasks in parallel with IPython
How to accelerate code with OpenMP, MPI, Numba, Cython, OpenCL, CUDA, and the Julia programming language
How to estimate a probability density from a dataset
How to get started using the R statistical programming language in the notebook
How to train a classifier or a regressor with scikit-learn
How to find interesting projections in a high-dimensional dataset
How to detect faces in an image
How to simulate a reaction-diffusion system
How to compute an itinerary in a road network

Rather than delving into the details of a few methods, this book introduces a wide range of topics. The goal is to give you a taste of the incredibly rich capabilities of Python for data science. All methods are applied to diverse real-world examples.

Every recipe in this book demonstrates not only how to apply a method, but also how and why it works. It is important to understand the mathematical concepts and ideas underlying the methods instead of merely applying them blindly.

Additionally, each recipe comes with many references for the interested reader who wants to know more. As online references change frequently, they will be kept up to date on the book's website (http://ipython-books.github.io).

What this book covers

This book is split into two parts:

Part 1 (chapters 1 to 6) covers advanced methods in interactive numerical computing, high-performance computing, and data visualization.

Part 2 (chapters 7 to 15) introduces standard methods in data science and mathematical modeling. All of these methods are applied to real-world data.

Part 1 – Advanced High-Performance Interactive Computing

Chapter 1, A Tour of Interactive Computing with IPython, contains a brief but intense introduction to data analysis and numerical computing with IPython. It not only covers common packages such as Python, NumPy, pandas, and matplotlib, but also advanced IPython topics such as interactive widgets in the notebook, custom magic commands, configurable IPython extensions, and new language kernels.

Chapter 2, Best Practices in Interactive Computing, details best practices to write reproducible, high-quality code: task automation, version control with Git, workflows with IPython, unit testing with nose, continuous integration, debugging, and other related topics. The importance of these subjects in computational research and data analysis cannot be overstated.

Chapter 3, Mastering the Notebook, covers advanced topics related to the IPython notebook, notably the notebook format, notebook conversions, and CSS/JavaScript customization. The new interactive widgets available since IPython 2.0 are also extensively covered. These techniques make data analysis in the notebook more interactive than ever.

Chapter 4, Profiling and Optimization, covers methods to make your code faster and more efficient: CPU and memory profiling in Python, advanced optimization techniques with NumPy (including large array manipulations), and memory mapping of huge arrays with the HDF5 file format and the PyTables library. These techniques are essential for big data analysis.

Chapter 5, High-performance Computing, covers advanced techniques to make your code much faster: code acceleration with Numba and Cython, wrapping C libraries in Python with ctypes, parallel computing with IPython, OpenMP, and MPI, and General-Purpose Computing on Graphics Processing Units (GPGPU) with CUDA and OpenCL. The chapter ends with an introduction to the recent Julia language, which was designed for high-performance numerical computing and can be easily used in the IPython notebook.

Chapter 6, Advanced Visualization, introduces a few data visualization libraries that go beyond matplotlib in terms of styling or programming interfaces. It also covers interactive visualization in the notebook with Bokeh, mpld3, and D3.js. The chapter ends with an introduction to Vispy, a library that leverages the power of Graphics Processing Units for high-performance interactive visualization of big data.

Part 2 – Standard Methods in Data Science and Applied Mathematics

Chapter 7, Statistical Data Analysis, covers methods for getting insight into data. It introduces classic frequentist and Bayesian methods for hypothesis testing, parametric and nonparametric estimation, and model inference. The chapter leverages Python libraries such as pandas, SciPy, statsmodels, and PyMC. The last recipe introduces the statistical language R, which can be easily used in the IPython notebook.

Chapter 8, Machine Learning, covers methods to learn and make predictions from data. Using the scikit-learn Python package, this chapter illustrates fundamental data mining and machine learning concepts such as supervised and unsupervised learning, classification, regression, feature selection, feature extraction, overfitting, regularization, cross-validation, and grid search. Algorithms addressed in this chapter include logistic regression, Naive Bayes, K-nearest neighbors, Support Vector Machines, random forests, and others. These methods are applied to various types of datasets: numerical data, images, and text.

Chapter 9, Numerical Optimization, is about minimizing or maximizing mathematical functions. This topic is pervasive in data science, notably in statistics, machine learning, and signal processing. This chapter illustrates a few root-finding, minimization, and curve fitting routines with SciPy.

Chapter 10, Signal Processing, is about extracting relevant information from complex and noisy data. These steps are sometimes required prior to running statistical and data mining algorithms. This chapter introduces standard signal processing methods such as Fourier transforms and digital filters.

Chapter 11, Image and Audio Processing, covers signal processing methods for images and sounds. It introduces image filtering, segmentation, computer vision, and face detection with scikit-image and OpenCV. It also presents methods for audio processing and synthesis.

Chapter 12, Deterministic Dynamical Systems, describes dynamical processes underlying particular types of data. It illustrates simulation techniques for discrete-time dynamical systems as well as for ordinary differential equations and partial differential equations.

Chapter 13, Stochastic Dynamical Systems, describes dynamical random processes underlying particular types of data. It illustrates simulation techniques for discrete-time Markov chains, point processes, and stochastic differential equations.

Chapter 14, Graphs, Geometry, and Geographic Information Systems, covers analysis and visualization methods for graphs, social networks, road networks, maps, and geographic data.

Chapter 15, Symbolic and Numerical Mathematics, introduces SymPy, a computer algebra system that brings symbolic computing to Python. The chapter ends with an introduction to Sage, another Python-based system for computational mathematics.

What you need for this book

You need to know the content of this book's prequel, Learning IPython for Interactive Computing and Data Visualization: Python programming, the IPython console and notebook, numerical computing with NumPy, basic data analysis with pandas as well as plotting with matplotlib. This book tackles advanced scientific programming topics that require you to be familiar with the scientific Python ecosystem.

In Part 2, you need to know the basics of calculus, linear algebra, and probability theory. These chapters introduce different topics in data science and applied mathematics (statistics, machine learning, numerical optimization, signal processing, dynamical systems, graph theory, and others). You will understand these recipes better if you know fundamental concepts such as real-valued functions, integrals, matrices, vector spaces, probabilities, and so on.

Installing Python

There are many ways to install Python. We highly recommend the free Anaconda distribution (http://store.continuum.io/cshop/anaconda/). This Python distribution contains most of the packages that we will be using in this book. It also includes a powerful packaging system named conda. The book's website contains all the instructions to install Anaconda and run the code examples. You should learn how to install packages (conda install packagename) and how to create multiple Python environments with conda.

The code of this book has been written for Python 3, but it also works with Python 2.7. More precisely, the code has been tested on Python 3.4.1, Anaconda 2.0.1, and Windows 8.1 64-bit, and it also works on Linux and Mac OS X. We mention compatibility issues where they arise. These issues are rare in this book, because NumPy does the heavy lifting in most cases. NumPy's interface hasn't changed between Python 2 and Python 3.
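As an illustration of what version-agnostic code can look like, here is a minimal sketch that behaves the same on Python 2.7 and Python 3. It is not taken from the book's recipes; the mean function is a hypothetical helper, and the __future__ imports are one common way (an assumption here, not necessarily the book's approach) to smooth over the division and print differences between the two versions.

```python
# Illustrative sketch only: runs unchanged on Python 2.7 and Python 3.
from __future__ import division, print_function

import sys


def mean(values):
    """Return the arithmetic mean using true division on both Python 2 and 3."""
    values = list(values)
    return sum(values) / len(values)


# Report which interpreter is running this code.
print("Python %d.%d" % sys.version_info[:2])

# Without the __future__ import, Python 2 would compute 3 / 2 == 1 here.
print(mean([1, 2]))  # 1.5 on both versions
```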

If you're unsure about which Python version you should use, pick Python 3. You should only pick Python 2 if you really need to (for example, if you absolutely need a Python package that doesn't support Python 3, or if part of your user base is stuck with Python 2). We cover this question in greater detail in Chapter 2, Best Practices in Interactive Computing.

With Anaconda, you can install Python 2 and Python 3 side-by-side using conda environments. This is how you can easily run the couple of recipes in this book that require Python 2.
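When several environments coexist, it can help to confirm which interpreter a console or notebook is actually using. The following is a small sketch using only the standard library; interpreter_summary is a hypothetical helper name introduced for illustration.

```python
import sys


def interpreter_summary():
    """Return the interpreter path, its install prefix, and the Python version."""
    return {
        "executable": sys.executable,  # path of the running Python binary
        "prefix": sys.prefix,          # root directory of the active installation
        "version": "%d.%d.%d" % sys.version_info[:3],
    }


# Print the entries in a stable (alphabetical) order.
for key, value in sorted(interpreter_summary().items()):
    print("%-10s %s" % (key, value))
```

If the prefix printed does not point inside the environment you expected, the console or notebook was probably started from a different environment.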

GitHub repositories

A home page and two GitHub repositories accompany this book:

The main webpage at http://ipython-books.github.io
The main GitHub repository, with the code and references of all recipes, at https://github.com/ipython-books/cookbook-code
Datasets used in certain recipes at https://github.com/ipython-books/cookbook-data

The main GitHub repository is where you can:

Find all code examples as IPython notebooks
Find all up-to-date references
Find up-to-date installation instructions
Report errata, inaccuracies, or mistakes via the issue tracker
Propose fixes via Pull Requests
Add notes, comments, or further references via Pull Requests
Add new recipes via Pull Requests

The online list of references is a particularly important resource. It contains many links to tutorials, courses, books, and videos about the topics covered in this book.

You can also follow updates about the book on my website (http://cyrille.rossant.net) and on my Twitter account (@cyrillerossant).

Who this book is for

This book targets students, researchers, teachers, engineers, data scientists, analysts, journalists, economists, and hobbyists interested in data analysis and numerical computing.

Readers familiar with the scientific Python ecosystem will find many resources to sharpen their skills in high-performance interactive computing with IPython.

Readers who need to implement algorithms for domain-specific applications will appreciate the introductions to a wide variety of topics in data analysis and applied mathematics.

Readers who are new to numerical computing with Python should start with the prequel to this book, Learning IPython for Interactive Computing and Data Visualization, Cyrille Rossant, Packt Publishing, 2013. A second edition is planned for 2015.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Notebooks can be run in an interactive session via %run notebook.ipynb."

A block of code is set as follows:

def do_complete(self, code, cursor_pos):
    return {'status': 'ok',
            'cursor_start': ...,
            'cursor_end': ...,
            'matches': [...]}

Any command-line input or output is written as follows:

from IPython import embed
embed()

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "The simplest option is to launch them from the Clusters tab in the notebook dashboard."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or may have disliked. Reader feedback helps us develop titles that you can really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Downloading the color images

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from the following link: https://www.packtpub.com/sites/default/files/downloads/4818OS_ColoredImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.

By: Cyrille Rossant