Python Data Analysis Cookbook

By: Ivan Idris

Overview of this book

Data analysis is a rapidly evolving field, and Python is a multi-paradigm programming language suitable for object-oriented application development and functional design patterns. Because Python offers a wide range of tools and libraries, it has gradually become the primary language for data science, which covers data analysis, visualization, and machine learning. Python Data Analysis Cookbook focuses on reproducibility and creating production-ready systems. You will start with recipes that set the foundation for data analysis with libraries such as matplotlib, NumPy, and pandas. You will learn to create visualizations by choosing color maps and palettes, then dive into statistical data analysis using distribution algorithms and correlations. The book then helps you find your way around different data and numerical problems, get to grips with Spark and HDFS, and set up migration scripts for web mining. You will dive deeper into recipes on spectral analysis, smoothing, and bootstrapping methods. Moving on, you will learn to rank stocks and check market efficiency, then work with metrics and clusters. You will use parallelism with multiple threads to speed up your code and improve system performance. By the end of the book, you will be capable of handling various data analysis techniques in Python and devising solutions for problem scenarios.

Keeping track of package versions and history in IPython Notebook


The IPython Notebook was added to IPython 0.12 in December 2011. Many Pythonistas feel that the IPython Notebook is essential for reproducible data analysis. The IPython Notebook is comparable to commercial products such as Mathematica, MATLAB, and Maple. It is an interactive web browser-based environment. In this recipe, we will see how to keep track of package versions and store IPython sessions in the context of reproducible data analysis. By the way, the IPython Notebook has been renamed Jupyter Notebook.

Getting ready

For this recipe, you will need a recent IPython installation. The instructions to install IPython are at http://ipython.org/install.html (retrieved July 2015). Install it using the pip command (ipython and jupyter are separate package names; list one or both):

$ [sudo] pip install ipython jupyter

If you have installed IPython via Anaconda already, check for updates with the following commands:

$ conda update conda
$ conda update ipython ipython-notebook ipython-qtconsole

I have IPython 3.2.0 as part of the Anaconda distribution.

How to do it...

We will log a Python session and use the watermark extension to track package versions and other information. Start an IPython shell or notebook. When we start a session, we can use the command-line switch --logfile=<file name>.py. In this recipe, we use the %logstart magic function (magic is IPython terminology for its special commands):

In [1]: %logstart cookbook_log.py rotate
Activating auto-logging. Current session state plus future input saved.
Filename       : cookbook_log.py
Mode           : rotate
Output logging : False
Raw input log  : False
Timestamping   : False
State          : active

This example invocation started logging to a file in rotate mode. Both the filename and mode are optional. Turn logging off and back on again as follows:

In [2]: %logoff
Switching logging OFF

In [3]: %logon
Switching logging ON

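Install the watermark magic from GitHub. The %install_ext magic used here is available in IPython 3.x; later IPython releases removed it, and pip install watermark is the alternative there. The command is as follows: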

In [4]: %install_ext https://raw.githubusercontent.com/rasbt/watermark/master/watermark.py

The preceding line downloads a Python file, in my case, to ~/.ipython/extensions/watermark.py. Load the extension by typing the following line:

%load_ext watermark

The extension can place timestamps as well as software and hardware information. Get additional usage documentation and version (I installed watermark 1.2.2) with the following command:

%watermark?

For example, call watermark without any arguments:

In [7]: %watermark
… Omitting time stamp …

CPython 3.4.3
IPython 3.2.0

compiler   : Omitting
system     : Omitting
release    : 14.3.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit

I omitted the timestamp and other information for personal reasons. A more complete example follows with author name (-a), Python and IPython versions (-v), versions of packages specified as a comma-separated string (-p), custom time (-c) in a strftime()-based format, and the version of watermark itself (-w):

In [8]: %watermark -a "Ivan Idris" -v -p numpy,scipy,matplotlib -c '%b %Y' -w
Ivan Idris 'Jul 2015'

CPython 3.4.3
IPython 3.2.0

numpy 1.9.2
scipy 0.15.1
matplotlib 1.4.3
watermark v. 1.2.2
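
A report similar to the one above can be approximated with the standard library alone. The following is a minimal sketch, not the watermark implementation: the function names version_report and system_report are hypothetical, and it assumes Python 3.8 or later, where importlib.metadata is available.

```python
import platform
from importlib import metadata  # standard library since Python 3.8


def version_report(packages):
    """Map each distribution name to its installed version (None if absent)."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # package is not installed in this environment
    return report


def system_report():
    """Collect interpreter and machine details comparable to the listing above."""
    return {
        "implementation": platform.python_implementation(),
        "python": platform.python_version(),
        "system": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
        "interpreter": platform.architecture()[0],
    }


# Print a small report for the packages used in the recipe.
for key, value in system_report().items():
    print("{:<15}: {}".format(key, value))
for name, version in version_report(["numpy", "scipy", "matplotlib"]).items():
    print("{} {}".format(name, version if version is not None else "(not installed)"))
```

Storing such a report next to the results of an analysis serves the same purpose as the watermark output: it records which versions produced them.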

How it works...

The IPython logger writes commands you type to a Python file. Most of the lines are in the following format:

get_ipython().magic('STRING_YOU_TYPED')
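
Because the logged lines follow this pattern, the magic commands of a session can be recovered from a log file with a few lines of Python. The following is a rough sketch under that assumption; the helper name extract_magics and the sample log text are made up for illustration, and logs written by other IPython versions may use a different call format.

```python
import ast
import re

# Matches lines of the form shown above: get_ipython().magic('...')
MAGIC_LINE = re.compile(r"^get_ipython\(\)\.magic\((?P<arg>.+)\)\s*$")


def extract_magics(log_text):
    """Return the magic command strings recorded in an IPython log."""
    magics = []
    for line in log_text.splitlines():
        match = MAGIC_LINE.match(line.strip())
        if match is None:
            continue  # ordinary Python statement, comment, or blank line
        try:
            # Safely evaluate the quoted argument without executing code.
            value = ast.literal_eval(match.group("arg"))
        except (ValueError, SyntaxError):
            continue  # argument is not a plain literal; skip it
        if isinstance(value, str):
            magics.append(value)
    return magics


# Hypothetical log content, for illustration only.
sample_log = """\
# IPython log file

get_ipython().magic('load_ext watermark')
import numpy as np
get_ipython().magic('watermark')
"""

print(extract_magics(sample_log))
```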

You can replay the session with %load <log file>. The logging modes are described in the following table:

Mode     Description
over     This mode overwrites existing log files.
backup   If a log file exists with the same name, the old file is renamed.
append   This mode appends lines to already existing files.
rotate   This mode rotates log files by incrementing numbers, so that log files don't get too big.
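
The four modes can be illustrated with a small stand-alone function. This is a simplified sketch of the behavior described in the table, not IPython's own code: the function name open_log is hypothetical, and the backup and rotation file names (a trailing ~ and numbered suffixes such as .001~) are assumptions made for the example.

```python
import os
import tempfile


def open_log(path, mode="backup"):
    """Open `path` for writing according to one of the logging modes."""
    exists = os.path.isfile(path)
    if mode == "append":
        return open(path, "a", encoding="utf-8")  # keep old lines, add new ones
    if mode == "over":
        return open(path, "w", encoding="utf-8")  # discard the old content
    if mode == "backup":
        if exists:
            os.replace(path, path + "~")  # assumed backup name
        return open(path, "w", encoding="utf-8")
    if mode == "rotate":
        if exists:
            # Find the first unused rotation number.
            count = 1
            while os.path.isfile("{}.{:03d}~".format(path, count)):
                count += 1
            # Shift older logs up by one so that .001~ is the most recent.
            for i in range(count - 1, 0, -1):
                os.replace("{}.{:03d}~".format(path, i),
                           "{}.{:03d}~".format(path, i + 1))
            os.replace(path, "{}.001~".format(path))
        return open(path, "w", encoding="utf-8")
    raise ValueError("unknown logging mode: {!r}".format(mode))


# Start three "sessions" in rotate mode inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "demo_log.py")
    for text in ("first\n", "second\n", "third\n"):
        with open_log(log, "rotate") as handle:
            handle.write(text)
    print(sorted(os.listdir(tmp)))
```

In this sketch, each new session in rotate mode leaves the previous log under a numbered name, so no earlier session is lost and no single file keeps growing.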

We used a custom magic function available on the Internet. The code for the function is in a single Python file and it should be easy for you to follow. If you want different behavior, you just need to modify the file.

See also