Book Image

Python Data Analysis Cookbook

By : Ivan Idris
Book Image

Python Data Analysis Cookbook

By: Ivan Idris

Overview of this book

Data analysis is a rapidly evolving field and Python is a multi-paradigm programming language suitable for object-oriented application development and functional design patterns. As Python offers a range of tools and libraries for all purposes, it has slowly evolved as the primary language for data science, including topics on: data analysis, visualization, and machine learning. Python Data Analysis Cookbook focuses on reproducibility and creating production-ready systems. You will start with recipes that set the foundation for data analysis with libraries such as matplotlib, NumPy, and pandas. You will learn to create visualizations by choosing color maps and palettes then dive into statistical data analysis using distribution algorithms and correlations. You’ll then help you find your way around different data and numerical problems, get to grips with Spark and HDFS, and then set up migration scripts for web mining. In this book, you will dive deeper into recipes on spectral analysis, smoothing, and bootstrapping methods. Moving on, you will learn to rank stocks and check market efficiency, then work with metrics and clusters. You will achieve parallelism to improve system performance by using multiple threads and speeding up your code. By the end of the book, you will be capable of handling various data analysis techniques in Python and devising solutions for problem scenarios.
Table of Contents (23 chapters)
Python Data Analysis Cookbook
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Glossary
Index

Learning to log for robust error checking


Notebooks are useful to keep track of what you did and what went wrong. Logging works in a similar fashion, and we can log errors and other useful information with the standard Python logging library.

For reproducible data analysis, it is good to know the modules our Python scripts import. In this recipe, I will introduce a minimal API from dautil that logs package versions of imported modules in a best effort manner.

Getting ready

In this recipe, we import NumPy and pandas, so you may need to import them. See the Configuring pandas recipe for pandas installation instructions. Installation instructions for NumPy can be found at http://docs.scipy.org/doc/numpy/user/install.html (retrieved July 2015). Alternatively, install NumPy with pip using the following command:

$ [sudo] pip install numpy

The command for Anaconda users is as follows:

$ conda install numpy

I have installed NumPy 1.9.2 via Anaconda. We also require AppDirs to find the appropriate directory to store logs. Install it with the following command:

$ [sudo] pip install appdirs

I have AppDirs 1.4.0 on my system.

How to do it...

To log, we need to create and set up loggers. We can either set up the loggers with code or use a configuration file. Configuring loggers with code is the more flexible option, but configuration files tend to be more readable. I use the log.conf configuration file from dautil:

[loggers]
keys=root

[handlers]
keys=consoleHandler,fileHandler

[formatters]
keys=simpleFormatter

[logger_root]
level=DEBUG
handlers=consoleHandler,fileHandler

[handler_consoleHandler]
class=StreamHandler
level=INFO
formatter=simpleFormatter
args=(sys.stdout,)

[handler_fileHandler]
class=dautil.log_api.VersionsLogFileHandler
formatter=simpleFormatter
args=('versions.log',)

[formatter_simpleFormatter]
format=%(asctime)s - %(name)s - %(levelname)s - %(message)s
datefmt=%d-%b-%Y

The file configures a logger to log to a file with the DEBUG level and to the screen with the INFO level. So, the logger logs more to the file than to the screen. The file also specifies the format of the log messages. I created a tiny API in dautil, which creates a logger with its get_logger() function and uses it to log the package versions of a client program with its log() function. The code is in the log_api.py file of dautil:

from pkg_resources import get_distribution
from pkg_resources import resource_filename
import logging
import logging.config
import pprint
from appdirs import AppDirs
import os


def get_logger(name):
    log_config = resource_filename(__name__, 'log.conf')
    logging.config.fileConfig(log_config)
    logger = logging.getLogger(name)

    return logger


def shorten(module_name):
    dot_i = module_name.find('.')

    return module_name[:dot_i]


def log(modules, name):
    skiplist = ['pkg_resources', 'distutils']

    logger = get_logger(name)
    logger.debug('Inside the log function')

    for k in modules.keys():
        str_k = str(k)

        if '.version' in str_k:
            short = shorten(str_k)

            if short in skiplist:
                continue

            try:
                logger.info('%s=%s' % (short,    
                            get_distribution(short).version))
            except ImportError:
                logger.warn('Could not impport', short)


class VersionsLogFileHandler(logging.FileHandler):
    def __init__(self, fName):
        dirs = AppDirs("PythonDataAnalysisCookbook", 
                       "Ivan Idris")
        path = dirs.user_log_dir
        print(path)

        if not os.path.exists(path):
            os.mkdir(path)

        super(VersionsLogFileHandler, self).__init__(
              os.path.join(path, fName))

The program that uses the API is in the log_demo.py file in this book's code bundle:

import sys
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from dautil import log_api

log_api.log(sys.modules, sys.argv[0])

How it works...

We configured a handler (VersionsLogFileHandler) that writes to file and a handler (StreamHandler) that displays messages on the screen. StreamHandler is a class in the Python standard library. To configure the format of the log messages, we used the SimpleFormater class from the Python standard library.

The API I made goes through modules listed in the sys.modules variable and tries to get the versions of the modules. Some of the modules are not relevant for data analysis, so we skip them. The log() function of the API logs a DEBUG level message with the debug() method. The info() method logs the package version at INFO level.

See also