Python Data Analysis Cookbook

By: Ivan Idris
Overview of this book

Data analysis is a rapidly evolving field, and Python is a multi-paradigm programming language suitable for object-oriented application development and functional design patterns. Because Python offers a range of tools and libraries for all purposes, it has gradually become a primary language for data science, covering data analysis, visualization, and machine learning. Python Data Analysis Cookbook focuses on reproducibility and creating production-ready systems. You will start with recipes that set the foundation for data analysis with libraries such as matplotlib, NumPy, and pandas. You will learn to create visualizations by choosing color maps and palettes, and then dive into statistical data analysis using distribution algorithms and correlations. The book will then help you find your way around different data and numerical problems, get to grips with Spark and HDFS, and set up migration scripts for web mining. You will dive deeper into recipes on spectral analysis, smoothing, and bootstrapping methods. Moving on, you will learn to rank stocks and check market efficiency, then work with metrics and clusters. You will improve system performance by achieving parallelism with multiple threads and by speeding up your code. By the end of the book, you will be capable of handling various data analysis techniques in Python and devising solutions for problem scenarios.

Appendix A. Glossary

This appendix is a brief glossary of technical concepts used throughout Python Data Analysis and this book.

American Standard Code for Information Interchange (ASCII) was the dominant encoding standard on the Internet until the end of 2007, with UTF-8 (8-bit Unicode) taking over. ASCII is limited to the English alphabet and has no support for other alphabets.

Analysis of variance (ANOVA) is a statistical data analysis method invented by statistician Ronald Fisher. This method partitions the data of a continuous variable using the values of one or more corresponding categorical variables to analyze variance. ANOVA is a form of linear modeling.
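
A minimal sketch of a one-way ANOVA using SciPy, on three made-up groups of a continuous variable:

```python
import numpy as np
from scipy.stats import f_oneway

# Three made-up samples of a continuous variable, split by a categorical variable
group_a = np.array([5.1, 4.9, 5.4, 5.0, 5.2])
group_b = np.array([5.6, 5.8, 5.5, 5.9, 5.7])
group_c = np.array([5.0, 5.3, 5.1, 5.2, 4.8])

# One-way ANOVA: do the group means differ more than chance would suggest?
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)
```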

Anaconda is a free Python distribution for data analysis and scientific computing. It has its own package manager, conda.

Anscombe's quartet is a classic example that illustrates why visualizing data is important. The quartet consists of four datasets with similar statistical properties. Each dataset has a series of x values and dependent y values.

The bag-of-words model: A simplified model of text, in which text is represented by a bag (a set in which something can occur multiple times) of words. In this representation, the order of the words is ignored. Typically, word counts or the presence of certain words are used as features in this model.
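
A minimal sketch using scikit-learn's CountVectorizer on two made-up sentences; the word counts, not the word order, become the features:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Each document becomes a vector of word counts; word order is discarded
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.vocabulary_)   # word -> column index
print(counts.toarray())         # one row of counts per document
```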

Beta in finance is the slope of a linear regression model involving the returns of an asset and the returns of a benchmark, for instance, the S&P 500 index.
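
A minimal sketch with NumPy, using made-up daily returns; beta is the fitted slope:

```python
import numpy as np

# Hypothetical daily returns of an asset and of a benchmark index
asset_returns = np.array([0.010, -0.020, 0.015, 0.030, -0.010])
benchmark_returns = np.array([0.008, -0.015, 0.010, 0.020, -0.005])

# Fit a line: asset_returns ~ beta * benchmark_returns + alpha
beta, alpha = np.polyfit(benchmark_returns, asset_returns, 1)
print(beta)
```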

Caching involves storing results, usually from a function call, in memory or on disk. If done correctly, caching helps by reducing the number of function calls. In general, we want to keep the cache small for space reasons.
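
A minimal sketch using the standard library's functools.lru_cache, which keeps a bounded in-memory cache of function results:

```python
from functools import lru_cache

@lru_cache(maxsize=128)  # keep the cache small for space reasons
def fibonacci(n):
    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)

print(fibonacci(30))           # repeated subcalls are served from the cache
print(fibonacci.cache_info())  # hits, misses, and current cache size
```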

A clique is a subgraph that is complete. This is analogous to the everyday notion of a clique: a group in which every person knows all the other people.

Clustering aims to partition data into groups called clusters. Clustering is unsupervised in the sense that the training data is not labeled. Some clustering algorithms require a guess for the number of clusters, while other algorithms don't.

Cohen's kappa measures the agreement between the target and predicted classes (in the context of classification). It is similar to accuracy, but it also takes into account the chance of getting the predictions right at random. Kappa varies between negative values and one.
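
A minimal sketch with scikit-learn, on made-up target and predicted labels:

```python
from sklearn.metrics import cohen_kappa_score

y_true = [0, 1, 1, 0, 1, 1, 0, 0]  # target classes (made up)
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]  # predicted classes (made up)

print(cohen_kappa_score(y_true, y_pred))
```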

A complete graph is a graph in which every pair of nodes is connected by a unique edge.

The confusion matrix is a table usually used to summarize the results of classification. The two dimensions of the table are the predicted class and the target class.
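
A minimal sketch with scikit-learn; rows correspond to the target class and columns to the predicted class:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 0, 1, 1, 0, 0]  # target classes (made up)
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]  # predicted classes (made up)

print(confusion_matrix(y_true, y_pred))
```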

Contingency table: A table containing counts for all combinations of the two categorical variables.
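
A minimal sketch with pandas.crosstab on two made-up categorical variables:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M"],
    "smoker": ["yes", "no", "yes", "no", "no", "yes"],
})

# Counts for every combination of the two categories
print(pd.crosstab(df["gender"], df["smoker"]))
```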

The cosine similarity is a common distance metric to measure the similarity of two documents. For this metric, we need to compute the inner product of two feature vectors. The cosine similarity of vectors corresponds to the cosine of the angle between vectors, hence the name.
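
A minimal NumPy sketch on two made-up feature vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 0.0, 3.0])
b = np.array([2.0, 1.0, 1.0, 3.0])

# Inner product divided by the product of the vector norms
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)
```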

Cross-correlation measures the correlation between two signals using a sliding inner product. We can use cross-correlation to measure the time delay between two signals.
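
A minimal NumPy sketch that estimates the delay between a made-up signal and a shifted copy of it:

```python
import numpy as np

t = np.linspace(0, 1, 200)
signal_a = np.sin(2 * np.pi * 5 * t)
signal_b = np.roll(signal_a, 15)  # a copy delayed by 15 samples

# Sliding inner product of the two signals
xcorr = np.correlate(signal_b, signal_a, mode="full")

# The index of the peak gives the estimated delay in samples
delay = xcorr.argmax() - (len(signal_a) - 1)
print(delay)
```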

The Data Science Toolbox (DST) is a virtual environment based on Ubuntu for data analysis using Python and R. Since DST is a virtual environment, we can install it on various operating systems.

The discrete cosine transform (DCT) is a transform similar to the Fourier transform, but it tries to represent a signal by a sum of cosine terms only.
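
A minimal sketch with SciPy on a made-up signal, showing that the inverse DCT reconstructs it:

```python
import numpy as np
from scipy.fftpack import dct, idct

t = np.linspace(0, 1, 128)
signal = np.cos(2 * np.pi * 4 * t) + 0.5 * np.cos(2 * np.pi * 10 * t)

# Represent the signal as a sum of cosine terms
coeffs = dct(signal, norm="ortho")

# The inverse transform recovers the original signal
reconstructed = idct(coeffs, norm="ortho")
print(np.allclose(signal, reconstructed))
```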

The efficient-market hypothesis (EMH) stipulates that you can't, on average, "beat the market" by picking better stocks or timing the market. According to the EMH, all information about the market is immediately available to every market participant in one form or another and is immediately reflected in asset prices.

Eigenvalues are scalar solutions to the equation Ax = ax, where A is a two-dimensional matrix and x is a one-dimensional vector.

Eigenvectors are vectors corresponding to eigenvalues.
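
A minimal NumPy check of both definitions on a made-up matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# eigvals holds the eigenvalues; the columns of eigvecs are the eigenvectors
eigvals, eigvecs = np.linalg.eig(A)

# Verify Ax = ax for the first eigenpair
x = eigvecs[:, 0]
print(np.allclose(A @ x, eigvals[0] * x))
```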

Exponential smoothing is a low-pass filter, which aims to remove noise.
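
A minimal sketch using the exponentially weighted moving average in pandas on a made-up noisy series:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
noisy = pd.Series(np.sin(np.linspace(0, 10, 200)) + np.random.normal(0, 0.3, 200))

# Smaller alpha gives stronger smoothing (a stronger low-pass effect)
smoothed = noisy.ewm(alpha=0.1).mean()
print(smoothed.head())
```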

Face detection tries to find (rectangular) areas in an image that represent faces.

Fast Fourier transform (FFT): A fast algorithm to compute Fourier transforms. FFT is O(N log N), which is a huge improvement on older algorithms.
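
A minimal NumPy sketch that recovers the dominant frequency of a made-up 8 Hz sine:

```python
import numpy as np

t = np.linspace(0, 1, 256, endpoint=False)  # 1 second sampled at 256 Hz
signal = np.sin(2 * np.pi * 8 * t)

# FFT of a real signal; rfftfreq gives the matching frequencies in Hz
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=t[1] - t[0])

# The dominant frequency should be close to 8 Hz
print(freqs[np.abs(spectrum).argmax()])
```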

Filtering is a type of signal processing technique, involving the removal or suppression of part of the signal. Many filter types exist, including the median and Wiener filters.

Fourier analysis is based on the Fourier series, named after the mathematician Joseph Fourier. The Fourier series is a mathematical method to represent functions as an infinite series of sine and cosine terms. The functions in question can be real or complex valued.

Genetic algorithms are based on the biological theory of evolution. This type of algorithm is useful for searching and optimization.

GPUs (graphics processing units) are specialized circuits used to display graphics efficiently. Recently, GPUs have been used to perform massively parallel computations (for instance, to train neural networks).

Hadoop Distributed File System (HDFS) is the storage component of the Hadoop framework for big data. HDFS is a distributed filesystem, which spreads data on multiple systems, and is inspired by Google File System, used by Google for its search engine.

A hive plot is a visualization technique for plotting network graphs. In hive plots, we draw edges as curved lines. We group nodes by some property and display them on radial axes.

Influence plots take into account residuals, influence, and leverage for individual data points, similar to bubble plots. The size of the residuals is plotted on the vertical axis and can indicate that a data point is an outlier.

Jackknifing is a deterministic algorithm to estimate confidence intervals. It falls under the family of resampling algorithms. Usually, we generate new datasets under the jackknifing algorithm by deleting one value (we can also delete two or more values).
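
A minimal NumPy sketch of the leave-one-out jackknife for the mean of a made-up sample:

```python
import numpy as np

data = np.array([2.3, 1.9, 2.8, 2.5, 3.1, 2.0, 2.7])
n = len(data)

# Recompute the statistic with each value deleted once
jackknife_means = np.array([np.delete(data, i).mean() for i in range(n)])

# Jackknife estimate and its standard error
estimate = jackknife_means.mean()
std_err = np.sqrt((n - 1) * np.mean((jackknife_means - estimate) ** 2))
print(estimate, std_err)
```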

JSON (JavaScript Object Notation) is a data format. In this format, data is written down using JavaScript notation. JSON is more succinct than other data formats, such as XML.

K-fold cross-validation is a form of cross-validation involving k (a small integer number) random data partitions called folds. In k iterations, each fold is used once for validation, and the rest of the data is used for training. The results of the iterations can be combined at the end.
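
A minimal sketch with scikit-learn, using 5 folds on the bundled Iris dataset and a logistic regression classifier:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Each of the 5 folds is used once for validation, the rest for training
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)
print(scores, scores.mean())
```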

Linear discriminant analysis (LDA) is an algorithm that looks for a linear combination of features in order to distinguish between classes. It can be used for classification or dimensionality reduction by projecting to a lower-dimensional subspace.

Learning curve: A way to visualize the behavior of a learning algorithm. It is a plot of training and test scores for a range of training data sizes.

Logarithmic plots (or log plots) are plots that use a logarithmic scale. This type of plot is useful when the data varies a lot, because it can display several orders of magnitude clearly.

Logistic regression is a type of a classification algorithm. This algorithm can be used to predict probabilities associated with a class or an event occurring. Logistic regression is based on the logistic function, which has output values in the range from zero to one, just like in probabilities. The logistic function can therefore be used to transform arbitrary values into probabilities.
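
A minimal NumPy sketch of the logistic function itself:

```python
import numpy as np

def logistic(x):
    # Maps any real value into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(logistic(np.array([-5.0, 0.0, 5.0])))
```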

The Lomb-Scargle periodogram is a frequency spectrum estimation method that fits sines to data, and it is frequently used with unevenly sampled data. The method is named after Nicholas R. Lomb and Jeffrey D. Scargle.

The Matthews correlation coefficient (MCC) or phi coefficient is an evaluation metric for binary classification invented by Brian Matthews in 1975. The MCC is a correlation coefficient between targets and predictions, and it varies between -1 (total disagreement) and 1 (perfect agreement).
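
A minimal sketch with scikit-learn on made-up binary labels:

```python
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 0, 0, 1, 0, 1, 0]  # target classes (made up)
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]  # predicted classes (made up)

print(matthews_corrcoef(y_true, y_pred))
```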

Memory leaks are a common issue in computer programs, which we can find by performing memory profiling. Leaks occur when we don't release memory that is no longer needed.

Moore's law is the observation that the number of transistors in a modern computer chip doubles every two years. This trend has continued since Moore's law was formulated, around 1970. There is also a second law, known as Rock's law, which states that the cost of R&D and manufacturing of integrated circuits increases exponentially.

Named-entity recognition (NER) tries to detect names of persons, organizations, locations, and others in text. Some NER systems are almost as good as humans, but it is not an easy task. Named entities usually start with upper case, such as Ivan. We should therefore not change the case of words when applying NER.

Object-relational mapping (ORM): A software architecture pattern for translation between database schemas and object-oriented programming languages.

Open Computing Language (OpenCL), initially developed by Apple Inc., is an open technology standard for programs that can run on a variety of devices, including the CPUs and GPUs available on commodity hardware.

OpenCV (Open Source Computer Vision) is a library for computer vision created in 2000 and currently maintained by Itseez. OpenCV is written in C++, but it also has bindings to Python and other programming languages.

Opinion mining or sentiment analysis is a research field with the goal of efficiently finding and evaluating opinions and sentiment in text.

Principal component analysis (PCA), invented by Karl Pearson in 1901, is an algorithm that transforms data into uncorrelated orthogonal features called principal components. The principal components are the eigenvectors of the covariance matrix.
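
A minimal NumPy sketch on made-up correlated data, computing the components as eigenvectors of the covariance matrix:

```python
import numpy as np

np.random.seed(0)
X = np.random.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=200)

# Principal components are the eigenvectors of the covariance matrix
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Order the components by explained variance (largest eigenvalue first)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]

# Project the centered data onto the uncorrelated components
projected = (X - X.mean(axis=0)) @ components
print(np.round(np.cov(projected, rowvar=False), 3))  # near-diagonal covariance
```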

The Poisson distribution is named after the French mathematician Poisson, who published it in 1837. The Poisson distribution is a discrete distribution usually associated with counts for a fixed interval of time or space.

Robust regression is designed to deal better with outliers in data than ordinary regression. This type of regression uses special robust estimators.

Scatter plot: A two-dimensional plot showing the relationship between two variables in a Cartesian coordinate system. The values of one variable are represented on one axis and the values of the other variable on the other axis. We can quickly visualize correlation this way.

In the shared-nothing architecture, computing nodes don't share memory or files. The architecture is therefore totally decentralized, with completely independent nodes. The obvious advantage is that we are not dependent on any one node. The first commercial shared-nothing databases were created in the 1980s.

Signal processing is a field of engineering and applied mathematics that deals with the analysis of analog and digital signals corresponding to variables that vary with time.

Structured Query Language (SQL) is a specialized language for relational database querying and manipulation. This includes creating, inserting rows in, and deleting tables.

Short-time Fourier transform (STFT): The STFT splits a signal in the time domain into equal parts and then applies the FFT to each segment.

Stop words: Common words with low information value. Stop words are usually removed before analyzing text. Although filtering stop words is common practice, there is no standard definition of stop words.

The Spearman rank correlation uses ranks to correlate two variables with the Pearson correlation. Ranks are the positions of values in sorted order. Items with equal values get a rank, which is the average of their positions. For instance, if we have two items of equal value assigned positions 2 and 3, the rank is 2.5 for both items.
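
A minimal sketch with SciPy on made-up data containing a tie:

```python
import numpy as np
from scipy.stats import spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 4.0, 8.5, 9.0])  # the tied values share the average rank 2.5

rho, p_value = spearmanr(x, y)
print(rho, p_value)
```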

Spectral clustering is a clustering technique that can be used to segment images.

The star schema is a database pattern that facilitates reporting. Star schemas are appropriate for the processing of events such as website visits, ad clicks, or financial transactions. Event information (metrics such as temperature or purchase amount) is stored in fact tables linked to much smaller dimension tables. Star schemas are denormalized, which places the responsibility of integrity checks on the application code. For this reason, we should only write to the database in a controlled manner.

Term frequency-inverse document frequency (tf-idf) is a metric measuring the importance of a word in a corpus. It is composed of a term frequency number and an inverse document frequency number. The term frequency counts the number of times a word occurs in a document. The inverse document frequency counts the number of documents in which the word occurs and takes the inverse of the number.
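
A minimal sketch with scikit-learn's TfidfVectorizer on three made-up documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Words frequent in a document but rare in the corpus get the highest weights
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
print(tfidf.toarray().round(2))
```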

Time series: An ordered list of data points, starting with the oldest measurements. Usually, each data point has a related timestamp.

Violin plots combine box plots and kernel-density plots or histograms in one type of plot.

Winsorising is a technique to deal with outliers and is named after Charles Winsor. In effect, Winsorising clips outliers to given percentiles in a symmetric fashion.
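
A minimal sketch with SciPy on a made-up sample containing one outlier:

```python
import numpy as np
from scipy.stats.mstats import winsorize

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # 100 is an outlier

# Clip the lowest and highest 10% of values toward the remaining data
clipped = winsorize(data, limits=[0.1, 0.1])
print(clipped)
```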