Python Data Analysis Cookbook

By: Ivan Idris
Overview of this book

Data analysis is a rapidly evolving field, and Python is a multi-paradigm programming language suitable for object-oriented application development and functional design patterns. Because Python offers a range of tools and libraries for all purposes, it has gradually become a primary language for data science, covering data analysis, visualization, and machine learning. Python Data Analysis Cookbook focuses on reproducibility and creating production-ready systems. You will start with recipes that set the foundation for data analysis with libraries such as matplotlib, NumPy, and pandas. You will learn to create visualizations by choosing color maps and palettes, and then dive into statistical data analysis using distribution algorithms and correlations. The book will then help you find your way around different data and numerical problems, get to grips with Spark and HDFS, and set up migration scripts for web mining. You will dive deeper into recipes on spectral analysis, smoothing, and bootstrapping methods. Moving on, you will learn to rank stocks and check market efficiency, then work with metrics and clusters. You will improve system performance by achieving parallelism with multiple threads and by speeding up your code. By the end of the book, you will be capable of handling various data analysis techniques in Python and devising solutions for problem scenarios.

Appendix A. Glossary

This appendix is a brief glossary of technical concepts used throughout Python Data Analysis and this book.

American Standard Code for Information Interchange (ASCII) was the dominant encoding standard on the Internet until the end of 2007, with UTF-8 (8-bit Unicode) taking over. ASCII is limited to the English alphabet and has no support for other alphabets.

Analysis of variance (ANOVA) is a statistical data analysis method invented by statistician Ronald Fisher. This method partitions the data of a continuous variable using the values of one or more corresponding categorical variables to analyze variance. ANOVA is a form of linear modeling.
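
A minimal sketch of a one-way ANOVA using SciPy, on three made-up groups of a continuous variable:

```python
import numpy as np
from scipy.stats import f_oneway

# Three made-up samples of a continuous variable, split by a categorical variable
group_a = np.array([5.1, 4.9, 5.4, 5.0, 5.2])
group_b = np.array([5.6, 5.8, 5.5, 5.9, 5.7])
group_c = np.array([5.0, 5.3, 5.1, 5.2, 4.8])

# One-way ANOVA: do the group means differ more than chance would suggest?
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)
```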

Anaconda is a free Python distribution for data analysis and scientific computing. It has its own package manager, conda.

Anscombe's quartet is a classic example that illustrates why visualizing data is important. The quartet consists of four datasets with similar statistical properties. Each dataset has a series of x values and dependent y values.

The bag-of-words model: A simplified model of text, in which text is represented by a bag (a set in which something can occur multiple times) of words. In this representation, the order of the words is ignored. Typically, word counts or the presence of certain words are used as features in this model.
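
A minimal sketch using scikit-learn's CountVectorizer on two made-up sentences; the word counts, not the word order, become the features:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Each document becomes a vector of word counts; word order is discarded
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.vocabulary_)   # word -> column index
print(counts.toarray())         # one row of counts per document
```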

Beta in finance is the slope of a linear regression model involving the returns of an asset and the returns of a benchmark, for instance, the S&P 500 index.
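
A minimal sketch with NumPy, using made-up daily returns; beta is the fitted slope:

```python
import numpy as np

# Hypothetical daily returns of an asset and of a benchmark index
asset_returns = np.array([0.010, -0.020, 0.015, 0.030, -0.010])
benchmark_returns = np.array([0.008, -0.015, 0.010, 0.020, -0.005])

# Fit a line: asset_returns ~ beta * benchmark_returns + alpha
beta, alpha = np.polyfit(benchmark_returns, asset_returns, 1)
print(beta)
```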

Caching involves storing results, usually from a function call, in memory or on disk. If done correctly, caching helps by reducing the number of function calls. In general, we want to keep the cache small for space reasons.
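
A minimal sketch using the standard library's functools.lru_cache, which keeps a bounded in-memory cache of function results:

```python
from functools import lru_cache

@lru_cache(maxsize=128)  # keep the cache small for space reasons
def fibonacci(n):
    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)

print(fibonacci(30))           # repeated subcalls are served from the cache
print(fibonacci.cache_info())  # hits, misses, and current cache size
```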

A clique is a subgraph that is complete. This is analogous to the everyday notion of a clique: a group in which every person knows all the other people.

Clustering aims to partition data into groups called clusters. Clustering is unsupervised in the sense that the training data is not labeled. Some clustering algorithms require a guess for the number of clusters, while other algorithms don't.

Cohen's kappa measures the agreement between the target and predicted classes (in the context of classification). It is similar to accuracy, but it also takes into account the chance of getting the predictions right at random. Kappa varies between negative values and one.
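
A minimal sketch with scikit-learn, on made-up target and predicted labels:

```python
from sklearn.metrics import cohen_kappa_score

y_true = [0, 1, 1, 0, 1, 1, 0, 0]  # target classes (made up)
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]  # predicted classes (made up)

print(cohen_kappa_score(y_true, y_pred))
```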

A complete graph is a graph in which every pair of nodes is connected by a unique edge.

The confusion matrix is a table usually used to summarize the results of classification. The two dimensions of the table are the predicted class and the target class.
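
A minimal sketch with scikit-learn; rows correspond to the target class and columns to the predicted class:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 0, 1, 1, 0, 0]  # target classes (made up)
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]  # predicted classes (made up)

print(confusion_matrix(y_true, y_pred))
```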

Contingency table: A table containing counts for all combinations of the two categorical variables.
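
A minimal sketch with pandas.crosstab on two made-up categorical variables:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M"],
    "smoker": ["yes", "no", "yes", "no", "no", "yes"],
})

# Counts for every combination of the two categories
print(pd.crosstab(df["gender"], df["smoker"]))
```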

The cosine similarity is a common distance metric to measure the similarity of two documents. For this metric, we need to compute the inner product of two feature vectors. The cosine similarity of vectors corresponds to the cosine of the angle between vectors, hence the name.
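
A minimal NumPy sketch on two made-up feature vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 0.0, 3.0])
b = np.array([2.0, 1.0, 1.0, 3.0])

# Inner product divided by the product of the vector norms
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)
```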

Cross-correlation measures the correlation between two signals using a sliding inner product. We can use cross-correlation to measure the time delay between two signals.
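
A minimal NumPy sketch that estimates the delay between a made-up signal and a shifted copy of it:

```python
import numpy as np

t = np.linspace(0, 1, 200)
signal_a = np.sin(2 * np.pi * 5 * t)
signal_b = np.roll(signal_a, 15)  # a copy delayed by 15 samples

# Sliding inner product of the two signals
xcorr = np.correlate(signal_b, signal_a, mode="full")

# The index of the peak gives the estimated delay in samples
delay = xcorr.argmax() - (len(signal_a) - 1)
print(delay)
```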

The Data Science Toolbox (DST) is a virtual environment based on Ubuntu for data analysis using Python and R. Since DST is a virtual environment, we can install it on various operating systems.

The discrete cosine transform (DCT) is a transform similar to the Fourier transform, but it tries to represent a signal by a sum of cosine terms only.
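
A minimal sketch with SciPy on a made-up signal, showing that the inverse DCT reconstructs it:

```python
import numpy as np
from scipy.fftpack import dct, idct

t = np.linspace(0, 1, 128)
signal = np.cos(2 * np.pi * 4 * t) + 0.5 * np.cos(2 * np.pi * 10 * t)

# Represent the signal as a sum of cosine terms
coeffs = dct(signal, norm="ortho")

# The inverse transform recovers the original signal
reconstructed = idct(coeffs, norm="ortho")
print(np.allclose(signal, reconstructed))
```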

The efficient-market hypothesis (EMH) stipulates that you can't, on average, "beat the market" by picking better stocks or timing the market. According to the EMH, all information about the market is immediately available to every market participant in one form or another and is immediately reflected in asset prices.

Eigenvalues are scalar solutions to the equation Ax = ax, where A is a two-dimensional matrix and x is a one-dimensional vector.

Eigenvectors are vectors corresponding to eigenvalues.
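
A minimal NumPy check of both definitions on a made-up matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# eigvals holds the eigenvalues; the columns of eigvecs are the eigenvectors
eigvals, eigvecs = np.linalg.eig(A)

# Verify Ax = ax for the first eigenpair
x = eigvecs[:, 0]
print(np.allclose(A @ x, eigvals[0] * x))
```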

Exponential smoothing is a low-pass filter, which aims to remove noise.
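
A minimal sketch using the exponentially weighted moving average in pandas on a made-up noisy series:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
noisy = pd.Series(np.sin(np.linspace(0, 10, 200)) + np.random.normal(0, 0.3, 200))

# Smaller alpha gives stronger smoothing (a stronger low-pass effect)
smoothed = noisy.ewm(alpha=0.1).mean()
print(smoothed.head())
```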

Face detection tries to find (rectangular) areas in an image that represent faces.

Fast Fourier transform (FFT): A fast algorithm to compute Fourier transforms. FFT is O(N log N), which is a huge improvement on older algorithms.
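
A minimal NumPy sketch that recovers the dominant frequency of a made-up 8 Hz sine:

```python
import numpy as np

t = np.linspace(0, 1, 256, endpoint=False)  # 1 second sampled at 256 Hz
signal = np.sin(2 * np.pi * 8 * t)

# FFT of a real signal; rfftfreq gives the matching frequencies in Hz
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=t[1] - t[0])

# The dominant frequency should be close to 8 Hz
print(freqs[np.abs(spectrum).argmax()])
```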

Filtering is a type of signal processing technique, involving the removal or suppression of part of the signal. Many filter types exist, including the median and Wiener filters.

Fourier analysis is based on the Fourier series, named after the mathematician Joseph Fourier. The Fourier series is a mathematical method to represent functions as an infinite series of sine and cosine terms. The functions in question can be real or complex valued.

Genetic algorithms are based on the biological theory of evolution. This type of algorithm is useful for searching and optimization.

GPUs (graphics processing units) are specialized circuits used to display graphics efficiently. Recently, GPUs have been used to perform massively parallel computations (for instance, to train neural networks).

Hadoop Distributed File System (HDFS) is the storage component of the Hadoop framework for big data. HDFS is a distributed filesystem, which spreads data on multiple systems, and is inspired by Google File System, used by Google for its search engine.

A hive plot is a visualization technique for plotting network graphs. In hive plots, we draw edges as curved lines. We group nodes by some property and display them on radial axes.

Influence plots take into account residuals, influence, and leverage for individual data points, similar to bubble plots. The size of the residuals is plotted on the vertical axis and can indicate that a data point is an outlier.

Jackknifing is a deterministic algorithm to estimate confidence intervals. It falls under the family of resampling algorithms. Usually, we generate new datasets under the jackknifing algorithm by deleting one value (we can also delete two or more values).
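
A minimal NumPy sketch of the leave-one-out jackknife for the mean of a made-up sample:

```python
import numpy as np

data = np.array([2.3, 1.9, 2.8, 2.5, 3.1, 2.0, 2.7])
n = len(data)

# Recompute the statistic with each value deleted once
jackknife_means = np.array([np.delete(data, i).mean() for i in range(n)])

# Jackknife estimate and its standard error
estimate = jackknife_means.mean()
std_err = np.sqrt((n - 1) * np.mean((jackknife_means - estimate) ** 2))
print(estimate, std_err)
```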

JSON (JavaScript Object Notation) is a data format. In this format, data is written down using JavaScript notation. JSON is more succinct than other data formats, such as XML.

K-fold cross-validation is a form of cross-validation involving k (a small integer number) random data partitions called folds. In k iterations, each fold is used once for validation, and the rest of the data is used for training. The results of the iterations can be combined at the end.
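
A minimal sketch with scikit-learn, using 5 folds on the bundled Iris dataset and a logistic regression classifier:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Each of the 5 folds is used once for validation, the rest for training
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)
print(scores, scores.mean())
```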

Linear discriminant analysis (LDA) is an algorithm that looks for a linear combination of features in order to distinguish between classes. It can be used for classification or dimensionality reduction by projecting to a lower-dimensional subspace.

Learning curve: A way to visualize the behavior of a learning algorithm. It is a plot of training and test scores for a range of training data sizes.

Logarithmic plots (or log plots) are plots that use a logarithmic scale. This type of plot is useful when the data varies a lot, because it can display several orders of magnitude clearly.

Logistic regression is a type of a classification algorithm. This algorithm can be used to predict probabilities associated with a class or an event occurring. Logistic regression is based on the logistic function, which has output values in the range from zero to one, just like in probabilities. The logistic function can therefore be used to transform arbitrary values into probabilities.
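
A minimal NumPy sketch of the logistic function itself:

```python
import numpy as np

def logistic(x):
    # Maps any real value into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(logistic(np.array([-5.0, 0.0, 5.0])))
```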

The Lomb-Scargle periodogram is a frequency spectrum estimation method that fits sines to data, and it is frequently used with unevenly sampled data. The method is named after Nicholas R. Lomb and Jeffrey D. Scargle.

The Matthews correlation coefficient (MCC) or phi coefficient is an evaluation metric for binary classification invented by Brian Matthews in 1975. The MCC is a correlation coefficient between targets and predictions, and it varies between -1 (total disagreement) and 1 (perfect agreement).
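
A minimal sketch with scikit-learn on made-up binary labels:

```python
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 0, 0, 1, 0, 1, 0]  # target classes (made up)
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]  # predicted classes (made up)

print(matthews_corrcoef(y_true, y_pred))
```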

Memory leaks are a common issue in computer programs, which we can find by performing memory profiling. Leaks occur when we don't release memory that is no longer needed.

Moore's law is the observation that the number of transistors in a modern computer chip doubles every two years. This trend has continued since Moore's law was formulated, around 1970. There is also a second law, known as Rock's law, which states that the cost of R&D and manufacturing of integrated circuits increases exponentially.

Named-entity recognition (NER) tries to detect names of persons, organizations, locations, and others in text. Some NER systems are almost as good as humans, but it is not an easy task. Named entities usually start with upper case, such as Ivan. We should therefore not change the case of words when applying NER.

Object-relational mapping (ORM): A software architecture pattern for translation between database schemas and object-oriented programming languages.

Open Computing Language (OpenCL), initially developed by Apple Inc., is an open technology standard for programs that can run on a variety of devices, including the CPUs and GPUs available on commodity hardware.

OpenCV (Open Source Computer Vision) is a library for computer vision created in 2000 and currently maintained by Itseez. OpenCV is written in C++, but it also has bindings to Python and other programming languages.

Opinion mining or sentiment analysis is a research field with the goal of efficiently finding and evaluating opinions and sentiment in text.

Principal component analysis (PCA), invented by Karl Pearson in 1901, is an algorithm that transforms data into uncorrelated orthogonal features called principal components. The principal components are the eigenvectors of the covariance matrix.
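
A minimal NumPy sketch on made-up correlated data, computing the components as eigenvectors of the covariance matrix:

```python
import numpy as np

np.random.seed(0)
X = np.random.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=200)

# Principal components are the eigenvectors of the covariance matrix
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Order the components by explained variance (largest eigenvalue first)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]

# Project the centered data onto the uncorrelated components
projected = (X - X.mean(axis=0)) @ components
print(np.round(np.cov(projected, rowvar=False), 3))  # near-diagonal covariance
```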

The Poisson distribution is named after the French mathematician Poisson, who published it in 1837. The Poisson distribution is a discrete distribution usually associated with counts for a fixed interval of time or space.

Robust regression is designed to deal better with outliers in data than ordinary regression. This type of regression uses special robust estimators.

Scatter plot: A two-dimensional plot showing the relationship between two variables in a Cartesian coordinate system. The values of one variable are represented on one axis and the values of the other variable on the other axis. We can quickly visualize correlation this way.

In the shared-nothing architecture, computing nodes don't share memory or files. The architecture is therefore totally decentralized, with completely independent nodes. The obvious advantage is that we are not dependent on any one node. The first commercial shared-nothing databases were created in the 1980s.

Signal processing is a field of engineering and applied mathematics that deals with the analysis of analog and digital signals corresponding to variables that vary with time.

Structured Query Language (SQL) is a specialized language for relational database querying and manipulation. This includes creating, inserting rows in, and deleting tables.

Short-time Fourier transform (STFT): The STFT splits a signal in the time domain into equal parts and then applies the FFT to each segment.

Stop words: Common words with low information value. Stop words are usually removed before analyzing text. Although filtering stop words is common practice, there is no standard definition of stop words.

The Spearman rank correlation uses ranks to correlate two variables with the Pearson correlation. Ranks are the positions of values in sorted order. Items with equal values get a rank, which is the average of their positions. For instance, if we have two items of equal value assigned positions 2 and 3, the rank is 2.5 for both items.
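
A minimal sketch with SciPy on made-up data containing a tie:

```python
import numpy as np
from scipy.stats import spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 4.0, 8.5, 9.0])  # the tied values share the average rank 2.5

rho, p_value = spearmanr(x, y)
print(rho, p_value)
```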

Spectral clustering is a clustering technique that can be used to segment images.

The star schema is a database pattern that facilitates reporting. Star schemas are appropriate for the processing of events such as website visits, ad clicks, or financial transactions. Event information (metrics such as temperature or purchase amount) is stored in fact tables linked to much smaller dimension tables. Star schemas are denormalized, which places the responsibility of integrity checks on the application code. For this reason, we should only write to the database in a controlled manner.

Term frequency-inverse document frequency (tf-idf) is a metric measuring the importance of a word in a corpus. It is composed of a term frequency number and an inverse document frequency number. The term frequency counts the number of times a word occurs in a document. The inverse document frequency counts the number of documents in which the word occurs and takes the inverse of the number.
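
A minimal sketch with scikit-learn's TfidfVectorizer on three made-up documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Words frequent in a document but rare in the corpus get the highest weights
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
print(tfidf.toarray().round(2))
```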

Time series: An ordered list of data points, starting with the oldest measurements. Usually, each data point has a related timestamp.

Violin plots combine box plots and kernel-density plots or histograms in one type of plot.

Winsorising is a technique to deal with outliers and is named after Charles Winsor. In effect, Winsorising clips outliers to given percentiles in a symmetric fashion.
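
A minimal sketch with SciPy on a made-up sample containing one outlier:

```python
import numpy as np
from scipy.stats.mstats import winsorize

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # 100 is an outlier

# Clip the lowest and highest 10% of values toward the remaining data
clipped = winsorize(data, limits=[0.1, 0.1])
print(clipped)
```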