Appendix A. Key Concepts

This appendix gives a brief overview and glossary of technical concepts used throughout the book.

Amdahl's law predicts the maximum possible speedup due to parallelization. The number of processes limits the absolute maximum speedup. Some parts of any given Python code might be impossible to parallelize. We also have to take into account overhead from parallelization setup and related interprocess communication. Amdahl's law states that there is a linear relationship between the inverse of the speedup, the inverse of the number of processes, and the portion of the code that cannot be parallelized.
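
For illustration, the predicted speedup can be computed directly from this relationship (a minimal sketch; here, p is the parallelizable fraction of the code and n is the number of processes):

    def amdahl_speedup(p, n):
        # 1/speedup = (1 - p) + p/n
        return 1.0 / ((1.0 - p) + p / n)

    print(amdahl_speedup(0.9, 8))  # roughly 4.7, even with 8 processes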

ARMA models combine autoregressive and moving average models. They are used to forecast future values of time series.
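
A minimal sketch with statsmodels and toy data (in recent statsmodels versions, an ARMA(p, q) model is specified as an ARIMA(p, 0, q) model, that is, with zero differencing):

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    series = np.random.randn(200).cumsum()          # toy time series
    results = ARIMA(series, order=(2, 0, 1)).fit()  # ARMA(2, 1)
    print(results.forecast(steps=5))                # forecast the next five values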

Artificial Neural Networks (ANN) are models inspired by the animal brain. A neural network is a network of neurons: units with inputs and outputs. The output of one neuron can be fed into another neuron, and so on, creating a multilayered network. Neural networks contain adaptive elements, making them suitable for nonlinear models and pattern recognition problems.

The Augmented Dickey-Fuller (ADF) test is a statistical test related to cointegration. The ADF test is used to check the stationarity of a time series.
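
For example, with statsmodels and toy data:

    import numpy as np
    from statsmodels.tsa.stattools import adfuller

    series = np.random.randn(200).cumsum()    # a random walk is non-stationary
    statistic, pvalue = adfuller(series)[:2]
    print(pvalue)  # a large p-value means stationarity cannot be concluded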

Autocorrelation is the correlation within a dataset and can indicate a trend. For example, if we have a lag of one period, we can check whether the previous value influences the current value. For that to be true, the autocorrelation value has to be relatively high.

Autocorrelation plots graph autocorrelations of time series data for different lags. Autocorrelation is the correlation of a time series with the same lagged time series.
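
pandas provides both the single-lag value and the full plot (a minimal sketch with toy periodic data):

    import numpy as np
    import pandas as pd
    from pandas.plotting import autocorrelation_plot

    series = pd.Series(np.sin(np.linspace(0, 20, 200)))
    print(series.autocorr(lag=1))   # autocorrelation for a lag of one period
    autocorrelation_plot(series)    # autocorrelations across many lags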

The autoregressive model uses (usually linear) regression to forecast future values of a time series from its previous values. Autoregressive models are a special case of ARMA models: they are equivalent to ARMA models with zero moving average components.

The bag-of-words model is a simplified model of text, in which the text is represented by a bag of words. In this representation, the order of the words is ignored. Typically, word counts or the presence of certain words are used as features in this model.
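
For example, scikit-learn's CountVectorizer builds such a representation (the get_feature_names_out() method assumes a recent scikit-learn version):

    from sklearn.feature_extraction.text import CountVectorizer

    texts = ["the cat sat", "the cat sat on the mat"]
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(texts)   # word counts; order is ignored
    print(vectorizer.get_feature_names_out())  # the vocabulary
    print(counts.toarray())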

Bubble charts are an extension of the scatter plot. In a bubble chart, the value of a third variable is represented by the size of the bubble surrounding a data point.
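
A minimal matplotlib sketch, where the third variable z controls the bubble size:

    import matplotlib.pyplot as plt

    x, y = [1, 2, 3, 4], [10, 20, 25, 30]
    z = [30, 80, 150, 400]            # third variable, shown as bubble area
    plt.scatter(x, y, s=z, alpha=0.5)
    plt.show()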

Cassandra Query Language (CQL) is a query language for Apache Cassandra with a syntax similar to SQL.

Cointegration is similar to correlation and is a statistical characteristic of time series data. Informally, cointegration is a measure of how synchronized two time series are: two non-stationary time series are cointegrated if some linear combination of them is stationary.

Clustering aims to partition data into groups called clusters. Clustering is usually unsupervised in the sense that the training data is not labeled. Some clustering algorithms require a guess for the number of clusters, while other algorithms don't.
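
For example, k-means clustering with scikit-learn, which requires a guess for the number of clusters:

    import numpy as np
    from sklearn.cluster import KMeans

    points = np.array([[1, 1], [1.2, 0.8], [8, 8], [8.2, 7.9]])
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(points)
    print(labels)  # for example, [0 0 1 1]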

Cascading Style Sheets (CSS) is a language used to style elements of a web page. CSS is maintained and developed by the World Wide Web Consortium.

CSS selectors are patterns used to select content in a web page.
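
For example, with the BeautifulSoup library:

    from bs4 import BeautifulSoup

    html = '<div class="intro"><p>Hello</p></div>'
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.select('div.intro p'))  # selects <p> tags inside div.intro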

Character codes are included in NumPy for backward compatibility with Numeric. Numeric is the predecessor of NumPy.

Data type objects are instances of the numpy.dtype class. They provide an object-oriented interface for manipulation of NumPy data types.
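
A minimal sketch; note that the char attribute exposes the character code mentioned above:

    import numpy as np

    dt = np.dtype('float64')
    print(dt.itemsize)  # size in bytes: 8
    print(dt.char)      # character code: 'd'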

Eigenvalues are the scalar solutions of the equation Ax = ax, where A is a square matrix and x is a nonzero vector.

Eigenvectors are the nonzero vectors x that satisfy the equation Ax = ax for a corresponding eigenvalue a.
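
Both can be computed with NumPy:

    import numpy as np

    A = np.array([[2, 0], [0, 3]])
    eigenvalues, eigenvectors = np.linalg.eig(A)
    print(eigenvalues)         # [2. 3.]
    print(eigenvectors[:, 0])  # eigenvector for the first eigenvalue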

The exponential moving average is a type of moving average whose weights decrease exponentially with time.
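
For example, with pandas:

    import pandas as pd

    series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
    print(series.ewm(span=3).mean())  # exponentially weighted moving average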

Fast Fourier Transform (FFT) is a fast algorithm to compute the Fourier transform. FFT is O(N log N), which is a huge improvement over older algorithms.
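
A minimal NumPy sketch that recovers the dominant frequency of a toy signal:

    import numpy as np

    t = np.linspace(0, 1, 500)
    signal = np.sin(2 * np.pi * 5 * t)                # 5 Hz sine wave
    spectrum = np.fft.fft(signal)
    freqs = np.fft.fftfreq(len(signal), t[1] - t[0])
    print(freqs[np.argmax(np.abs(spectrum))])         # close to 5 (or -5)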

Filtering is a signal-processing technique that involves removing or suppressing part of a signal. Many filter types exist, including the median and Wiener filters.
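
For example, both filters are available in SciPy:

    import numpy as np
    from scipy.signal import medfilt, wiener

    noisy = np.sin(np.linspace(0, 10, 100)) + np.random.randn(100) * 0.3
    smooth_median = medfilt(noisy, kernel_size=5)  # median filter
    smooth_wiener = wiener(noisy)                  # Wiener filter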

Fourier analysis is based on the Fourier series named after the mathematician Joseph Fourier. The Fourier series is a mathematical method to represent functions as an infinite series of sine and cosine terms. The functions in question can be real or complex valued.

Genetic algorithms are based on the biological theory of evolution. This type of algorithm is useful for searching and optimization.

Graphics Processing Units (GPUs) are specialized circuits used to display graphics efficiently. Recently, GPUs have also been used to perform massively parallel computations (for instance, to train neural networks).

The Hierarchical Data Format (HDF) is a specification and technology for the storage of big numerical data. The HDF Group maintains a related software library.

The Hilbert-Huang transform is a mathematical algorithm to decompose a signal. This method can be used to detect periodic cycles in time series data. It was used successfully to determine sunspot cycles.

HyperText Markup Language (HTML) is the fundamental technology used to create web pages. It defines tags for media, text, and hyperlinks.

The Internet Engineering Task Force (IETF) is an open group working on maintaining and developing the Internet. IETF is open in the sense that anybody can join in principle.

JavaScript Object Notation (JSON) is a data format. In this format, data is written down using JavaScript notation. JSON is more succinct than other data formats such as XML.
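
For example, with Python's standard json module:

    import json

    text = json.dumps({'name': 'Ivan', 'books': 2})  # Python object to JSON
    data = json.loads(text)                          # JSON back to Python
    print(text, data['books'])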

k-fold cross-validation is a form of cross-validation involving k (a small integer) random data partitions called folds. In k iterations, each fold is used once for validation, and the rest of the data is used for training. The results of the iterations can be combined at the end.
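
A minimal scikit-learn sketch with five folds:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    # each of the five folds is used once for validation
    scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=5)
    print(scores.mean())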

Kruskal-Wallis one-way analysis of variance is a statistical method that analyzes the variance of samples without making assumptions about their distributions.
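
For example, with SciPy and toy samples:

    from scipy.stats import kruskal

    sample1, sample2, sample3 = [1, 3, 5], [2, 6, 7], [10, 11, 12]
    statistic, pvalue = kruskal(sample1, sample2, sample3)
    print(pvalue)  # a small p-value suggests the samples differ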

The lag plot is a scatter plot for a time series and the same time series lagged. A lag plot shows autocorrelation within time series data for a certain lag.
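
For example, with pandas and toy periodic data:

    import numpy as np
    import pandas as pd
    from pandas.plotting import lag_plot

    series = pd.Series(np.sin(np.linspace(0, 20, 200)))
    lag_plot(series, lag=1)  # scatter of the series against itself lagged by one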

The learning curve is a way to visualize the behavior of a learning algorithm. It is a plot of training and test scores for a range of training data sizes.

Logarithmic plots (or log plots) are plots that use a logarithmic scale. This type of plot is useful when the data varies over several orders of magnitude.
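
A minimal matplotlib sketch with a logarithmic y axis:

    import matplotlib.pyplot as plt

    x = range(1, 11)
    y = [10 ** i for i in x]  # values spanning many orders of magnitude
    plt.semilogy(x, y)        # logarithmic y axis; loglog() scales both axes
    plt.show()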

Logistic regression is a type of classification algorithm. This algorithm can be used to predict probabilities associated with a class or an event occurring. Logistic regression is based on the logistic function, which takes values in the range between zero and one, just like probabilities. The logistic function can therefore be used to transform arbitrary values into probabilities.
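
The logistic function itself is easy to sketch:

    import numpy as np

    def logistic(x):
        # maps any real value into the (0, 1) range
        return 1.0 / (1.0 + np.exp(-x))

    print(logistic(0))    # 0.5
    print(logistic(-10))  # close to 0
    print(logistic(10))   # close to 1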

MapReduce is a distributed algorithm used to process large datasets with a cluster of computers. The algorithm consists of a Map phase and a Reduce phase. During the Map phase, data is processed in parallel: the data is split into parts, and filtering or other operations are performed on each part. In the Reduce phase, the results from the Map phase are aggregated.
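
A toy single-machine sketch of the two phases for word counting (a real MapReduce job distributes the work across a cluster):

    from collections import Counter

    documents = ["a b a", "b c"]
    mapped = [Counter(doc.split()) for doc in documents]  # Map: count each part
    reduced = sum(mapped, Counter())                      # Reduce: aggregate
    print(reduced)  # Counter({'a': 2, 'b': 2, 'c': 1})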

Moore's law is the observation that the number of transistors in a modern computer chip doubles every two years. This trend has continued since the law's formulation around 1970. There is also a second Moore's law, also known as Rock's law, which states that the cost of R&D and manufacturing of integrated circuits increases exponentially.

Moving averages specify a window of previously seen data that is averaged each time the window slides forward by one period. The different types of moving average differ essentially in the weights used for averaging.
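
For example, a simple moving average (equal weights) with pandas:

    import pandas as pd

    series = pd.Series([1, 2, 3, 4, 5, 6])
    print(series.rolling(window=3).mean())  # window of three periods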

Naive Bayes classification is a probabilistic classification algorithm based on Bayes' theorem from probability theory and statistics. It is called naive because of its strong independence assumptions.
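
A minimal scikit-learn sketch using Gaussian naive Bayes:

    from sklearn.datasets import load_iris
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    model = GaussianNB().fit(X, y)
    print(model.predict(X[:3]))  # predicted classes for the first three rows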

Object-relational mapping (ORM) is a software architecture pattern for translation between database schemas and object-oriented programming languages.

Opinion mining or sentiment analysis is a research field with the goal of efficiently finding and evaluating opinions and sentiments in text.

Part of Speech (POS) tags are tags assigned to each word in a sentence. These tags indicate the grammatical role of a word, such as verb or noun.
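
For example, with NLTK (the tokenizer and tagger models require one-time downloads via nltk.download(); resource names vary by NLTK version):

    import nltk

    tokens = nltk.word_tokenize("The cat sat quietly")
    print(nltk.pos_tag(tokens))
    # roughly [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('quietly', 'RB')]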

Representational State Transfer (REST) is an architectural style for web services.

Really Simple Syndication (RSS) is a standard for the publication and retrieval of web feeds such as blogs.

The scatter plot is a two-dimensional plot showing the relationship between two variables in a Cartesian coordinate system. The values of one variable are represented on one axis, and the values of the other variable on the other axis. We can quickly visualize correlation this way.

Signal processing is a field of engineering and applied mathematics that handles the analysis of analog and digital signals, corresponding to variables that vary with time.

Structured Query Language (SQL) is a specialized language for relational database querying and manipulation. This includes creating tables, inserting rows into tables, and deleting tables.
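
For example, with Python's built-in sqlite3 module:

    import sqlite3

    con = sqlite3.connect(':memory:')
    con.execute('CREATE TABLE books (title TEXT, year INTEGER)')
    con.execute('INSERT INTO books VALUES (?, ?)', ('Python Data Analysis', 2017))
    print(con.execute('SELECT * FROM books').fetchall())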

Stopwords are common words with low-information value. Stopwords are usually removed before analyzing text. Although filtering stopwords is a common practice, there is no standard definition for stopwords.
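
For example, filtering with NLTK's English stopword list (requires a one-time nltk.download('stopwords')):

    from nltk.corpus import stopwords

    english_stopwords = set(stopwords.words('english'))
    words = "this is a sample sentence".split()
    print([w for w in words if w not in english_stopwords])  # ['sample', 'sentence']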

Supervised learning is a type of machine learning that requires labeled training data.

Support vector machines (SVM) can be used for regression (SVR) and classification (SVC). SVM maps the data points to points in a multidimensional space. The mapping is performed by a so-called kernel function. The kernel function can be linear or nonlinear.
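
A minimal scikit-learn sketch with a nonlinear kernel:

    from sklearn.datasets import load_iris
    from sklearn.svm import SVC  # SVR is the regression counterpart

    X, y = load_iris(return_X_y=True)
    model = SVC(kernel='rbf').fit(X, y)  # radial basis function kernel
    print(model.predict(X[:3]))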

Term frequency-inverse document frequency (tf-idf) is a metric measuring the importance of a word in a corpus. It is composed of a term frequency number and an inverse document frequency number. The term frequency counts the number of times a word occurs in a document. The inverse document frequency counts the number of documents in which the word occurs and takes the inverse of the number.
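
For example, scikit-learn computes tf-idf weights directly (get_feature_names_out() assumes a recent scikit-learn version):

    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = ["the cat sat", "the dog sat", "the cat ran"]
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(texts)
    # words occurring in many documents, such as 'the', get lower weights
    print(dict(zip(vectorizer.get_feature_names_out(), tfidf.toarray()[0])))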

A time series is an ordered list of data points, starting with the oldest measurements. Usually, each data point has a related timestamp. A time series can be stationary or non-stationary.