Python Data Analysis

By: Ivan Idris

Pivot tables

A pivot table, as known from Excel, summarizes data. The CSV data we have seen so far in this chapter has been in flat files; a pivot table aggregates data from such a flat file for selected columns and rows. The aggregating operation can be a sum, mean, standard deviation, and so on. We will reuse the data-generating code from data_aggregation.py. The pandas API has a top-level pivot_table() function and a corresponding DataFrame method. The aggfunc parameter specifies the aggregation function, for instance, the NumPy sum() function. The cols parameter (renamed columns in later pandas releases) names the column whose distinct values become the columns of the pivot table; the remaining numeric columns are aggregated for each of those values. Create a pivot table on the Food column as follows:

print pd.pivot_table(df, cols=['Food'], aggfunc=np.sum)

The pivot table we get contains totals for each food item:

Food    chocolate   icecream      soup
Number   8.000000  15.000000  19.00000
Price    5.986585  10.440071  13.83338

[2 rows x 3 columns]
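
For readers on a newer pandas release, the same idea can be sketched as follows. This is only an illustration under stated assumptions: the DataFrame below is a made-up stand-in for the one generated by data_aggregation.py (its values are not the book's data), the columns keyword is used in place of cols, and values is passed explicitly so that only the numeric columns are aggregated.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame built in data_aggregation.py;
# the rows below are invented purely for illustration.
df = pd.DataFrame({
    "Weather": ["cold", "hot", "cold", "hot", "cold", "hot"],
    "Food": ["soup", "soup", "icecream", "chocolate", "icecream", "icecream"],
    "Price": [2.0, 3.0, 1.5, 4.0, 2.5, 1.0],
    "Number": [4, 5, 2, 8, 6, 7],
})

# columns= is the newer spelling of cols=. Each distinct Food value
# becomes a column, and values= limits the aggregation to the numeric
# columns, so the Weather strings are left out of the sums.
pivot = pd.pivot_table(
    df,
    values=["Number", "Price"],
    columns=["Food"],
    aggfunc="sum",
)
print(pivot)

# The DataFrame method takes the same arguments and gives the same table.
same = df.pivot_table(values=["Number", "Price"], columns=["Food"], aggfunc="sum")
```

As in the output above, the result has one row per aggregated column (Number and Price) and one column per food item.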

The following code can be found in pivot_demo.py in this book...