Python Data Analysis

Python Data Analysis - Second Edition

By : Ivan Idris

Overview of this book

Data analysis techniques generate useful insights from small and large volumes of data. Python, with its strong set of libraries, has become a popular platform to conduct various data analysis and predictive modeling tasks. With this book, you will learn how to process and manipulate data with Python for complex analysis and modeling. We learn data manipulations such as aggregating, concatenating, appending, cleaning, and handling missing values, with NumPy and Pandas. The book covers how to store and retrieve data from various data sources such as SQL and NoSQL, CSV files, and HDF5. We learn how to visualize data using visualization libraries, along with advanced topics such as signal processing, time series, textual data analysis, machine learning, and social media analysis. The book covers a plethora of Python modules, such as matplotlib, statsmodels, scikit-learn, and NLTK. It also covers using Python with external environments such as R, Fortran, C/C++, and Boost libraries.

Credits

About the Author

About the Reviewers

www.PacktPub.com

Customer Feedback

Preface

Getting Started with Python Libraries

Installing Python 3

Using IPython as a shell

Where to find help and references

Listing modules inside the Python libraries

Visualizing data using Matplotlib

Summary

NumPy Arrays

The NumPy array object

Creating a multidimensional array

Selecting NumPy array elements

NumPy numerical types

One-dimensional slicing and indexing

Manipulating array shapes

Creating array views and copies

Fancy indexing

Indexing with a list of locations

Indexing NumPy arrays with Booleans

Broadcasting NumPy arrays

Summary

References

The Pandas Primer

Installing and exploring Pandas

The Pandas DataFrames

The Pandas Series

Querying data in Pandas

Statistics with Pandas DataFrames

Data aggregation with Pandas DataFrames

Concatenating and appending DataFrames

Joining DataFrames

Handling missing values

Dealing with dates

Pivot tables

Summary

References

Statistics and Linear Algebra

Basic descriptive statistics with NumPy

Linear algebra with NumPy

Finding eigenvalues and eigenvectors with NumPy

NumPy random numbers

Creating a NumPy masked array

Summary

Retrieving, Processing, and Storing Data

Writing CSV files with NumPy and Pandas

The binary .npy and pickle formats

Storing data with PyTables

Reading and writing Pandas DataFrames to HDF5 stores

Reading and writing to Excel with Pandas

Using REST web services and JSON

Reading and writing JSON with Pandas

Parsing RSS and Atom feeds

Parsing HTML with Beautiful Soup

Summary

Reference

Data Visualization

The matplotlib subpackages

Basic matplotlib plots

Logarithmic plots

Scatter plots

Legends and annotations

Three-dimensional plots

Plotting in Pandas

Lag plots

Autocorrelation plots

Plot.ly

Summary

Signal Processing and Time Series

The statsmodels modules

Moving averages

Window functions

Defining cointegration

Autocorrelation

Autoregressive models

ARMA models

Generating periodic signals

Fourier analysis

Spectral analysis

Filtering

Summary

Working with Databases

Lightweight access with sqlite3

Accessing databases from Pandas

SQLAlchemy

Pony ORM

Dataset - databases for lazy people

PyMongo and MongoDB

Storing data in Redis

Storing data in memcache

Apache Cassandra

Summary

Analyzing Textual Data and Social Media

Installing NLTK

About NLTK

Filtering out stopwords, names, and numbers

The bag-of-words model

Analyzing word frequencies

Naive Bayes classification

Sentiment analysis

Creating word clouds

Social network analysis

Summary

Predictive Analytics and Machine Learning

Preprocessing

Classification with logistic regression

Classification with support vector machines

Regression with ElasticNetCV

Support vector regression

Clustering with affinity propagation

Summary

Environments Outside the Python Ecosystem and Cloud Computing

Exchanging information with Matlab/Octave

Installing rpy2 package

Interfacing with R

Sending NumPy arrays to Java

Integrating SWIG and NumPy

Integrating Boost and Python

Using Fortran code through f2py

PythonAnywhere Cloud

Summary

Performance Tuning, Profiling, and Concurrency

Profiling the code

Installing Cython

Calling C code

Creating a process pool with multiprocessing

Speeding up embarrassingly parallel for loops with Joblib

Comparing Bottleneck to NumPy functions

Performing MapReduce with Jug

Installing MPI for Python

IPython Parallel

Summary

Key Concepts

Useful Functions

Matplotlib

NumPy

Pandas

Scikit-learn

SciPy

Online Resources

A simple application

Imagine that we want to add two vectors called a and b. The word vector is used here in the mathematical sense, which means a one-dimensional array. We will learn about specialized NumPy arrays that represent matrices in Chapter 4, Statistics and Linear Algebra. The vector a holds the squares of the integers 0 up to, but not including, n; for instance, if n is equal to 3, a contains 0, 1, and 4. The vector b holds the cubes of the same integers, so if n is equal to 3, then b contains 0, 1, and 8. How would you do that using plain Python? After we come up with a solution, we will compare it to the NumPy equivalent.

The following function solves the vector addition problem using pure Python without NumPy:

def pythonsum(n):
    a = list(range(n))
    b = list(range(n))
    c = []

    for i in range(len(a)):
        a[i] = i ** 2
        b[i] = i ** 3
        c.append(a[i] + b[i])

    return c

The following is a function that solves the vector addition problem with NumPy:

import numpy

def numpysum(n):
    a = numpy.arange(n) ** 2
    b = numpy.arange(n) ** 3
    c = a + b
    return c

Note that numpysum() does not need a for loop. It uses the arange() function from NumPy, which creates a NumPy array holding the integers from 0 to n - 1. Because arange() lives in the numpy module, it is called with the numpy prefix.
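To make the behavior of arange() concrete, the following sketch prints the intermediate arrays for n equal to 3 and checks the vectorized result against a plain-Python computation. The names base and expected are introduced here only for illustration.

```python
import numpy as np

n = 3

# arange(n) yields the integers 0, 1, ..., n - 1 (n itself is excluded).
base = np.arange(n)

# Arithmetic operators act element by element, so no explicit loop is needed.
a = base ** 2   # squares
b = base ** 3   # cubes
c = a + b       # element-wise sum

print(base.tolist())  # [0, 1, 2]
print(a.tolist())     # [0, 1, 4]
print(b.tolist())     # [0, 1, 8]
print(c.tolist())     # [0, 2, 12]

# Cross-check against a pure-Python computation of the same sums.
expected = [i ** 2 + i ** 3 for i in range(n)]
assert c.tolist() == expected
```

The tolist() calls are there only to print plain Python lists; the arithmetic itself happens on NumPy arrays.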

Now comes the fun part. We mentioned earlier that NumPy is faster when it comes to array operations. How much faster is NumPy, though? The following program will show us by measuring the elapsed time in microseconds for the numpysum() and pythonsum() functions. It also prints the last two elements of the vector sum. Let's check that we get the same answers using Python and NumPy:

#!/usr/bin/env python
"""
This program demonstrates vector addition the Python way.
Run the following from the command line:

  python vectorsum.py n

Here, n is an integer that specifies the size of the vectors.

The first vector to be added contains the squares of 0 up to n - 1.
The second vector contains the cubes of 0 up to n - 1.
The program prints the last 2 elements of the sum and the elapsed time.
"""

import sys
from datetime import datetime

import numpy as np


def numpysum(n):
    a = np.arange(n) ** 2
    b = np.arange(n) ** 3
    c = a + b

    return c


def pythonsum(n):
    a = list(range(n))
    b = list(range(n))
    c = []

    for i in range(len(a)):
        a[i] = i ** 2
        b[i] = i ** 3
        c.append(a[i] + b[i])

    return c


size = int(sys.argv[1])

# total_seconds() covers the whole duration; the .microseconds attribute
# holds only the sub-second part and would be wrong for runs over one second.
start = datetime.now()
c = pythonsum(size)
delta = datetime.now() - start
print("The last 2 elements of the sum", c[-2:])
print("PythonSum elapsed time in microseconds", int(delta.total_seconds() * 1e6))

start = datetime.now()
c = numpysum(size)
delta = datetime.now() - start
print("The last 2 elements of the sum", c[-2:])
print("NumPySum elapsed time in microseconds", int(delta.total_seconds() * 1e6))

The output of the program for 1000, 2000, and 4000 vector elements is as follows:

$ python3 vectorsum.py 1000
The last 2 elements of the sum [995007996, 998001000]
PythonSum elapsed time in microseconds 976
The last 2 elements of the sum [995007996 998001000]
NumPySum elapsed time in microseconds 87
$ python3 vectorsum.py 2000
The last 2 elements of the sum [7980015996, 7992002000]
PythonSum elapsed time in microseconds 1623
The last 2 elements of the sum [7980015996 7992002000]
NumPySum elapsed time in microseconds 143
$ python3 vectorsum.py 4000
The last 2 elements of the sum [63920031996, 63968004000]
PythonSum elapsed time in microseconds 3417
The last 2 elements of the sum [63920031996 63968004000]
NumPySum elapsed time in microseconds 237
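A single measurement taken with datetime.now() can be noisy, since it includes whatever else the machine happens to be doing at that moment. As a cross-check, a minimal sketch using the standard-library timeit module might look like the following. The helper name best_time_us and the repeat counts are arbitrary choices for illustration and are not part of the book's program, and pythonsum() is condensed into a list comprehension here, so its timings will not match the figures above exactly.

```python
import timeit

import numpy as np


def numpysum(n):
    # Vectorized: squares plus cubes of 0 .. n - 1.
    return np.arange(n) ** 2 + np.arange(n) ** 3


def pythonsum(n):
    # Pure-Python equivalent, condensed into a list comprehension.
    return [i ** 2 + i ** 3 for i in range(n)]


def best_time_us(func, n, repeat=5, number=100):
    """Return the best average time per call, in microseconds.

    Hypothetical helper for illustration only.
    """
    # timeit.repeat returns one total time (in seconds) per repetition;
    # each total covers `number` calls.
    timings = timeit.repeat(lambda: func(n), repeat=repeat, number=number)
    return min(timings) / number * 1e6


for size in (1000, 2000, 4000):
    print(size,
          "pythonsum:", round(best_time_us(pythonsum, size), 1),
          "numpysum:", round(best_time_us(numpysum, size), 1))
```

Taking the minimum over several repetitions is a common way to reduce the influence of background activity; the absolute numbers will still vary from machine to machine.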

In these runs, NumPy is roughly 11 to 14 times faster than the equivalent plain Python code. We also get the same results whether we use NumPy or not. However, the printed result differs in representation: the output from the numpysum() function does not have any commas. That is because we are not dealing with a Python list, but with a NumPy array. We will learn more about NumPy arrays in Chapter 2, NumPy Arrays.
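The difference in printed form can be reproduced in isolation. The sketch below takes the last two sums from the n = 1000 run and prints them once as a list and once as an array; the variable names are illustrative only.

```python
import numpy as np

# Last two sums for n = 1000, taken from the program output.
values = [995007996, 998001000]

as_list = values
as_array = np.array(values)

print(as_list)    # a list prints with commas between items
print(as_array)   # an array prints with spaces between items
print(type(as_list).__name__, type(as_array).__name__)

# tolist() converts the array back to a plain list when list formatting is wanted.
print(as_array.tolist())
```

Both objects hold the same numbers; only the type, and therefore the string representation, differs.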
