Mastering Python Data Analysis

By: Magnus Vilhelm Persson

Overview of this book

Python, a multi-paradigm programming language, has become the language of choice for data scientists for data analysis, visualization, and machine learning. Have you ever wondered how to become an expert at approaching data analysis problems effectively, solving them, and extracting all of the available information from your data? If so, this is the book for you. Through this comprehensive guide, you will explore data and present results and conclusions from statistical analysis in a meaningful way. You’ll be able to quickly and accurately perform the hands-on sorting, reduction, and subsequent analysis, and fully appreciate how data analysis methods can support business decision-making. You’ll start off by learning about the tools available for data analysis in Python and will then explore the statistical models that are used to identify patterns in data. Gradually, you’ll move on to review statistical inference using Python, Pandas, and SciPy. After that, you’ll focus on performing regression using computational tools and come to understand the problem of identifying clusters in data in an algorithmic way. Finally, you’ll delve into advanced techniques to quantify cause and effect using Bayesian methods and discover how to use Python’s tools for supervised machine learning.

Credits

About the Authors

About the Reviewer

www.PacktPub.com

Preface

Tools of the Trade

Before you start

Using the notebook interface

Imports

An example using the Pandas library

Summary

Exploring Data

The General Social Survey

Univariate data

Relationships between variables – scatterplots

Summary

Learning About Models

Models and experiments

The cumulative distribution function

Working with distributions

The probability density function

Where do models come from?

Multivariate distributions

Summary

Regression

Introducing linear regression

Multivariate regression

Logistic regression

Summary

Clustering

Introduction to cluster finding

K-means clustering

Hierarchical clustering analysis

Summary

Bayesian Methods

The Bayesian method

U.S. air travel safety record

Climate change - CO2 in the atmosphere

Summary

Supervised and Unsupervised Learning

Introduction to machine learning

Summary

Time Series Analysis

Introduction

Pandas and time series data

Indexing and slicing

Resampling, smoothing, and other estimates

Stationarity

Patterns and components

Time series models

Summary

More on Jupyter Notebook and matplotlib Styles

Jupyter Notebook

Matplotlib styles

Useful resources

Summary

Before you start

We assume that you are familiar with Python and have already developed and run some scripts or used Python interactively, either in the shell or in another interface, such as the Jupyter Notebook (formerly known as the IPython notebook). Hence, we also assume that you have a working installation of Python.

We also assume that you have developed your own workflow with Python, based on your needs and available environment. To follow the examples in this book, you need a working installation of Python 3.4 or later. There are two ways to get started, as outlined in the following list:

Use a Python installation from scratch. This can be downloaded from https://www.python.org. This will require a separate installation for each of the required libraries.
Install a prepackaged distribution containing libraries for scientific and data computing. Two popular distributions are Anaconda Scientific Python (https://store.continuum.io/cshop/anaconda) and the Enthought distribution (https://www.enthought.com).
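
Whichever route you choose, it can be useful to confirm that the interpreter you are running meets the Python 3.4 requirement. The following is a minimal sketch using only the standard library; the names MIN_VERSION and python_is_recent_enough are hypothetical and chosen just for this illustration.

```python
import sys

# Minimum version assumed throughout the book (Python 3.4 or later).
MIN_VERSION = (3, 4)


def python_is_recent_enough(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) part of version_info is at least minimum."""
    return tuple(version_info[:2]) >= minimum


# Report the version of the interpreter that is running this script.
status = "OK" if python_is_recent_enough() else "too old"
print("Python", sys.version.split()[0], status)
```

Running this with each Python installation on your system shows which of them is suitable for the examples.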

Tip

Even if you have a working Python installation, you might want to try one of the prepackaged distributions. They contain a well-rounded collection of packages and modules suitable for data analysis and scientific computing. If you choose this path, all the libraries in the next list are included by default.

We also assume that you have the libraries in the following list:

numpy and scipy: These are available at http://www.scipy.org. These are the essential Python libraries for computational work. NumPy defines a fast and flexible array data structure, and SciPy has a large collection of functions for numerical computing. They are required by some of the other libraries in this list.
matplotlib: This is available at http://matplotlib.org. It is a library for interactive graphics built on top of NumPy. We recommend versions above 1.5, which Anaconda Python includes by default.
pandas: This is available at http://pandas.pydata.org. It is a Python data analysis library. It will be used extensively throughout the book.
pymc: This is a library to make Bayesian models and fitting in Python accessible and straightforward. It is available at http://pymc-devs.github.io/pymc/. This package will mainly be used in Chapter 6, Bayesian Methods, of this book.
scikit-learn: This is available at http://scikit-learn.org. It is a library for machine learning in Python. This package is used in Chapter 7, Supervised and Unsupervised Learning.
IPython: This is available at http://ipython.org. It is a library providing enhanced tools for interactive computations in Python from the command line.
Jupyter: This is available at https://jupyter.org/. It is the notebook interface working on top of IPython (and other programming languages). Originally part of the IPython project, the notebook interface is a web-based platform for computational and data science that allows easy integration of the tools that are used in this book.

Note that each of the libraries in the preceding list may have several dependencies, which must also be installed separately. To test the availability of any of the packages, start a Python shell and run the corresponding import statement. For example, to test the availability of NumPy, run the following command:

import numpy

If NumPy is not installed on your system, this will produce an error message. An alternative approach that does not require starting a Python shell is to run the following from the command line:

python -c 'import numpy'
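
To test all of the libraries in one go, the same idea can be put in a short script. The following is a minimal sketch, not part of the original text: the function name check_packages is hypothetical, and the import names in REQUIRED are assumptions about how each library is imported (for example, scikit-learn is imported as sklearn). It uses importlib.util.find_spec, available from Python 3.4, so the libraries are located without actually being imported.

```python
import importlib.util

# Import names assumed for the libraries listed above.
# Note that scikit-learn is imported under the name "sklearn".
REQUIRED = ["numpy", "scipy", "matplotlib", "pandas",
            "pymc", "sklearn", "IPython", "jupyter"]


def check_packages(names):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None
            for name in names}


if __name__ == "__main__":
    # Print one line per library with its availability.
    for name, found in sorted(check_packages(REQUIRED).items()):
        print("{:<12} {}".format(name, "OK" if found else "MISSING"))
```

Any library reported as MISSING needs to be installed before running the examples that depend on it.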

We also assume that you have either a programmer's editor or Python IDE. There are several options, but at the basic level, any editor capable of working with unformatted text files will do.
