Mastering pandas - Second Edition

By : Ashish Kumar

URL and S3

Sometimes the data is available directly at a URL. In such cases, read_csv can read from the URL directly:

import pandas as pd

pd.read_csv('http://bit.ly/2cLzoxH').head()

Alternatively, to get data from a URL, we can use two modules from the Python standard library that we haven't used so far: csv and urllib. It is enough to know that csv provides a range of methods for handling CSV files and that urllib is used to open a URL and retrieve the information behind it. Here is how we can do this:

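import csv
import urllib.request

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

with urllib.request.urlopen(url) as response:
    # The response yields bytes; decode each line to text for csv.reader
    lines = (line.decode('utf-8') for line in response)
    for row in csv.reader(lines):
        print(row)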
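The loop above can also be wrapped in a small helper, so that the parsing step can be tried without a network connection. What follows is a minimal sketch rather than the book's own code: the name rows_from_byte_lines is hypothetical, UTF-8 is an assumed encoding for the remote file, and the in-memory sample merely stands in for a downloaded response.

```python
import csv
import io


def rows_from_byte_lines(byte_lines, encoding="utf-8"):
    """Decode an iterable of byte lines and parse them as CSV rows.

    Hypothetical helper for illustration. It accepts any iterable of
    bytes lines, such as the object returned by urllib.request.urlopen
    or an io.BytesIO buffer.
    """
    text_lines = (line.decode(encoding) for line in byte_lines)
    # csv.reader yields an empty list for a blank line; drop those rows
    return [row for row in csv.reader(text_lines) if row]


# Offline stand-in for a downloaded file (assumed sample in the iris.data layout)
sample = io.BytesIO(
    b"5.1,3.5,1.4,0.2,Iris-setosa\n"
    b"4.9,3.0,1.4,0.2,Iris-setosa\n"
    b"\n"
)
rows = rows_from_byte_lines(sample)
print(rows[0])
```

With a live connection, the same helper could be given the response object directly, for example rows_from_byte_lines(urllib.request.urlopen(url)). Keeping the download separate from the parsing makes the parsing logic easy to check on its own.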

AWS S3 is a popular file-sharing and storage repository on the web. Many enterprises store their business operations data as files on S3, which needs...