Mastering pandas - Second Edition

By: Ashish Kumar

Overview of this book

pandas is a popular Python library used by data scientists and analysts worldwide to manipulate and analyze their data. This book presents useful data manipulation techniques in pandas to perform complex data analysis in various domains. An update to our highly successful previous edition with new features, examples, updated code, and more, this book is an in-depth guide to get the most out of pandas for data analysis. Designed for both intermediate users as well as seasoned practitioners, you will learn advanced data manipulation techniques, such as multi-indexing, modifying data structures, and sampling your data, which allow for powerful analysis and help you gain accurate insights from it. With the help of this book, you will apply pandas to different domains, such as Bayesian statistics, predictive analytics, and time series analysis using an example-based approach. And not just that; you will also learn how to prepare powerful, interactive business reports in pandas using the Jupyter notebook. By the end of this book, you will learn how to perform efficient data analysis using pandas on complex data, and become an expert data analyst or data scientist in the process.

A naive approach to the Titanic problem

Our first attempt at classifying the Titanic data is to use a naive, yet very intuitive, approach. This approach involves the following steps:

1. Select a set of features, S, that influence whether a person survived or not.
2. For each possible combination of feature values, use the training data to determine whether the majority of cases survived or not. The results can be recorded in what is known as a survival table.
3. For each test example whose survival we wish to predict, look up the combination that corresponds to the values of its features and assign the survival value from the survival table as its prediction.

This approach can be viewed as a naive K-nearest neighbor approach.
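The steps above can be sketched in pandas as follows. This is only a minimal illustration, not the book's own code: the toy rows are made up, the column names (Pclass, Sex, Survived) are assumed to follow the Kaggle Titanic dataset, and treating ties and unseen combinations as "did not survive" is an arbitrary choice made here.

```python
import pandas as pd

# Hypothetical toy training data; the values are invented for illustration.
train = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3, 2],
    "Sex":      ["female", "female", "male", "male",
                 "male", "female", "female", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 0],
})

# Step 1: choose the feature set S (two features here, to keep the sketch small).
features = ["Pclass", "Sex"]

# Step 2: for each combination of feature values, record whether the
# majority of training cases survived. A tie (mean == 0.5) is treated
# as "did not survive" in this sketch.
survival_table = (
    train.groupby(features)["Survived"]
    .mean()
    .gt(0.5)
    .astype(int)
    .rename("Predicted")
    .reset_index()
)

# Step 3: look up each test example's combination in the survival table.
test = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex":    ["female", "male", "female"],
})
predictions = test.merge(survival_table, on=features, how="left")

# Combinations never seen in training come back as NaN; this sketch
# falls back to predicting 0 (did not survive) for them.
predictions["Predicted"] = predictions["Predicted"].fillna(0).astype(int)

print(survival_table)
print(predictions)
```

The left merge keeps the test rows in their original order, so the Predicted column lines up with the test examples.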

Based on what we have seen earlier in our analysis, three features seem to have the most influence on the survival rate:

Passenger class
Gender
Passenger fare (bucketed)
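Bucketing the fare turns a continuous value into a small number of categories, so it can be combined with class and gender in a lookup table. The sketch below uses pd.cut for this. The bucket edges and the sample fares are assumptions chosen for illustration and are not taken from the text.

```python
import pandas as pd

# Hypothetical fares; the values are invented for illustration.
fares = pd.Series([7.25, 71.28, 8.05, 26.0, 13.0, 512.33], name="Fare")

# Assumed bucket edges: [0, 10), [10, 20), [20, 30), [30, inf).
bins = [0, 10, 20, 30, float("inf")]

# labels=False returns the integer index of each bucket;
# right=False makes each interval closed on the left.
fare_bucket = pd.cut(fares, bins=bins, labels=False, right=False)

print(pd.DataFrame({"Fare": fares, "FareBucket": fare_bucket}))
```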

...