Python Data Analysis - Third Edition

By: Avinash Navlani, Ivan Idris

Overview of this book

Data analysis enables you to generate value from small and big data by discovering new patterns and trends, and Python is one of the most popular tools for analyzing a wide variety of data. With this book, you’ll get up and running using Python for data analysis by exploring the different phases and methodologies used in data analysis and learning how to use modern libraries from the Python ecosystem to create efficient data pipelines. Starting with the essential statistical and data analysis fundamentals using Python, you’ll perform complex data analysis and modeling, data manipulation, data cleaning, and data visualization using easy-to-follow examples. You’ll then understand how to conduct time series analysis and signal processing using ARMA models. As you advance, you’ll get to grips with smart processing and data analytics using machine learning algorithms such as regression, classification, Principal Component Analysis (PCA), and clustering. In the concluding chapters, you’ll work on real-world examples to analyze textual and image data using natural language processing (NLP) and image analytics techniques, respectively. Finally, the book will demonstrate parallel computing using Dask. By the end of this data analysis book, you’ll be equipped with the skills you need to prepare data for analysis and create meaningful data visualizations for forecasting values from data.

Preface

Who this book is for

What this book covers

To get the most out of this book

Get in touch

Section 1: Foundation for Data Analysis

Getting Started with Python Libraries

Understanding data analysis

The standard process of data analysis

The KDD process

SEMMA

CRISP-DM

Comparing data analysis and data science

The skillsets of data analysts and data scientists

Installing Python 3

Software used in this book

Using IPython as a shell

Using JupyterLab

Using Jupyter Notebooks

Advanced features of Jupyter Notebooks

Summary

NumPy and pandas

Technical requirements

Understanding NumPy arrays

NumPy array numerical data types

Manipulating array shapes

The stacking of NumPy arrays

Partitioning NumPy arrays

Changing the data type of NumPy arrays

Creating NumPy views and copies

Slicing NumPy arrays

Boolean and fancy indexing

Broadcasting arrays

Creating pandas DataFrames

Understanding pandas Series

Reading and querying the Quandl data

Describing pandas DataFrames

Grouping and joining pandas DataFrame

Working with missing values

Creating pivot tables

Dealing with dates

Summary

References

Statistics

Technical requirements

Understanding attributes and their types

Measuring central tendency

Measuring dispersion

Skewness and kurtosis

Understanding relationships using covariance and correlation coefficients

Central limit theorem

Collecting samples

Performing parametric tests

Performing non-parametric tests

Summary

Linear Algebra

Technical requirements

Fitting to polynomials with NumPy

Determinant

Finding the rank of a matrix

Matrix inverse using NumPy

Solving linear equations using NumPy

Decomposing a matrix using SVD

Eigenvectors and Eigenvalues using NumPy

Generating random numbers

Binomial distribution

Normal distribution

Testing normality of data using SciPy

Creating a masked array using the numpy.ma subpackage

Summary

Section 2: Exploratory Data Analysis and Data Cleaning

Data Visualization

Technical requirements

Visualization using Matplotlib

Advanced visualization using the Seaborn package

Interactive visualization with Bokeh

Summary

Retrieving, Processing, and Storing Data

Technical requirements

Reading and writing CSV files with NumPy

Reading and writing CSV files with pandas

Reading and writing data from Excel

Reading and writing data from JSON

Reading and writing data from HDF5

Reading and writing data from HTML tables

Reading and writing data from Parquet

Reading and writing data from a pickle pandas object

Lightweight access with sqlite3

Reading and writing data from MySQL

Reading and writing data from MongoDB

Reading and writing data from Cassandra

Reading and writing data from Redis

PonyORM

Summary

Cleaning Messy Data

Technical requirements

Exploring data

Filtering data to weed out the noise

Handling missing values

Handling outliers

Feature encoding techniques

Feature scaling

Feature transformation

Feature splitting

Summary

Signal Processing and Time Series

Technical requirements

The statsmodels modules

Moving averages

Window functions

Defining cointegration

STL decomposition

Autocorrelation

Autoregressive models

ARMA models

Generating periodic signals

Fourier analysis

Spectral analysis filtering

Summary

Section 3: Deep Dive into Machine Learning

Supervised Learning - Regression Analysis

Technical requirements

Linear regression

Understanding multicollinearity

Dummy variables

Developing a linear regression model

Evaluating regression model performance

Fitting polynomial regression

Regression models for classification

Logistic regression

Implementing logistic regression using scikit-learn

Summary

Supervised Learning - Classification Techniques

Technical requirements

Classification

Naive Bayes classification

Decision tree classification

KNN classification

SVM classification

Splitting training and testing sets

Evaluating the classification model performance

ROC curve and AUC

Summary

Unsupervised Learning - PCA and Clustering

Technical requirements

Unsupervised learning

Reducing the dimensionality of data

Clustering

Partitioning data using k-means clustering

Hierarchical clustering

DBSCAN clustering

Spectral clustering

Evaluating clustering performance

Summary

Section 4: NLP, Image Analytics, and Parallel Computing

Analyzing Textual Data

Technical requirements

Installing NLTK and SpaCy

Text normalization

Tokenization

Removing stopwords

Stemming and lemmatization

POS tagging

Recognizing entities

Dependency parsing

Creating a word cloud

Bag of Words

TF-IDF

Sentiment analysis using text classification

Text similarity

Summary

Analyzing Image Data

Technical requirements

Installing OpenCV

Understanding image data

Color models

Drawing on images

Writing on images

Resizing images

Flipping images

Changing the brightness

Blurring an image

Face detection

Summary

Parallel Computing Using Dask

Parallel computing using Dask

Dask data types

Dask Delayed

Preprocessing data at scale

Machine learning at scale

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Splitting training and testing sets

Data scientists need to assess a model's performance, guard against overfitting, and tune its hyperparameters. All of these tasks require held-out data records that were not used to fit the model. Before model development, the data is therefore divided into parts, such as training, validation, and test sets. The training set is used to build the model. The validation set is used to tune the hyperparameters. The test set is used to assess the performance of the model that was trained on the training set. The following subsections look at these strategies for the train-test split:

Holdout method
K-fold cross-validation
Bootstrap method
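
The three roles described above (training, validation, and test) can be illustrated with a short sketch. It assumes scikit-learn's `train_test_split` and uses synthetic data; the 60/20/20 proportions, the random seeds, and the variable names are illustrative choices, not values taken from the book.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): 100 rows, 4 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0, 1] * 50)

# First set aside 20% of the rows as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Then split the remaining 80 rows 75/25 into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

# Report how many rows ended up in each part.
print(len(X_train), len(X_val), len(X_test))
```

Passing `stratify` keeps the class proportions roughly equal across the parts, which is usually desirable for classification data.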

Holdout

In this method, the dataset is divided randomly into two parts: a training set and a testing set. A common ratio is 2:1, which means 2/3 of the records for training and 1/3 for testing. Other ratios, such as 6:4, 7:3, and 8:2, can also be used:

# partition data into training...
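
As a minimal sketch of a 2:1 holdout split, the following example assumes scikit-learn and its bundled Iris dataset. The dataset, the decision tree classifier, and the `random_state` values are assumptions chosen for illustration and are not necessarily what the book's own listing uses.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small example dataset (150 rows, 3 balanced classes).
X, y = load_iris(return_X_y=True)

# test_size=1/3 gives the 2:1 ratio: 2/3 for training, 1/3 for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=1, stratify=y
)

# Fit on the training part only, then score on the held-out part.
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

print(X_train.shape[0], X_test.shape[0])
print(f"Holdout accuracy: {accuracy:.3f}")
```

Changing `test_size` to `0.4`, `0.3`, or `0.2` yields the 6:4, 7:3, and 8:2 ratios mentioned above.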
