Essential PySpark for Scalable Data Analytics

By: Sreeram Nudurupati

Overview of this book

Apache Spark is a unified data analytics engine designed to process huge volumes of data quickly and efficiently. PySpark is Apache Spark's Python language API, which offers Python developers an easy-to-use, scalable data analytics framework. Essential PySpark for Scalable Data Analytics starts by exploring the distributed computing paradigm and provides a high-level overview of Apache Spark. You'll begin your analytics journey with the data engineering process, learning how to perform data ingestion, cleansing, and integration at scale. The book then shows you how to build real-time analytics pipelines so that you can gain insights faster. You'll discover methods for building cloud-based data lakes and explore Delta Lake, which brings reliability to data lakes. The book also covers the Data Lakehouse, an emerging paradigm that combines the structure and performance of a data warehouse with the scalability of cloud-based data lakes. Later, you'll perform scalable data science and machine learning tasks using PySpark, such as data preparation, feature engineering, and model training and productionization. Finally, you'll learn ways to scale out standard Python ML libraries, along with Koalas, a new pandas API on top of PySpark. By the end of this book, you'll be able to harness the power of PySpark to solve business problems.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Share your thoughts

Section 1: Data Engineering

Chapter 1: Distributed Computing Primer

Technical requirements

Distributed Computing

Distributed Computing with Apache Spark

Big data processing with Spark SQL and DataFrames

Summary

Chapter 2: Data Ingestion

Technical requirements

Introduction to Enterprise Decision Support Systems

Ingesting data from data sources

Ingesting data into data sinks

Using file formats for data storage in data lakes

Building data ingestion pipelines in batch and real time

Unifying batch and real time using Lambda Architecture

Summary

Chapter 3: Data Cleansing and Integration

Technical requirements

Transforming raw data into enriched meaningful data

Building analytical data stores using cloud data lakes

Consolidating data using data integration

Making raw data analytics-ready using data cleansing

Summary

Chapter 4: Real-Time Data Analytics

Technical requirements

Real-time analytics systems architecture

Stream processing engines

Real-time analytics industry use cases

Simplifying the Lambda Architecture using Delta Lake

Change Data Capture

Handling late-arriving data

Multi-hop pipelines

Summary

Section 2: Data Science

Chapter 5: Scalable Machine Learning with PySpark

Technical requirements

ML overview

Scaling out machine learning

Data wrangling with Apache Spark and MLlib

Summary

Chapter 6: Feature Engineering – Extraction, Transformation, and Selection

Technical requirements

The machine learning process

Feature extraction

Feature transformation

Feature selection

Feature store as a central feature repository

Delta Lake as an offline feature store

Summary

Chapter 7: Supervised Machine Learning

Technical requirements

Introduction to supervised machine learning

Regression

Classification

Tree ensembles

Real-world supervised learning applications

Summary

Chapter 8: Unsupervised Machine Learning

Technical requirements

Introduction to unsupervised machine learning

Clustering using machine learning

Building association rules using machine learning

Real-world applications of unsupervised learning

Summary

Chapter 9: Machine Learning Life Cycle Management

Technical requirements

Introduction to the ML life cycle

Tracking experiments with MLflow

Tracking model versions using MLflow Model Registry

Model serving and inferencing

Continuous delivery for ML

Summary

Chapter 10: Scaling Out Single-Node Machine Learning Using PySpark

Technical requirements

Scaling out EDA

Scaling out model inferencing

Model training using embarrassingly parallel computing

Upgrading pandas to PySpark using Koalas

Summary

Section 3: Data Analysis

Chapter 11: Data Visualization with PySpark

Technical requirements

Importance of data visualization

Techniques for visualizing data using PySpark

Considerations for PySpark to pandas conversion

Summary

Chapter 12: Spark SQL Primer

Technical requirements

Introduction to SQL

Introduction to Spark SQL

Spark SQL language reference

Optimizing Spark SQL performance

Summary

Chapter 13: Integrating External Tools with Spark SQL

Technical requirements

Apache Spark as a distributed SQL engine

Spark connectivity to SQL analysis tools

Spark connectivity to BI tools

Connecting Python applications to Spark SQL using Pyodbc

Summary

Chapter 14: The Data Lakehouse

Moving from BI to AI

The data lakehouse paradigm

Advantages of data lakehouses

Summary

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Share your thoughts

Scaling out machine learning

In the previous sections, we learned that ML is a set of algorithms that, instead of being explicitly programmed, automatically learn patterns hidden within data. An ML algorithm exposed to a larger dataset can therefore potentially produce a better-performing model. However, traditional ML algorithms were designed to be trained on a limited data sample and on a single machine at a time, which means that the existing ML libraries are not inherently scalable. One workaround is to down-sample a larger dataset so that it fits in the memory of a single machine, but the resulting models may then be less accurate than they could be.
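As a rough illustration of the down-sampling workaround mentioned above, the sketch below draws a fixed-size uniform sample from a stream of rows without ever holding the full dataset in memory. It uses reservoir sampling, a technique the text does not name, and the function name `reservoir_sample`, the sample size, and the synthetic row stream are all assumptions made for this example.

```python
import random


def reservoir_sample(rows, k, seed=0):
    """Return k rows chosen uniformly at random from an iterable of unknown size.

    Hypothetical helper for illustration only: only k rows are kept in memory,
    so the full dataset never has to fit on the machine.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1),
            # replacing a randomly chosen row already in the reservoir.
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                sample[j] = row
    return sample


# Stand-in for a dataset too large for one machine: a lazy stream of row ids.
big_dataset = range(100_000)
sample = reservoir_sample(big_dataset, k=1000, seed=42)
print(len(sample))
```

A model trained on `sample` sees only 1% of the rows here, which is the trade-off the text describes: the data fits on one machine, but whatever patterns live in the discarded rows are unavailable to the algorithm.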

Also, several ML models are typically built on the same dataset, varying only the parameters supplied to the algorithm. The best of these models is then chosen for production, a technique called hyperparameter tuning. Building several models using a single machine,...
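The idea of building several models on the same dataset and keeping the best one can be sketched in plain Python as follows. This is a minimal, single-machine illustration, not the book's code: the one-parameter ridge model, the synthetic data, the helper names `fit_ridge` and `mse`, and the candidate values in `param_grid` are all assumptions chosen to keep the example self-contained.

```python
import random


def fit_ridge(xs, ys, alpha):
    """Closed-form fit of y ~ w * x with an L2 penalty `alpha` on w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic dataset (an assumption): y = 3x plus Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 * x + rng.gauss(0, 1) for x in xs]

# Hold out part of the data to compare the candidate models fairly.
train_x, val_x = xs[:150], xs[150:]
train_y, val_y = ys[:150], ys[150:]

# One model per hyperparameter value, all trained on the same dataset.
param_grid = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]
results = {}
for alpha in param_grid:
    w = fit_ridge(train_x, train_y, alpha)
    results[alpha] = mse(w, val_x, val_y)

# Keep the hyperparameter whose model scores best on the held-out data.
best_alpha = min(results, key=results.get)
print(best_alpha, results[best_alpha])
```

Each iteration of the loop is independent of the others, so the cost on one machine grows with the number of candidate values, while the candidates themselves could in principle be trained separately.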