Essential PySpark for Scalable Data Analytics

By: Sreeram Nudurupati

Overview of this book

Apache Spark is a unified data analytics engine designed to process huge volumes of data quickly and efficiently. PySpark is Apache Spark's Python language API, which offers Python developers an easy-to-use scalable data analytics framework. Essential PySpark for Scalable Data Analytics starts by exploring the distributed computing paradigm and provides a high-level overview of Apache Spark. You'll begin your analytics journey with the data engineering process, learning how to perform data ingestion, cleansing, and integration at scale. This book helps you build real-time analytics pipelines that help you gain insights faster. You'll then discover methods for building cloud-based data lakes, and explore Delta Lake, which brings reliability to data lakes. The book also covers Data Lakehouse, an emerging paradigm, which combines the structure and performance of a data warehouse with the scalability of cloud-based data lakes. Later, you'll perform scalable data science and machine learning tasks using PySpark, such as data preparation, feature engineering, and model training and productionization. Finally, you'll learn ways to scale out standard Python ML libraries along with a new pandas API on top of PySpark called Koalas. By the end of this PySpark book, you'll be able to harness the power of PySpark to solve business problems.
Table of Contents (19 chapters)

Section 1: Data Engineering
Section 2: Data Science
Section 3: Data Analysis

What this book covers

Chapter 1, Distributed Computing Primer, introduces the distributed computing paradigm. It also discusses how distributed computing became a necessity with the ever-increasing data sizes of the last decade, presents the in-memory, data-parallel processing concept behind the MapReduce paradigm, and finally introduces the latest features of the Apache Spark 3.0 engine.
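
As a flavor of what in-memory, data-parallel processing looks like in PySpark, here is a minimal word-count sketch in the MapReduce style; the input path is hypothetical and not taken from the book:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("distributed-computing-primer").getOrCreate()

    lines = spark.sparkContext.textFile("/tmp/sample_text.txt")  # hypothetical input path
    word_counts = (
        lines.flatMap(lambda line: line.split())    # map: split each line into words
             .map(lambda word: (word, 1))           # map: emit (word, 1) pairs
             .reduceByKey(lambda a, b: a + b)       # reduce: sum the counts per word
    )
    print(word_counts.take(10))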

Chapter 2, Data Ingestion, covers various data sources, such as databases, data lakes, and message queues, and how to ingest data from them. You will also learn about the uses, differences, and relative efficiency of various data storage formats for storing and processing data.
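
The following is a minimal, illustrative sketch of ingesting a CSV source and persisting it in a more efficient columnar format; the paths and options are assumptions, not the book's own examples:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("data-ingestion").getOrCreate()

    # Ingest a CSV file from a hypothetical source path.
    raw_df = (spark.read
                   .option("header", "true")
                   .option("inferSchema", "true")
                   .csv("/data/raw/transactions.csv"))

    # Persist it as Parquet, a columnar format that is generally more efficient
    # for analytical workloads than row-oriented formats such as CSV.
    raw_df.write.mode("overwrite").parquet("/data/bronze/transactions")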

Chapter 3, Data Cleansing and Integration, discusses various data cleansing techniques, how to handle bad incoming data, data reliability challenges and how to cope with them, and data integration techniques to build a single integrated view of the data.
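
A minimal sketch of what cleansing and integration can look like in PySpark is shown below; the table names, columns, and paths are purely illustrative:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("data-cleansing").getOrCreate()

    # Hypothetical bronze-layer tables.
    customers = spark.read.parquet("/data/bronze/customers")
    transactions = spark.read.parquet("/data/bronze/transactions")

    # Basic cleansing: drop duplicates, fill nulls, and filter out bad records.
    clean_tx = (transactions
                .dropDuplicates(["transaction_id"])
                .na.fill({"amount": 0.0})
                .filter(F.col("amount") >= 0))

    # Integration: join to build a single enriched view of the data.
    integrated = clean_tx.join(customers, on="customer_id", how="left")
    integrated.write.mode("overwrite").parquet("/data/silver/transactions_enriched")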

Chapter 4, Real-time Data Analytics, explains how to perform real-time data ingestion and processing, discusses the unique challenges that real-time data integration presents and how to overcome them, and describes the benefits it provides.
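
As a rough illustration, the following Structured Streaming sketch reads a Kafka topic and writes to the console; the broker address, topic name, and sink are assumptions, and the Spark Kafka connector package must be available on the cluster:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("realtime-analytics").getOrCreate()

    # Read a continuous stream of events from a hypothetical Kafka topic.
    events = (spark.readStream
                   .format("kafka")
                   .option("kafka.bootstrap.servers", "localhost:9092")
                   .option("subscribe", "events")
                   .load())

    # Decode the message payload and continuously write it to the console sink.
    query = (events.selectExpr("CAST(value AS STRING) AS payload")
                   .writeStream
                   .outputMode("append")
                   .format("console")
                   .start())
    query.awaitTermination()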

Chapter 5, Scalable Machine Learning with PySpark, briefly talks about the need to scale out machine learning and discusses various techniques available to achieve this, ranging from natively distributed machine learning algorithms to embarrassingly parallel processing to distributed hyperparameter search. It also provides an introduction to the PySpark MLlib library and an overview of its various distributed machine learning algorithms.
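
Here is a small, hedged sketch of training a distributed MLlib model inside a Pipeline; the dataset path, feature columns, and label column are illustrative:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("scalable-ml").getOrCreate()
    df = spark.read.parquet("/data/silver/training")   # hypothetical training data

    # Assemble raw columns into a feature vector and train a distributed model.
    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    model = Pipeline(stages=[assembler, lr]).fit(df)

    predictions = model.transform(df)
    predictions.select("label", "prediction").show(5)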

Chapter 6, Feature Engineering – Extraction, Transformation, and Selection, explores various techniques for converting raw data into features that are suitable to be consumed by machine learning models, including techniques for scaling and transforming features.
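
A brief, illustrative sketch of common feature transformations with pyspark.ml.feature follows; the column names and path are assumptions:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler

    spark = SparkSession.builder.appName("feature-engineering").getOrCreate()
    df = spark.read.parquet("/data/silver/training")   # hypothetical

    # Encode a categorical column, assemble a feature vector, and scale it.
    indexed = StringIndexer(inputCol="category", outputCol="category_idx").fit(df).transform(df)
    assembled = VectorAssembler(inputCols=["category_idx", "amount"],
                                outputCol="raw_features").transform(indexed)
    scaled = StandardScaler(inputCol="raw_features",
                            outputCol="features").fit(assembled).transform(assembled)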

Chapter 7, Supervised Machine Learning, explores supervised learning techniques for machine learning classification and regression problems, including linear regression, logistic regression, and gradient-boosted trees.
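
The following sketch trains and evaluates a gradient-boosted tree classifier; the input path and column names are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("supervised-learning").getOrCreate()
    df = spark.read.parquet("/data/silver/features")   # expects "features" and "label" columns

    train, test = df.randomSplit([0.8, 0.2], seed=42)

    # Train a gradient-boosted tree classifier and evaluate it on the held-out split.
    gbt = GBTClassifier(featuresCol="features", labelCol="label", maxIter=20)
    model = gbt.fit(train)
    auc = BinaryClassificationEvaluator(labelCol="label").evaluate(model.transform(test))
    print(f"Test AUC: {auc:.3f}")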

Chapter 8, Unsupervised Machine Learning, covers unsupervised learning techniques such as clustering, collaborative filtering, and dimensionality reduction to reduce the number of features prior to applying supervised learning.
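
A minimal illustrative sketch of clustering and dimensionality reduction with MLlib is shown below, assuming a prepared "features" vector column and a hypothetical input path:

    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import PCA

    spark = SparkSession.builder.appName("unsupervised-learning").getOrCreate()
    df = spark.read.parquet("/data/silver/features")   # expects a "features" vector column

    # Cluster the data into 5 groups with K-means.
    clusters = KMeans(featuresCol="features", k=5, seed=42).fit(df).transform(df)

    # Reduce the feature space to its top 3 principal components with PCA.
    reduced = PCA(inputCol="features", outputCol="pca_features", k=3).fit(df).transform(df)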

Chapter 9, Machine Learning Life Cycle Management, explains that it is not sufficient to just build and train models; in the real world, multiple versions of the same model are built, and different versions are suitable for different applications. Thus, it is necessary to track the various experiments, their hyperparameters, their metrics, and the version of the data they were trained on. It is also necessary to track and store the various models in a centrally accessible repository so that models can be easily productionized and shared; finally, mechanisms are needed to automate this repeatedly occurring process. This chapter introduces these techniques using an end-to-end, open source machine learning life cycle management library called MLflow.
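
As a taste of experiment tracking with MLflow, here is a small sketch that logs a hyperparameter, a metric, and the trained model; the data path and parameter values are illustrative:

    import mlflow
    import mlflow.spark
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("ml-lifecycle").getOrCreate()
    train = spark.read.parquet("/data/silver/features")   # hypothetical

    with mlflow.start_run():
        lr = LogisticRegression(featuresCol="features", labelCol="label", regParam=0.01)
        model = lr.fit(train)
        # Record the hyperparameter, a training metric, and the model artifact itself.
        mlflow.log_param("regParam", 0.01)
        mlflow.log_metric("train_auc", model.summary.areaUnderROC)
        mlflow.spark.log_model(model, "model")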

Chapter 10, Scaling Out Single-Node Machine Learning Using PySpark, explains that in Chapter 5, Scalable Machine Learning with PySpark, you learned how to use the power of Apache Spark's distributed computing framework to train and score machine learning models at scale. Spark's native machine learning library provides good coverage of the standard tasks that data scientists typically perform; however, there is a wide variety of functionality provided by standard single-node Python libraries that were not designed to work in a distributed manner. This chapter deals with techniques for horizontally scaling out standard Python data processing and machine learning libraries such as pandas, scikit-learn, and XGBoost. It covers scaling out typical data science tasks, such as exploratory data analysis, model training, and model inference, and finally covers a scalable Python library named Koalas that lets you effortlessly write PySpark code using familiar, easy-to-use pandas-like syntax.
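
Below is a brief sketch of the pandas-like syntax that Koalas offers on top of Spark; the path and column names are assumptions (in Spark 3.2 and later, the equivalent API is available as pyspark.pandas):

    # Koalas provides a pandas-like API that executes on Spark DataFrames.
    import databricks.koalas as ks

    kdf = ks.read_parquet("/data/silver/transactions_enriched")   # hypothetical path
    summary = (kdf.groupby("customer_id")["amount"]
                  .sum()
                  .sort_values(ascending=False))
    print(summary.head(10))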

Chapter 11, Data Visualization with PySpark, covers data visualizations, which are an important aspect of conveying meaning from data and gleaning insights from it. This chapter covers how the most popular Python visualization libraries can be used along with PySpark.
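
A minimal sketch of the usual pattern, aggregating with PySpark and then plotting the small result with matplotlib on the driver, is shown below; the paths and columns are illustrative:

    import matplotlib.pyplot as plt
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("data-visualization").getOrCreate()
    df = spark.read.parquet("/data/silver/transactions_enriched")   # hypothetical

    # Aggregate in Spark, then collect the small result to the driver as pandas.
    daily = (df.groupBy(F.to_date("event_time").alias("day"))
               .agg(F.sum("amount").alias("total"))
               .orderBy("day")
               .toPandas())

    daily.plot(x="day", y="total", kind="line")
    plt.show()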

Chapter 12, Spark SQL Primer, covers SQL, which is an expressive language for ad hoc querying and data analysis. This chapter introduces Spark SQL for data analysis and also shows how to use Spark SQL and PySpark interchangeably.
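
Here is a small sketch of using the DataFrame API and Spark SQL interchangeably by registering a temporary view; the table and column names are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-primer").getOrCreate()
    df = spark.read.parquet("/data/silver/transactions_enriched")   # hypothetical

    # Register the DataFrame as a temporary view and query it with SQL.
    df.createOrReplaceTempView("transactions")
    top_customers = spark.sql("""
        SELECT customer_id, SUM(amount) AS total_spend
        FROM transactions
        GROUP BY customer_id
        ORDER BY total_spend DESC
        LIMIT 10
    """)
    top_customers.show()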

Chapter 13, Integrating External Tools with Spark SQL, explains that once we have clean, curated, and reliable data in our performant data lake, it would be a missed opportunity not to democratize this data across the organization for citizen analysts. The most popular way of doing this is via existing Business Intelligence (BI) tools. This chapter deals with the requirements for BI tool integration.

Chapter 14, The Data Lakehouse, explains that traditional descriptive analytics tools, such as BI tools, are designed around data warehouses and expect data to be presented in a certain way, whereas modern advanced analytics and data science tools are geared toward working with large amounts of data that is easily accessible in data lakes. It is also not practical or cost-effective to store redundant data in separate storage locations to cater to these individual use cases. This chapter presents a new paradigm, called Data Lakehouse, that tries to overcome the limitations of data warehouses and data lakes and bridge the gap by combining the best elements of both.
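
As a rough illustration of the lakehouse pattern, the following sketch writes a Delta Lake table that both SQL/BI-style queries and data science workloads can share; the paths are hypothetical and the delta-spark package is assumed to be configured on the cluster:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("data-lakehouse").getOrCreate()
    df = spark.read.parquet("/data/silver/transactions_enriched")   # hypothetical

    # Write the data as a Delta table, adding ACID transactions and schema
    # enforcement on top of inexpensive data lake storage.
    df.write.format("delta").mode("overwrite").save("/data/gold/transactions_delta")

    # The same table can serve both SQL queries and data science workloads.
    delta_df = spark.read.format("delta").load("/data/gold/transactions_delta")
    delta_df.createOrReplaceTempView("transactions_gold")
    spark.sql("SELECT COUNT(*) AS row_count FROM transactions_gold").show()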