Essential PySpark for Scalable Data Analytics

By: Sreeram Nudurupati

Overview of this book

Apache Spark is a unified data analytics engine designed to process huge volumes of data quickly and efficiently. PySpark is Apache Spark's Python language API, which offers Python developers an easy-to-use scalable data analytics framework. Essential PySpark for Scalable Data Analytics starts by exploring the distributed computing paradigm and provides a high-level overview of Apache Spark. You'll begin your analytics journey with the data engineering process, learning how to perform data ingestion, cleansing, and integration at scale. This book helps you build real-time analytics pipelines that help you gain insights faster. You'll then discover methods for building cloud-based data lakes, and explore Delta Lake, which brings reliability to data lakes. The book also covers Data Lakehouse, an emerging paradigm, which combines the structure and performance of a data warehouse with the scalability of cloud-based data lakes. Later, you'll perform scalable data science and machine learning tasks using PySpark, such as data preparation, feature engineering, and model training and productionization. Finally, you'll learn ways to scale out standard Python ML libraries along with a new pandas API on top of PySpark called Koalas. By the end of this PySpark book, you'll be able to harness the power of PySpark to solve business problems.
Table of Contents (19 chapters)

Section 1: Data Engineering
Section 2: Data Science
Section 3: Data Analysis

What this book covers

Chapter 1, Distributed Computing Primer, introduces the distributed computing paradigm. It also discusses how distributed computing became a necessity with the ever-increasing data sizes of the last decade, presents the in-memory, data-parallel processing concept behind the MapReduce paradigm, and finally introduces the latest features of the Apache Spark 3.0 engine.
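
As a flavor of what in-memory, data-parallel processing looks like in PySpark, here is a minimal word-count sketch in the MapReduce style; the input path is hypothetical and not taken from the book:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("distributed-computing-primer").getOrCreate()

    lines = spark.sparkContext.textFile("/tmp/sample_text.txt")  # hypothetical input path
    word_counts = (
        lines.flatMap(lambda line: line.split())    # map: split each line into words
             .map(lambda word: (word, 1))           # map: emit (word, 1) pairs
             .reduceByKey(lambda a, b: a + b)       # reduce: sum the counts per word
    )
    print(word_counts.take(10))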

Chapter 2, Data Ingestion, covers various data sources, such as databases, data lakes, and message queues, and how to ingest data from them. You will also learn about the uses, differences, and relative efficiency of various data storage formats for storing and processing data.
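
The following is a minimal, illustrative sketch of ingesting a CSV source and persisting it in a more efficient columnar format; the paths and options are assumptions, not the book's own examples:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("data-ingestion").getOrCreate()

    # Ingest a CSV file from a hypothetical source path.
    raw_df = (spark.read
                   .option("header", "true")
                   .option("inferSchema", "true")
                   .csv("/data/raw/transactions.csv"))

    # Persist it as Parquet, a columnar format that is generally more efficient
    # for analytical workloads than row-oriented formats such as CSV.
    raw_df.write.mode("overwrite").parquet("/data/bronze/transactions")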

Chapter 3, Data Cleansing and Integration, discusses various data cleansing techniques, how to handle bad incoming data, data reliability challenges and how to cope with them, and data integration techniques to build a single integrated view of the data.
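
A minimal sketch of what cleansing and integration can look like in PySpark is shown below; the table names, columns, and paths are purely illustrative:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("data-cleansing").getOrCreate()

    # Hypothetical bronze-layer tables.
    customers = spark.read.parquet("/data/bronze/customers")
    transactions = spark.read.parquet("/data/bronze/transactions")

    # Basic cleansing: drop duplicates, fill nulls, and filter out bad records.
    clean_tx = (transactions
                .dropDuplicates(["transaction_id"])
                .na.fill({"amount": 0.0})
                .filter(F.col("amount") >= 0))

    # Integration: join to build a single enriched view of the data.
    integrated = clean_tx.join(customers, on="customer_id", how="left")
    integrated.write.mode("overwrite").parquet("/data/silver/transactions_enriched")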

Chapter 4, Real-time Data Analytics, explains how to perform real-time data ingestion and processing, discusses the unique challenges that real-time data integration presents and how to overcome them, and describes the benefits it provides.
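
As a rough illustration, the following Structured Streaming sketch reads a Kafka topic and writes to the console; the broker address, topic name, and sink are assumptions, and the Spark Kafka connector package must be available on the cluster:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("realtime-analytics").getOrCreate()

    # Read a continuous stream of events from a hypothetical Kafka topic.
    events = (spark.readStream
                   .format("kafka")
                   .option("kafka.bootstrap.servers", "localhost:9092")
                   .option("subscribe", "events")
                   .load())

    # Decode the message payload and continuously write it to the console sink.
    query = (events.selectExpr("CAST(value AS STRING) AS payload")
                   .writeStream
                   .outputMode("append")
                   .format("console")
                   .start())
    query.awaitTermination()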

Chapter 5, Scalable Machine Learning with PySpark, briefly talks about the need to scale out machine learning and discusses various techniques available to achieve this, ranging from natively distributed machine learning algorithms to embarrassingly parallel processing to distributed hyperparameter search. It also provides an introduction to the PySpark MLlib library and an overview of its various distributed machine learning algorithms.
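
Here is a small, hedged sketch of training a distributed MLlib model inside a Pipeline; the dataset path, feature columns, and label column are illustrative:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("scalable-ml").getOrCreate()
    df = spark.read.parquet("/data/silver/training")   # hypothetical training data

    # Assemble raw columns into a feature vector and train a distributed model.
    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    model = Pipeline(stages=[assembler, lr]).fit(df)

    predictions = model.transform(df)
    predictions.select("label", "prediction").show(5)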

Chapter 6, Feature Engineering – Extraction, Transformation, and Selection, explores various techniques for converting raw data into features that are suitable to be consumed by machine learning models, including techniques for scaling and transforming features.
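
A brief, illustrative sketch of common feature transformations with pyspark.ml.feature follows; the column names and path are assumptions:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler

    spark = SparkSession.builder.appName("feature-engineering").getOrCreate()
    df = spark.read.parquet("/data/silver/training")   # hypothetical

    # Encode a categorical column, assemble a feature vector, and scale it.
    indexed = StringIndexer(inputCol="category", outputCol="category_idx").fit(df).transform(df)
    assembled = VectorAssembler(inputCols=["category_idx", "amount"],
                                outputCol="raw_features").transform(indexed)
    scaled = StandardScaler(inputCol="raw_features",
                            outputCol="features").fit(assembled).transform(assembled)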

Chapter 7, Supervised Machine Learning, explores supervised learning techniques for machine learning classification and regression problems, including linear regression, logistic regression, and gradient-boosted trees.
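
The following sketch trains and evaluates a gradient-boosted tree classifier; the input path and column names are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("supervised-learning").getOrCreate()
    df = spark.read.parquet("/data/silver/features")   # expects "features" and "label" columns

    train, test = df.randomSplit([0.8, 0.2], seed=42)

    # Train a gradient-boosted tree classifier and evaluate it on the held-out split.
    gbt = GBTClassifier(featuresCol="features", labelCol="label", maxIter=20)
    model = gbt.fit(train)
    auc = BinaryClassificationEvaluator(labelCol="label").evaluate(model.transform(test))
    print(f"Test AUC: {auc:.3f}")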

Chapter 8, Unsupervised Machine Learning, covers unsupervised learning techniques such as clustering, collaborative filtering, and dimensionality reduction to reduce the number of features prior to applying supervised learning.
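
A minimal illustrative sketch of clustering and dimensionality reduction with MLlib is shown below, assuming a prepared "features" vector column and a hypothetical input path:

    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import PCA

    spark = SparkSession.builder.appName("unsupervised-learning").getOrCreate()
    df = spark.read.parquet("/data/silver/features")   # expects a "features" vector column

    # Cluster the data into 5 groups with K-means.
    clusters = KMeans(featuresCol="features", k=5, seed=42).fit(df).transform(df)

    # Reduce the feature space to its top 3 principal components with PCA.
    reduced = PCA(inputCol="features", outputCol="pca_features", k=3).fit(df).transform(df)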

Chapter 9, Machine Learning Life Cycle Management, explains that it is not sufficient to just build and train models; in the real world, multiple versions of the same model are built, and different versions are suitable for different applications. Thus, it is necessary to track the various experiments, their hyperparameters, their metrics, and the version of the data they were trained on. It is also necessary to track and store the various models in a centrally accessible repository so that models can be easily productionized and shared; finally, mechanisms are needed to automate this repeatedly occurring process. This chapter introduces these techniques using an end-to-end, open source machine learning life cycle management library called MLflow.
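
As a taste of experiment tracking with MLflow, here is a small sketch that logs a hyperparameter, a metric, and the trained model; the data path and parameter values are illustrative:

    import mlflow
    import mlflow.spark
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("ml-lifecycle").getOrCreate()
    train = spark.read.parquet("/data/silver/features")   # hypothetical

    with mlflow.start_run():
        lr = LogisticRegression(featuresCol="features", labelCol="label", regParam=0.01)
        model = lr.fit(train)
        # Record the hyperparameter, a training metric, and the model artifact itself.
        mlflow.log_param("regParam", 0.01)
        mlflow.log_metric("train_auc", model.summary.areaUnderROC)
        mlflow.spark.log_model(model, "model")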

Chapter 10, Scaling Out Single-Node Machine Learning Using PySpark, explains that in Chapter 5, Scalable Machine Learning with PySpark, you learned how to use the power of Apache Spark's distributed computing framework to train and score machine learning models at scale. Spark's native machine learning library provides good coverage of the standard tasks that data scientists typically perform; however, there is a wide variety of functionality provided by standard single-node Python libraries that were not designed to work in a distributed manner. This chapter deals with techniques for horizontally scaling out standard Python data processing and machine learning libraries such as pandas, scikit-learn, and XGBoost. It covers scaling out typical data science tasks, such as exploratory data analysis, model training, and model inference, and finally covers a scalable Python library named Koalas that lets you effortlessly write PySpark code using familiar, easy-to-use pandas-like syntax.
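
Below is a brief sketch of the pandas-like syntax that Koalas offers on top of Spark; the path and column names are assumptions (in Spark 3.2 and later, the equivalent API is available as pyspark.pandas):

    # Koalas provides a pandas-like API that executes on Spark DataFrames.
    import databricks.koalas as ks

    kdf = ks.read_parquet("/data/silver/transactions_enriched")   # hypothetical path
    summary = (kdf.groupby("customer_id")["amount"]
                  .sum()
                  .sort_values(ascending=False))
    print(summary.head(10))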

Chapter 11, Data Visualization with PySpark, covers data visualizations, which are an important aspect of conveying meaning from data and gleaning insights from it. This chapter covers how the most popular Python visualization libraries can be used along with PySpark.
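
A minimal sketch of the usual pattern, aggregating with PySpark and then plotting the small result with matplotlib on the driver, is shown below; the paths and columns are illustrative:

    import matplotlib.pyplot as plt
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("data-visualization").getOrCreate()
    df = spark.read.parquet("/data/silver/transactions_enriched")   # hypothetical

    # Aggregate in Spark, then collect the small result to the driver as pandas.
    daily = (df.groupBy(F.to_date("event_time").alias("day"))
               .agg(F.sum("amount").alias("total"))
               .orderBy("day")
               .toPandas())

    daily.plot(x="day", y="total", kind="line")
    plt.show()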

Chapter 12, Spark SQL Primer, covers SQL, which is an expressive language for ad hoc querying and data analysis. This chapter introduces Spark SQL for data analysis and also shows how to use Spark SQL and PySpark interchangeably.
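
Here is a small sketch of using the DataFrame API and Spark SQL interchangeably by registering a temporary view; the table and column names are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-primer").getOrCreate()
    df = spark.read.parquet("/data/silver/transactions_enriched")   # hypothetical

    # Register the DataFrame as a temporary view and query it with SQL.
    df.createOrReplaceTempView("transactions")
    top_customers = spark.sql("""
        SELECT customer_id, SUM(amount) AS total_spend
        FROM transactions
        GROUP BY customer_id
        ORDER BY total_spend DESC
        LIMIT 10
    """)
    top_customers.show()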

Chapter 13, Integrating External Tools with Spark SQL, explains that once we have clean, curated, and reliable data in our performant data lake, it would be a missed opportunity not to democratize this data across the organization for citizen analysts. The most popular way of doing this is via existing Business Intelligence (BI) tools. This chapter deals with the requirements for BI tool integration.

Chapter 14, The Data Lakehouse, explains that traditional descriptive analytics tools, such as BI tools, are designed around data warehouses and expect data to be presented in a certain way, whereas modern advanced analytics and data science tools are geared toward working with large amounts of data that is easily accessible in data lakes. It is also not practical or cost-effective to store redundant data in separate storage locations to cater to these individual use cases. This chapter presents a new paradigm, called Data Lakehouse, that tries to overcome the limitations of data warehouses and data lakes and bridge the gap by combining the best elements of both.
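
As a rough illustration of the lakehouse pattern, the following sketch writes a Delta Lake table that both SQL/BI-style queries and data science workloads can share; the paths are hypothetical and the delta-spark package is assumed to be configured on the cluster:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("data-lakehouse").getOrCreate()
    df = spark.read.parquet("/data/silver/transactions_enriched")   # hypothetical

    # Write the data as a Delta table, adding ACID transactions and schema
    # enforcement on top of inexpensive data lake storage.
    df.write.format("delta").mode("overwrite").save("/data/gold/transactions_delta")

    # The same table can serve both SQL queries and data science workloads.
    delta_df = spark.read.format("delta").load("/data/gold/transactions_delta")
    delta_df.createOrReplaceTempView("transactions_gold")
    spark.sql("SELECT COUNT(*) AS row_count FROM transactions_gold").show()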