Essential PySpark for Scalable Data Analytics

By: Sreeram Nudurupati
Overview of this book

Apache Spark is a unified data analytics engine designed to process huge volumes of data quickly and efficiently. PySpark is Apache Spark's Python language API, which offers Python developers an easy-to-use scalable data analytics framework. Essential PySpark for Scalable Data Analytics starts by exploring the distributed computing paradigm and provides a high-level overview of Apache Spark. You'll begin your analytics journey with the data engineering process, learning how to perform data ingestion, cleansing, and integration at scale. This book helps you build real-time analytics pipelines that help you gain insights faster. You'll then discover methods for building cloud-based data lakes, and explore Delta Lake, which brings reliability to data lakes. The book also covers Data Lakehouse, an emerging paradigm, which combines the structure and performance of a data warehouse with the scalability of cloud-based data lakes. Later, you'll perform scalable data science and machine learning tasks using PySpark, such as data preparation, feature engineering, and model training and productionization. Finally, you'll learn ways to scale out standard Python ML libraries along with a new pandas API on top of PySpark called Koalas. By the end of this PySpark book, you'll be able to harness the power of PySpark to solve business problems.
Table of Contents (19 chapters)

Section 1: Data Engineering
Section 2: Data Science
Section 3: Data Analysis

Distributed Computing

In this section, you will learn about Distributed Computing, why it is needed, and how you can use it to process very large amounts of data quickly and efficiently.

Introduction to Distributed Computing

Distributed Computing is a class of computing techniques where we use a group of computers as a single unit to solve a computational problem instead of just using a single machine.

In data analytics, when the amount of data becomes too large to fit on a single machine, we can either split the data into smaller chunks and process them on a single machine iteratively, or we can process the chunks of data on several machines in parallel. While the former approach gets the job done, processing the entire dataset iteratively takes longer; the latter completes the job in a shorter period of time by using multiple machines at once.
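To make the contrast concrete, the following is a minimal single-machine sketch that splits a dataset into chunks and processes them in parallel using Python's multiprocessing module. The chunk size and the process_chunk logic are illustrative placeholders; in a real Distributed Computing system, the chunks would be spread across multiple machines rather than across the CPU cores of a single machine.

from multiprocessing import Pool

def process_chunk(chunk):
    # Stand-in for whatever business logic needs to be applied to a chunk of data.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunk_size = 100_000
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # Iterative: process one chunk at a time on a single worker.
    iterative_result = sum(process_chunk(chunk) for chunk in chunks)

    # Parallel: process the chunks simultaneously across a pool of workers.
    with Pool(processes=4) as pool:
        parallel_result = sum(pool.map(process_chunk, chunks))

    assert iterative_result == parallel_result

Both approaches produce the same answer; the parallel version simply spreads the per-chunk work across several workers at once.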

There are different kinds of Distributed Computing techniques; however, for data analytics, one popular technique is Data Parallel Processing.

Data Parallel Processing

Data Parallel Processing involves two main parts:

  • The actual data that needs to be processed
  • The piece of code or business logic that needs to be applied to the data in order to process it

We can process large amounts of data by splitting it into smaller chunks and processing them in parallel on several machines. This can be done in two ways:

  • First, bring the data to the machine where our code is running.
  • Second, take our code to where our data is actually stored.

One drawback of the first technique is that as our data sizes grow, the time it takes to move the data also grows proportionally. We therefore end up spending more time moving data from one system to another, which negates any efficiency gained from our parallel processing system. We also end up creating multiple redundant copies of the data as it is replicated across systems.

The second technique is far more efficient because, instead of moving large amounts of data, we move a few lines of code to where the data actually resides. This technique of moving code to where the data resides is referred to as Data Parallel Processing. It is fast and efficient because it saves the time that would otherwise be spent moving and copying data across different systems. One such Data Parallel Processing technique is called the MapReduce paradigm.

Data Parallel Processing using the MapReduce paradigm

The MapReduce paradigm breaks down a Data Parallel Processing problem into three main stages:

  • The Map stage
  • The Shuffle stage
  • The Reduce stage

The Map stage takes the input dataset, splits it into (key, value) pairs, applies some processing on the pairs, and transforms them into another set of (key, value) pairs.

The Shuffle stage takes the (key, value) pairs from the Map stage and shuffles/sorts them so that pairs with the same key end up together.

The Reduce stage takes the resultant (key, value) pairs from the Shuffle stage and reduces or aggregates the pairs to produce the final result.

There can be multiple Map stages followed by multiple Reduce stages. However, a Reduce stage only starts after all of the Map stages have been completed.

Let's take a look at an example where we want to calculate the counts of all the different words in a text document and apply the MapReduce paradigm to it.

The following diagram shows how the MapReduce paradigm works in general:

Figure 1.1 – Calculating the word count using MapReduce

The previous example works in the following manner:

  1. In Figure 1.1, we have a cluster of three nodes, labeled M1, M2, and M3. Each machine holds a few text files, each containing several sentences of plain text. Our goal is to use MapReduce to count all of the words in these text files.
  2. We load the text documents onto the cluster; each machine loads the documents that are local to it.
  3. The Map stage splits the text files into individual lines and further splits each line into individual words. It then assigns each word a count of 1 to create a (word, count) pair.
  4. The Shuffle stage takes the (word, count) pairs from the Map stage and shuffles/sorts them so that pairs with the same word end up together.
  5. The Reduce stage groups the pairs for each word and sums their counts to produce the final count of each individual word, as shown in the code sketch that follows.
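To make these stages concrete, here is a minimal single-machine Python sketch of the word count from Figure 1.1, with the Map, Shuffle, and Reduce stages written out explicitly. The input lines are hypothetical stand-ins for the text files stored on M1, M2, and M3.

from collections import defaultdict

# Hypothetical input: the lines of the text files on the cluster.
lines = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

# Map stage: split each line into words and emit (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle stage: group the pairs so that pairs with the same word end up together.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce stage: sum the counts for each word to produce the final result.
word_counts = {word: sum(counts) for word, counts in grouped.items()}

print(word_counts)  # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}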

The MapReduce paradigm was popularized by the Hadoop framework and was widely used for processing big data workloads. However, MapReduce offers a very low-level API for transforming data and requires users to be proficient in programming languages such as Java. Expressing a data analytics problem in terms of Map and Reduce operations is not very intuitive or flexible.

MapReduce was designed to run on commodity hardware, and since commodity hardware is prone to failures, resiliency to hardware failures was a necessity. MapReduce achieves this resiliency by saving the results of every stage to disk. This round trip to disk after every stage makes MapReduce relatively slow at processing data, owing to the generally slow I/O performance of physical disks. To overcome this limitation, the next generation of the MapReduce paradigm was created; it makes use of much faster system memory, as opposed to disks, to process data and offers much more flexible APIs for expressing data transformations. This new framework is called Apache Spark, and you will learn about it in the next section and throughout the remainder of this book.
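As a preview of what that flexibility looks like, the following is a minimal PySpark sketch of the same word count. The input path is a placeholder, and a working Spark installation is assumed; Spark itself is introduced properly in the next section.

from pyspark.sql import SparkSession

# Assumes a local Spark installation; the input path is a placeholder.
spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

word_counts = (
    sc.textFile("path/to/text/files")       # load the documents as an RDD of lines
      .flatMap(lambda line: line.split())   # Map: split each line into words
      .map(lambda word: (word, 1))          # Map: emit (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)      # Shuffle + Reduce: sum the counts per word
)

print(word_counts.collect())
spark.stop()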

Important note

In Distributed Computing, you will often encounter the term cluster. A cluster is a group of computers working together as a single unit to solve a computing problem. The primary machine of a cluster, typically termed the Master Node, takes care of the orchestration and management of the cluster, while the secondary machines that actually carry out task execution are called Worker Nodes. A cluster is a key component of any Distributed Computing system, and you will encounter these terms throughout this book.