Book Image

Essential PySpark for Scalable Data Analytics

By : Sreeram Nudurupati
Book Image

Essential PySpark for Scalable Data Analytics

By: Sreeram Nudurupati

Overview of this book

Apache Spark is a unified data analytics engine designed to process huge volumes of data quickly and efficiently. PySpark is Apache Spark's Python language API, which offers Python developers an easy-to-use scalable data analytics framework. Essential PySpark for Scalable Data Analytics starts by exploring the distributed computing paradigm and provides a high-level overview of Apache Spark. You'll begin your analytics journey with the data engineering process, learning how to perform data ingestion, cleansing, and integration at scale. This book helps you build real-time analytics pipelines that help you gain insights faster. You'll then discover methods for building cloud-based data lakes, and explore Delta Lake, which brings reliability to data lakes. The book also covers Data Lakehouse, an emerging paradigm, which combines the structure and performance of a data warehouse with the scalability of cloud-based data lakes. Later, you'll perform scalable data science and machine learning tasks using PySpark, such as data preparation, feature engineering, and model training and productionization. Finally, you'll learn ways to scale out standard Python ML libraries along with a new pandas API on top of PySpark called Koalas. By the end of this PySpark book, you'll be able to harness the power of PySpark to solve business problems.
Table of Contents (19 chapters)
1
Section 1: Data Engineering
6
Section 2: Data Science
13
Section 3: Data Analysis

Distributed Computing with Apache Spark

Over the last decade, Apache Spark has grown to be the de facto standard for big data processing. Indeed, it is an indispensable tool in the hands of anyone involved with data analytics.

Here, we will begin with the basics of Apache Spark, including its architecture and components. Then, we will get started with the PySpark programming API to actually implement the previously illustrated word count problem. Finally, we will take a look at what's new with the latest 3.0 release of Apache Spark.

Introduction to Apache Spark

Apache Spark is an in-memory, unified data analytics engine that is relatively fast compared to other distributed data processing frameworks.

It is a unified data analytics framework because it can process different types of big data workloads with a single engine. The different workloads include the following

  • Batch data processing
  • Real-time data processing
  • Machine learning and data science

Typically, data analytics involves all or a combination of the previously mentioned workloads to solve a single business problem. Before Apache Spark, there was no single framework that could accommodate all three workloads simultaneously. With Apache Spark, various teams involved in data analytics can all use a single framework to solve a single business problem, thus improving communication and collaboration among teams and drastically reducing their learning curve.

We will explore each of the preceding workloads, in depth, in Chapter 2, Data Ingestion through to Chapter 8, Unsupervised Machine Learning, of this book.

Further, Apache Spark is fast in two aspects:

  • It is fast in terms of data processing speed.
  • It is fast in terms of development speed.

Apache Spark has fast job/query execution speeds because it does all of its data processing in memory, and it has other optimizations techniques built-in such as Lazy Evaluation, Predicate Pushdown, and Partition Pruning to name a few. We will go over Spark's optimization techniques in the coming chapters.

Secondly, Apache Spark provides developers with very high-level APIs to perform basic data processing operations such as filtering, grouping, sorting, joining, and aggregating. By using these high-level programming constructs, developers can very easily express their data processing logic, making their development many times faster.

The core abstraction of Apache Spark, which makes it fast and very expressive for data analytics, is called an RDD. We will cover this in the next section.

Data Parallel Processing with RDDs

An RDD is the core abstraction of the Apache Spark framework. Think of an RDD as any kind of immutable data structure that is typically found in a programming language but one that resides in the memory of several machines instead of just one. An RDD consists of partitions, which are logical divisions of an RDD, with a few of them residing on each machine.

The following diagram helps explain the concepts of an RDD and its partitions:

Figure 1.2 – An RDD

Figure 1.2 – An RDD

In the previous diagram, we have a cluster of three machines or nodes. There are three RDDs on the cluster, and each RDD is divided into partitions. Each node of the cluster contains a few partitions of an individual RDD, and each RDD is distributed among several nodes of the cluster by means of partitions.

The RDD abstractions are accompanied by a set of high-level functions that can operate on the RRDs in order to manipulate the data stored within the partitions. These functions are called higher-order functions, and you will learn about them in the following section.

Higher-order functions

Higher-order functions manipulate RDDs and help us write business logic to transform data stored within the partitions. Higher-order functions accept other functions as parameters, and these inner functions help us define the actual business logic that transforms data and is applied to each partition of the RDD in parallel. These inner functions passed as parameters to the higher-order functions are called lambda functions or lambdas.

Apache Spark comes with several higher-order functions such as map, flatMap, reduce, fold, filter, reduceByKey, join, and union to name a few. These functions are high-level functions and help us express our data manipulation logic very easily.

For example, consider our previously illustrated word count example. Let's say you wanted to read a text file as an RDD and split each word based on a delimiter such as a whitespace. This is what code expressed in terms of an RDD and higher-order function would look like:

lines = sc.textFile("/databricks-datasets/README.md")
words = lines.flatMap(lambda s: s.split(" "))
word_tuples = words.map(lambda s: (s, 1))

In the previous code snippet, the following occurs:

  1. We are loading a text file using the built-in sc.textFile() method, which loads all text files at the specified location into the cluster memory, splits them into individual lines, and returns an RDD of lines or strings.
  2. We then apply the flatMap() higher-order function to the new RDD of lines and supply it with a function that instructs it to take each line and split it based on a white space. The lambda function that we pass to flatMap() is simply an anonymous function that takes one parameter, an individual line of StringType, and returns a list of words. Through the flatMap() and lambda() functions, we are able to transform an RDD of lines into an RDD of words.
  3. Finally, we use the map() function to assign a count of 1 to every individual word. This is pretty easy and definitely more intuitive compared to developing a MapReduce application using the Java programming language.

To summarize what you have learned, the primary construct of the Apache Spark framework is an RDD. An RDD consists of partitions distributed across individual nodes of a cluster. We use special functions called higher-order functions to operate on the RDDs and transform the RDDs according to our business logic. This business logic is passed along to the Worker Nodes via higher-order functions in the form of lambdas or anonymous functions.

Before we dig deeper into the inner workings of higher-order functions and lambda functions, we need to understand the architecture of the Apache Spark framework and the components of a typical Spark Cluster. We will do this in the following section.

Note

The Resilient part of an RDD comes from the fact that every RDD knows its lineage. At any given point in time, an RDD has information of all the individual operations performed on it, going back all the way to the data source itself. Thus, if any Executors are lost due to any failures and one or more of its partitions are lost, it can easily recreate those partitions from the source data making use of the lineage information, thus making it Resilient to failures.

Apache Spark cluster architecture

A typical Apache Spark cluster consists of three major components, namely, the Driver, a few Executors, and the Cluster Manager:

Figure 1.3 – Apache Spark Cluster components

Figure 1.3 – Apache Spark Cluster components

Let's examine each of these components a little closer.

Driver – the heart of a Spark application

The Spark Driver is a Java Virtual Machine process and is the core part of a Spark application. It is responsible for user application code declarations, along with the creation of RDDs, DataFrames, and datasets. It is also responsible for coordinating with and running code on the Executors and creating and scheduling tasks on the Executors. It is even responsible for relaunching Executors after a failure and finally returning any data requested back to the client or the user. Think of a Spark Driver as the main() program of any Spark application.

Important note

The Driver is the single point of failure for a Spark cluster, and the entire Spark application fails if the driver fails; therefore, different Cluster Managers implement different strategies to make the Driver highly available.

Executors – the actual workers

Spark Executors are also Java Virtual Machine processes, and they are responsible for running operations on RDDs that actually transform data. They can cache data partitions locally and return the processed data back to the Driver or write to persistent storage. Each Executor runs operations on a set of partitions of an RDD in parallel.

Cluster Manager – coordinates and manages cluster resources

The Cluster Manager is a process that runs centrally on the cluster and is responsible for providing resources requested by the Driver. It also monitors the Executors regarding task progress and their status. Apache Spark comes with its own Cluster Manager, which is referred to as the Standalone Cluster Manager, but it also supports other popular Cluster Managers such as YARN or Mesos. Throughout this book, we will be using Spark's built-in Standalone Cluster Manager.

Getting started with Spark

So far, we have learnt about Apache Spark's core data structure, called RDD, the functions used to manipulate RDDs, called higher-order functions, and the components of an Apache Spark cluster. You have also seen a few code snippets on how to use higher-order functions.

In this section, you will put your knowledge to practical use and write your very first Apache Spark program, where you will use Spark's Python API called PySpark to create a word count application. However, first, we need a few things to get started:

  • An Apache Spark cluster
  • Datasets
  • Actual code for the word count application

We will use the free Community Edition of Databricks to create our Spark cluster. The code used can be found via the GitHub link that was mentioned at the beginning of this chapter. The links for the required resources can be found in the Technical requirements section toward the beginning of the chapter.

Note

Although we are using Databricks Spark Clusters in this book, the provided code can be executed on any Spark cluster running Spark 3.0, or higher, as long as data is provided at a location accessible by your Spark cluster.

Now that you have gained an understanding of Spark's core concepts such as RDDs, higher-order functions, lambdas, and Spark's architecture, let's implement your very first Spark application using the following code:

lines = sc.textFile("/databricks-datasets/README.md")
words = lines.flatMap(lambda s: s.split(" "))
word_tuples = words.map(lambda s: (s, 1))
word_count = word_tuples.reduceByKey(lambda x, y:  x + y)
word_count.take(10)
word_count.saveAsTextFile("/tmp/wordcount.txt")

In the previous code snippet, the following takes place:

  1. We load a text file using the built-in sc.textFile() method, which reads all of the text files at the specified location, splits them into individual lines, and returns an RDD of lines or strings.
  2. Then, we apply the flatMap() higher-order function to the RDD of lines and supply it with a function that instructs it to take each line and split it based on a white space. The lambda function that we pass to flatMap() is simply an anonymous function that takes one parameter, a line, and returns individual words as a list. By means of the flatMap() and lambda() functions, we are able to transform an RDD of lines into an RDD of words.
  3. Then, we use the map() function to assign a count of 1 to every individual word.
  4. Finally, we use the reduceByKey() higher-order function to sum up the count of similar words occurring multiple times.
  5. Once the counts have been calculated, we make use of the take() function to display a sample of the final word counts.
  6. Although displaying a sample result set is usually helpful in determining the correctness of our code, in a big data setting, it is not practical to display all the results on to the console. So, we make use of the saveAsTextFile() function to persist our finals results in persistent storage.

    Important note

    It is not recommended that you display the entire result set onto the console using commands such as take() or collect(). It could even be outright dangerous to try and display all the data in a big data setting, as it could try to bring way too much data back to the driver and cause the driver to fail with an OutOfMemoryError, which, in turn, causes the entire application to fail.

    Therefore, it is recommended that you use take() with a very small result set, and use collect() only when you are confident that the amount of data returned is, indeed, very small.

Let's dive deeper into the following line of code in order to understand the inner workings of lambdas and how they implement Data Parallel Processing along with higher-order functions:

words = lines.flatMap(lambda s: s.split(" "))

In the previous code snippet, the flatMmap() higher-order function bundles the code present in the lambda and sends it over a network to the Worker Nodes, using a process called serialization. This serialized lambda is then sent out to every executor, and each executor, in turn, applies this lambda to individual RDD partitions in parallel.

Important note

Since higher-order functions need to be able to serialize the lambdas in order to send your code to the Executors. The lambda functions need to be serializable, and failing this, you might encounter a Task not serializable error.

In summary, higher-order functions are, essentially, transferring your data transformation code in the form of serialized lambdas to your data in RDD partitions. Therefore, instead of moving data to where the code is, we are actually moving our code to where data is situated, which is the exact definition of Data Parallel Processing, as we learned earlier in this chapter.

Thus, Apache Spark along with its RDDs and higher-order functions implements an in-memory version of the Data Parallel Processing paradigm. This makes Apache Spark fast and efficient at big data processing in a Distributed Computing setting.

The RDD abstraction of Apache Spark definitely offers a higher level of programming API compared to MapReduce, but it still requires some level of comprehension of the functional programming style to be able to express even the most common types of data transformations. To overcome this challenge, Spark's already existing SQL engine was expanded, and another abstraction, called the DataFrame, was added on top of RDDs. This makes data processing much easier and more familiar for data scientists and data analysts. The following section will explore the DataFrame and SQL API of the Spark SQL engine.