Hands-On Big Data Analytics with PySpark

By: Rudy Lai, Bartłomiej Potaczek

Overview of this book

Apache Spark is an open source parallel-processing framework that is widely used for data analytics applications across clustered computers. In this book, you will not only learn how to use Spark and the Python API to create high-performance analytics with big data, but also discover techniques for testing, immunizing, and parallelizing Spark jobs. You will learn how to read data from popular sources and formats, including HDFS, Hive, JSON, and S3, and work with large datasets in PySpark to gain practical big data experience.

This book will help you build prototypes on a local machine and then go on to handle messy data in production and at scale. It covers installing and setting up PySpark, RDD operations, big data cleaning and wrangling, and aggregating and summarizing data into useful reports. You will also learn practical, proven techniques for improving programming and administration in Apache Spark.

By the end of the book, you will be able to build big data analytics solutions using the various PySpark offerings and optimize them effectively.

Integration testing using SparkSession

Let's now learn about integration testing using SparkSession.

In this section, we will cover the following topics:

  • Leveraging SparkSession for integration testing
  • Using a unit tested component

Here, we are creating the Spark engine. The following line is crucial for the integration test:

 import org.apache.spark.SparkContext
 import org.apache.spark.sql.SparkSession

 val spark: SparkContext = SparkSession.builder().master("local[2]").getOrCreate().sparkContext

This is not simply a line that creates a lightweight object. SparkSession is a heavyweight object, and constructing one from scratch is expensive in terms of both resources and time. Tests that create a SparkSession will therefore take noticeably longer than the unit tests from the previous section.
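Because of that cost, a common pattern is to build the SparkSession once per test suite and reuse it across tests rather than constructing it in every test. The trait and class names below are a hypothetical sketch (not from the book), assuming ScalaTest's FunSuite and BeforeAndAfterAll:

 import org.apache.spark.SparkContext
 import org.apache.spark.sql.SparkSession
 import org.scalatest.{BeforeAndAfterAll, FunSuite, Suite}

 // Hypothetical helper trait: builds the heavy SparkSession once,
 // shares it with every test in the suite, and stops it at the end.
 trait SharedSparkSession extends BeforeAndAfterAll { this: Suite =>
   @transient lazy val spark: SparkSession =
     SparkSession.builder().master("local[2]").getOrCreate()
   lazy val sc: SparkContext = spark.sparkContext

   override def afterAll(): Unit = {
     spark.stop()
     super.afterAll()
   }
 }

 // Example usage: the suite pays the start-up cost only once.
 class RddIntegrationTest extends FunSuite with SharedSparkSession {
   test("reduceByKey aggregates counts") {
     val counts = sc.parallelize(Seq("a", "b", "a"))
       .map(word => (word, 1))
       .reduceByKey(_ + _)
       .collectAsMap()
     assert(counts("a") === 2)
   }
 }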

For the same reason, we should rely on unit tests to cover all the edge cases, and use integration testing only for the smaller part of...
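To make that split concrete, here is a minimal sketch of the idea (the WordCounter object and its names are hypothetical, not taken from the book): the per-line logic lives in a pure function whose edge cases can be covered by cheap, Spark-free unit tests, leaving only the end-to-end wiring for one integration test against a real SparkContext:

 import org.apache.spark.SparkContext
 import org.apache.spark.sql.SparkSession
 import org.scalatest.FunSuite

 // Hypothetical component: the per-line logic is a pure function, so
 // edge cases (empty lines, repeated whitespace) can be covered by
 // plain unit tests that never touch Spark.
 object WordCounter {
   def toPairs(line: String): Seq[(String, Int)] =
     line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq
 }

 // The integration test only checks the wiring: the same component run
 // through a real SparkContext, including a real shuffle.
 class WordCounterIntegrationTest extends FunSuite {
   test("counts words end to end") {
     val spark: SparkContext =
       SparkSession.builder().master("local[2]").getOrCreate().sparkContext
     val counts = spark.parallelize(Seq("spark test", "spark"))
       .flatMap(WordCounter.toPairs)
       .reduceByKey(_ + _)
       .collect()
       .toMap
     assert(counts === Map("spark" -> 2, "test" -> 1))
   }
 }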