Hands-On Big Data Analytics with PySpark

By: Rudy Lai, Bartłomiej Potaczek
Overview of this book

Apache Spark is an open-source parallel-processing framework that has been around for quite some time now. One of the many uses of Apache Spark is for data analytics applications across clustered computers. In this book, you will not only learn how to use Spark and the Python API to create high-performance analytics with big data, but also discover techniques for testing, immunizing, and parallelizing Spark jobs. You will learn how to source data from all popular data sources, including HDFS, Hive, JSON files, and S3, and deal with large datasets with PySpark to gain practical big data experience. This book will help you work on prototypes on local machines and subsequently go on to handle messy data in production and at scale. This book covers installing and setting up PySpark, RDD operations, big data cleaning and wrangling, and aggregating and summarizing data into useful reports. You will also learn how to implement some practical and proven techniques to improve certain aspects of programming and administration in Apache Spark. By the end of the book, you will be able to build big data analytical solutions using the various PySpark offerings and also optimize them effectively.

Detecting a shuffle in a process

In this section, we will learn how to detect a shuffle in a process.

We will cover the following topics:

  • Loading randomly partitioned data
  • Issuing repartition using a meaningful partition key
  • Understanding how a shuffle occurs by explaining a query

We will load randomly partitioned data to see how and where it is loaded. Next, we will issue a repartition using a meaningful partition key, so that the data is redistributed to the proper executors based on a deterministic, meaningful key. Finally, we will examine our queries with the explain() method to understand where the shuffle occurs. Here, we have a very simple test.
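
The overall flow can be sketched in PySpark roughly as follows. This is a minimal illustration rather than the book's exact test; the Parquet path, the user_id column name, and the groupBy aggregation are assumptions made for the sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("detect-shuffle").getOrCreate()

# Load data whose partitioning is effectively random
# (the Parquet path is assumed for this illustration).
df = spark.read.parquet("/data/input_records")

# Repartition on a deterministic, meaningful key so that all records for
# the same user land in the same partition (and on the same executor).
repartitioned = df.repartition("user_id")

# Inspect the physical plan; an Exchange node in the output marks a shuffle.
repartitioned.groupBy("user_id").count().explain()
```

Once the data is already partitioned by the key used in the aggregation, Spark can avoid a second shuffle for that aggregation, which is exactly what examining the plan with explain() lets us verify.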

We will create a DataFrame with some data. For example, we create an InputRecord with a random UID for user_1, another record with a random ID for user_1, and a final record for user_2. Let's imagine that...
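
A minimal PySpark sketch of such test data might look like the following; the field names uid and user_id are placeholders standing in for the InputRecord type used in the original example:

```python
from pyspark.sql import Row, SparkSession
import uuid

spark = SparkSession.builder.appName("shuffle-test-data").getOrCreate()

# Three records mimicking the InputRecord example: two rows with random IDs
# for user_1 and one row for user_2.
records = [
    Row(uid=str(uuid.uuid4()), user_id="user_1"),
    Row(uid=str(uuid.uuid4()), user_id="user_1"),
    Row(uid=str(uuid.uuid4()), user_id="user_2"),
]
df = spark.createDataFrame(records)
df.show()
```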