Hands-On Big Data Analytics with PySpark

By : Rudy Lai, Bartłomiej Potaczek

Overview of this book

Apache Spark is an open source parallel-processing framework that has been around for quite some time now. One of the many uses of Apache Spark is for data analytics applications across clustered computers. In this book, you will not only learn how to use Spark and the Python API to create high-performance analytics with big data, but also discover techniques for testing, immunizing, and parallelizing Spark jobs. You will learn how to source data from all popular data hosting platforms, including HDFS, Hive, JSON, and S3, and deal with large datasets with PySpark to gain practical big data experience. This book will help you work on prototypes on local machines and subsequently go on to handle messy data in production and at scale. This book covers installing and setting up PySpark, RDD operations, big data cleaning and wrangling, and aggregating and summarizing data into useful reports. You will also learn how to implement some practical and proven techniques to improve certain aspects of programming and administration in Apache Spark. By the end of the book, you will be able to build big data analytical solutions using the various PySpark offerings and also optimize them effectively.

Preface

Who this book is for

What this book covers

To get the most out of this book

Get in touch

Installing PySpark and Setting up Your Development Environment

An overview of PySpark

Setting up Spark on Windows and PySpark

Core concepts in Spark and PySpark

Summary

Getting Your Big Data into the Spark Environment Using RDDs

Loading data on to Spark RDDs

Parallelization with Spark RDDs

Basics of RDD operation

Summary

Big Data Cleaning and Wrangling with Spark Notebooks

Using Spark Notebooks for quick iteration of ideas

Sampling/filtering RDDs to pick out relevant data points

Splitting datasets and creating some new combinations

Summary

Aggregating and Summarizing Data into Useful Reports

Calculating averages with map and reduce

Faster average computations with aggregate

Pivot tabling with key-value paired data points

Summary

Powerful Exploratory Data Analysis with MLlib

Computing summary statistics with MLlib

Using Pearson and Spearman correlations to discover correlations

Testing our hypotheses on large datasets

Summary

Putting Structure on Your Big Data with SparkSQL

Manipulating DataFrames with Spark SQL schemas

Using Spark DSL to build queries

Summary

Transformations and Actions

Using Spark transformations to defer computations to a later time

Avoiding transformations

Using the reduce and reduceByKey methods to calculate the results

Performing actions that trigger computations

Reusing the same RDD for different actions

Summary

Immutable Design

Delving into the Spark RDD's parent/child chain

Using RDD in an immutable way

Using DataFrame operations to transform

Immutability in the highly concurrent environment

Using the Dataset API in an immutable way

Summary

Avoiding Shuffle and Reducing Operational Expenses

Detecting a shuffle in a process

Testing operations that cause a shuffle in Apache Spark

Changing the design of jobs with wide dependencies

Using keyBy() operations to reduce shuffle

Using a custom partitioner to reduce shuffle

Summary

Saving Data in the Correct Format

Saving data in plain text format

Leveraging JSON as a data format

Tabular formats – CSV

Using Avro with Spark

Columnar formats – Parquet

Summary

Working with the Spark Key/Value API

Available actions on key/value pairs

Using aggregateByKey instead of groupBy()

Actions on key/value pairs

Available partitioners on key/value data

Implementing a custom partitioner

Summary

Testing Apache Spark Jobs

Separating logic from Spark engine-unit testing

Integration testing using SparkSession

Mocking data sources using partial functions

Using ScalaCheck for property-based testing

Testing in different versions of Spark

Summary

Leveraging the Spark GraphX API

Creating a graph from a data source

Using the Vertex API

Using the Edge API

Calculating the degree of the vertex

Calculating PageRank

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

An overview of PySpark

Before we install PySpark, the Python interface for Spark, let's go through some core concepts in Spark and PySpark. Spark is a big data tool from Apache, which can be found at http://spark.apache.org/. It's a unified analytics engine for large-scale data processing. This means that, if you have a lot of data, you can feed that data into Spark and run analytics on it quickly. In running-time comparisons against Hadoop MapReduce, the Spark project has reported speedups of up to a hundred times for some in-memory workloads, although the actual gain depends heavily on the job. Spark is also easy to use because it offers well-designed, high-level APIs.

The four major components of the Spark platform are as follows:

Spark SQL: A querying language for Spark
Spark Streaming: Allows you to feed in real-time streaming data
MLlib (machine learning): The machine learning library for Spark
GraphX (graph): The graphing library for Spark

The core concept in Spark is the RDD (resilient distributed dataset), which plays a role similar to a pandas DataFrame or a Python dictionary or list. It is a way for Spark to store large amounts of data on the infrastructure for us. The key difference between an RDD and something that lives in your local memory, such as a pandas DataFrame, is that an RDD is distributed across many machines, yet it appears as one unified dataset. This means that, if you have large amounts of data that you want to operate on in parallel, you can put it in an RDD, and Spark will handle the parallelization and the clustering of the data for you.

Spark offers interfaces for several languages, including the following three:

Scala
Java
Python

The Python interface is provided through the PySpark integration, which we will cover soon. For now, we will import some libraries from the PySpark package to help us work with Spark. The best way for us to understand Spark is to look at an example, as shown in the following code:

lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)

In the preceding code, we have created a new variable called lines by calling sc.textFile("data.txt"). sc is the Python object that represents our Spark cluster. A Spark cluster is a series of instances or cloud computers that run our Spark processes. By calling the textFile method and feeding in data.txt, we have potentially fed in a large text file and created an RDD using just this one line. In other words, what we are doing here is feeding a large text file into a distributed cluster, and Spark handles the clustering for us.

In line two and line three, we have a map and reduce computation. In line two, we have used a lambda function to map the length function over each line of data.txt. In line three, we have called a reduction function to add all of lineLengths together to produce the total length of the document. While lines looks like an ordinary Python variable that contains all the lines in data.txt, under the hood, Spark is actually distributing fragments of data.txt across different instances on the Spark cluster, and is handling the map and reduce computation over all of these instances.
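To make the idea of distributing fragments more concrete, the following is a minimal sketch in plain Python, not Spark code. It assumes a hypothetical helper, split_into_partitions, that stands in for the way Spark spreads data over instances; the function names and the sample lines are made up for illustration. Each partition maps and sums its own lines, and the partial sums are then combined into one total:

```python
from functools import reduce


def split_into_partitions(items, num_partitions):
    """Deal items round-robin into num_partitions lists.

    This is only a stand-in for Spark's partitioning, which is an assumption
    made for this sketch.
    """
    return [items[i::num_partitions] for i in range(num_partitions)]


def total_length(lines, num_partitions=2):
    partitions = split_into_partitions(lines, num_partitions)
    # Map step: each partition computes the length of its own lines.
    mapped = [[len(s) for s in part] for part in partitions]
    # Reduce step, part 1: each partition adds up its own lengths
    # (the initial value 0 keeps empty partitions safe).
    partials = [reduce(lambda a, b: a + b, lengths, 0) for lengths in mapped]
    # Reduce step, part 2: the partial sums are combined into one result.
    return reduce(lambda a, b: a + b, partials, 0)


lines = ["spark", "handles", "the", "clustering"]
print(total_length(lines))  # 25
```

The result does not depend on how many partitions are used, because addition can be applied to the pieces in any grouping. That property is what lets a reduction such as lambda a, b: a + b be spread over many machines.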

Spark SQL

Spark SQL is one of the four components on top of the Spark platform, as we saw earlier in the chapter. It can be used to execute SQL queries or read data from any existing Hive installation, where Hive is a data warehouse implementation, also from Apache. Spark SQL looks very similar to MySQL or Postgres. The following code snippet is a good example:

# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()

#+----+-------+
#| age|   name|
#+----+-------+
#|null|Jackson|
#|  30| Martin|
#|  19| Melvin|
#+----+-------+

Here, we select all the columns from a table, people, by passing a very standard-looking SQL statement to the spark object. The call to show() then prints a result much like what you would expect from a normal SQL implementation.
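Because the query is plain SQL, the same statement can be tried without a Spark cluster. The sketch below is not Spark: it uses Python's built-in sqlite3 module as a stand-in, with a hand-built people table whose rows mirror the output shown above. It is only meant to illustrate that the SELECT statement is ordinary SQL:

```python
import sqlite3

# In-memory SQLite database standing in for the "people" temporary view.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (age INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO people (age, name) VALUES (?, ?)",
    [(None, "Jackson"), (30, "Martin"), (19, "Melvin")],
)

# The same statement that is passed to spark.sql(...) in the snippet above.
rows = conn.execute("SELECT * FROM people").fetchall()
for age, name in rows:
    print(age, name)

conn.close()
```

The missing age is stored as SQL NULL, which Python reports as None and which Spark prints as null.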

Let's now look at Datasets and DataFrames. A Dataset is a distributed collection of data. It is an interface added in Spark 1.6 that provides benefits on top of RDDs. A DataFrame, on the other hand, will feel very familiar to those who have used pandas or R. A DataFrame is simply a Dataset organized into named columns, which makes it similar to a table in a relational database or a data frame in pandas or R. The main difference between a Dataset and a DataFrame is that a DataFrame has column names. As you can imagine, this is very convenient for machine learning work and for feeding data into libraries such as scikit-learn.

Let's look at how DataFrames can be used. The following code snippet is a quick example of a DataFrame:

# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()

#+----+-------+
#| age|   name|
#+----+-------+
#|null|Jackson|
#|  30| Martin|
#|  19| Melvin|
#+----+-------+

In the same way as pandas or R would do, read.json allows us to feed in some data from a JSON file, and df.show() shows us the contents of the DataFrame in a similar way to pandas.
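The idea of rows organized into named columns can also be sketched without Spark. The following plain-Python example assumes hypothetical file contents with one JSON object per line, which is the layout spark.read.json expects by default. The records are invented to match the output above, and the first one deliberately has no age field:

```python
import json

# Hypothetical file contents: one JSON object per line.
# The first record has no "age" field.
raw_lines = [
    '{"name": "Jackson"}',
    '{"name": "Martin", "age": 30}',
    '{"name": "Melvin", "age": 19}',
]

records = [json.loads(line) for line in raw_lines]

# Column names are the union of all keys seen in any record,
# sorted here so that the order matches the output shown above.
columns = sorted({key for record in records for key in record})

# Missing fields become None, mirroring the null in the DataFrame output.
rows = [tuple(record.get(col) for col in columns) for record in records]

print(columns)
for row in rows:
    print(row)
```

This is only an illustration of the column-naming idea. Spark infers the schema itself and keeps the rows distributed across the cluster.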

MLlib, as we know, is used to make machine learning scalable and easy. MLlib allows you to do common machine learning tasks, such as featurization; creating pipelines; and saving and loading algorithms, models, and pipelines. It also provides some utilities, such as linear algebra, statistics, and data handling. The other thing to note is that Spark and RDDs are almost inseparable concepts. Even so, if your main use case for Spark is machine learning, Spark now encourages you to use the DataFrame-based API for MLlib. This is quite beneficial to us, as we are already familiar with pandas, which means a smooth transition into Spark.

In the next section, we will see how we can set up Spark on Windows, and set up PySpark as the interface.
