PySpark Cookbook

PySpark Cookbook

By : Denny Lee, Tomasz Drabas

Buy this Book

PySpark Cookbook

By: Denny Lee, Tomasz Drabas

Buy this Book

Overview of this book

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. The PySpark Cookbook presents effective and time-saving recipes for leveraging the power of Python and putting it to use in the Spark ecosystem. You’ll start by learning the Apache Spark architecture and how to set up a Python environment for Spark. You’ll then get familiar with the modules available in PySpark and start using them effortlessly. In addition to this, you’ll discover how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. You’ll then move on to using ML and MLlib in order to solve any problems related to the machine learning capabilities of PySpark and use GraphFrames to solve graph-processing problems. Finally, you will explore how to deploy your applications to the cloud using the spark-submit command. By the end of this book, you will be able to use the Python API for Apache Spark to solve any problems associated with building data-intensive applications.

Title Page

Packt Upsell

Contributors

Preface

Free Chapter

Installing and Configuring Spark

Introduction

Installing Spark requirements

Installing Spark from sources

Installing Spark from binaries

Configuring a local instance of Spark

Configuring a multi-node instance of Spark

Installing Jupyter

Configuring a session in Jupyter

Working with Cloudera Spark images

Abstracting Data with RDDs

Introduction

Creating RDDs

Reading data from files

Overview of RDD transformations

Overview of RDD actions

Pitfalls of using RDDs

Abstracting Data with DataFrames

Introduction

Creating DataFrames

Accessing underlying RDDs

Performance optimizations

Inferring the schema using reflection

Specifying the schema programmatically

Creating a temporary table

Using SQL to interact with DataFrames

Overview of DataFrame transformations

Overview of DataFrame actions

Preparing Data for Modeling

Introduction

Handling duplicates

Handling missing observations

Handling outliers

Exploring descriptive statistics

Computing correlations

Drawing histograms

Visualizing interactions between features

Machine Learning with MLlib

Loading the data

Exploring the data

Testing the data

Transforming the data

Standardizing the data

Creating an RDD for training

Predicting hours of work for census respondents

Forecasting the income levels of census respondents

Building a clustering models

Computing performance statistics

Machine Learning with the ML Module

Introducing Transformers

Introducing Estimators

Introducing Pipelines

Selecting the most predictable features

Predicting forest coverage types

Estimating forest elevation

Clustering forest cover types

Tuning hyperparameters

Extracting features from text

Discretizing continuous variables

Standardizing continuous variables

Topic mining

Structured Streaming with PySpark

Introduction

Understanding DStreams

Understanding global aggregations

Continuous aggregation with structured streaming

GraphFrames – Graph Theory with PySpark

Introduction

Installing GraphFrames

Preparing the data

Building the graph

Running queries against the graph

Understanding the graph

Using PageRank to determine airport ranking

Finding the fewest number of connections

Visualizing the graph

Index

Customer Reviews

5 star

4 star

3 star

2 star

1 star

Installing Spark from binaries

Installing Spark from already precompiled binaries is even easier than doing the same from the sources. In this recipe, we will show you how to do this by downloading the binaries from the web or by using pip.

Getting ready

To execute this recipe, you will need a bash Terminal and an internet connection. Also, to follow through with this recipe, you will need to have already checked and/or installed all the required environments we went through in the Installing Spark requirements recipe. In addition, you need to have administrative privileges (via the sudo command), as these will be necessary to move the compiled binaries to the destination folder.

Note

If you are not an administrator on your machine, you can call the script with the -ns (or --nosudo) parameter. The destination folder will then switch to your home directory and will create a spark folder within it; by default, the binaries will be moved to the /opt/spark folder and that's why you need administrative rights.

No other prerequisites are required.

How to do it...

To install from the binaries, we only need four steps (see the following source code) as we do not need to compile the sources:

Download the precompiled binaries from Spark's website.
Unpack the archive.
Move to the final destination.
Create the necessary environmental variables.

The skeleton for our code looks as follows (see the Chapter01/installFromBinary.sh file):

#!/bin/bash

# Shell script for installing Spark from binaries

#
# PySpark Cookbook
# Author: Tomasz Drabas, Denny Lee
# Version: 0.1
# Date: 12/2/2017

_spark_binary="http://mirrors.ocf.berkeley.edu/apache/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz"
_spark_archive=$( echo "$_spark_binary" | awk -F '/' '{print $NF}' )
_spark_dir=$( echo "${_spark_archive%.*}" )
_spark_destination="/opt/spark"

...

checkOS
printHeader
downloadThePackage
unpack
moveTheBinaries
setSparkEnvironmentVariables
cleanUp

How it works...

The code is exactly the same as with the previous recipe so we will not be repeating it here; the only major difference is that we do not have the build stage in this script, and the _spark_source variable is different.

As in the previous recipe, we start by specifying the location of Spark's source code, which is in _spark_source. The _spark_archive contains the name of the archive; we use awk to extract the last element. The _spark_dir contains the name of the directory our archive will unpack into; in our current case, this will be spark-2.3.1. Finally, we specify our destination folder where we will be moving the binaries to: it will either be /opt/spark (default) or your home directory if you use the -ns (or --nosudo) switch when calling the ./installFromBinary.sh script.

Next, we check the OS name. Depending on whether you work in a Linux or macOS environment, we will use different tools to download the archive from the internet (check the downloadThePackage function). Also, when setting up the environment variables, we will output to different bash profile files: the .bash_profile on macOS and the .bashrc on Linux (check the setEnvironmentVariables function).

Following the OS check, we download the package: on macOS, we use curl and on Linux, we use wget tools to attain this goal. Once the package is downloaded, we unpack it using the tar tool, and then move it to its destination folder. If you are running with sudo privileges (without the -ns or --nosudo parameters), the binaries will be moved to the /opt/spark folder; if not—they will end up in the ~/spark folder.

Finally, we add environment variables to the appropriate bash profile files: check the previous recipe for an explanation of what is being added and for what reason. Also, follow the steps at the end of the previous recipe to test if your environment is working properly.

There's more...

Nowadays, there is an even simpler way to install PySpark on your machine, that is, by using pip.

Note

pip is Python's package manager. If you installed Python 2.7.9 or Python 3.4 from http://python.org, then pip is already present on your machine (the same goes for our recommended Python distribution—Anaconda). If you do not have pip, you can easily install it from here: https://pip.pypa.io/en/stable/installing/.

To install PySpark via pip, just issue the following command in the Terminal:

pip install pyspark

Or, if you use Python 3.4+, you may also try:

pip3 install pyspark

You should see the following screen in your Terminal:

PySpark Cookbook

By : Denny Lee, Tomasz Drabas

PySpark Cookbook

By: Denny Lee, Tomasz Drabas

Overview of this book

Related Content you might be interested in

Current Title:

PySpark Cookbook

Learning PySpark

Apache Spark Quick Start Guide

Mastering Apache Spark 2.x

Installing Spark from binaries

Getting ready

Note

How to do it...

How it works...

There's more...

Note