Book Image

PySpark Cookbook

By : Denny Lee, Tomasz Drabas
Book Image

PySpark Cookbook

By: Denny Lee, Tomasz Drabas

Overview of this book

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. The PySpark Cookbook presents effective and time-saving recipes for leveraging the power of Python and putting it to use in the Spark ecosystem. You’ll start by learning the Apache Spark architecture and how to set up a Python environment for Spark. You’ll then get familiar with the modules available in PySpark and start using them effortlessly. In addition to this, you’ll discover how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. You’ll then move on to using ML and MLlib in order to solve any problems related to the machine learning capabilities of PySpark and use GraphFrames to solve graph-processing problems. Finally, you will explore how to deploy your applications to the cloud using the spark-submit command. By the end of this book, you will be able to use the Python API for Apache Spark to solve any problems associated with building data-intensive applications.
Table of Contents (13 chapters)
Title Page
Packt Upsell

Configuring a session in Jupyter

Working in Jupyter is great as it allows you to develop your code interactively, and document and share your notebooks with colleagues. The problem, however, with running Jupyter against a local Spark instance is that the SparkSession gets created automatically and by the time the notebook is running, you cannot change much in that session's configuration.

In this recipe, we will learn how to install Livy, a REST service to interact with Spark, and sparkmagic, a package that will allow us to configure sessions interactively as well:


Getting ready

We assume that you either have installed Spark via binaries or compiled the sources as we have shown you in the previous recipes. In other words, by now, you should have a working Spark environment. You will also need Jupyter: if you do not have it, follow the steps from the previous recipe to install it. 

No other prerequisites are required.

How to do it...

To install Livy and sparkmagicwe have created a script that will do this automatically with minimal interaction from you. You can find it in the Chapter01/ folder. You should be familiar with most of the functions that we're going to use here by now, so we will focus only on those that are different (highlighted in bold in the following code). Here is the high-level view of the script's structure:


# Shell script for installing Spark from binaries 
# PySpark Cookbook
# Author: Tomasz Drabas, Denny Lee
# Version: 0.1
# Date: 12/2/2017

_livy_archive=$( echo "$_livy_binary" | awk -F '/' '{print $NF}' )
_livy_dir=$( echo "${_livy_archive%.*}" )
downloadThePackage $( echo "${_livy_binary}" )
unpack $( echo "${_livy_archive}" )
moveTheBinaries $( echo "${_livy_dir}" ) $( echo "${_livy_destination}" )

# create log directory inside the folder
mkdir -p "$_livy_destination/logs"


How it works...

As with all other scripts we have presented so far, we will begin by setting some global variables.


If you do not know what these mean, check the Installing Spark from sources recipe.

Livy requires some configuration files from Hadoop. Thus, as part of this script, we allow you to install Hadoop should it not be present on your machine. That is why we now allow you to pass arguments to the downloadThePackage, unpack, and moveTheBinaries functions.


The changes to the functions are fairly self-explanatory, so for the sake of space, we will not be pasting the code here. You are more than welcome, though, to peruse the relevant portions of the script.

Installing Livy drills down literally to downloading the package, unpacking it, and moving it to its final destination (in our case, this is /opt/livy). 

Checking if Hadoop is installed is the next thing on our to-do list. To run Livy with local sessions, we require two environment variables: SPARK_HOME and HADOOP_CONF_DIR; the SPARK_HOME is definitely set but if you do not have Hadoop installed, you most likely will not have the latter environment variable set:

function checkHadoop() {
    if type -p hadoop; then
        echo "Hadoop executable found in PATH"
    elif [[ -n "$HADOOP_HOME" ]] && [[ -x "$HADOOP_HOME/bin/hadoop" ]]; then
        echo "Found Hadoop executable in HADOOP_HOME"
        echo "No Hadoop found. You should install Hadoop first. You can still continue but some functionality might not be available. "
        echo -n "Do you want to install the latest version of Hadoop? [y/n]: "
        read _install_hadoop

        case "$_install_hadoop" in
            y*) installHadoop ;;
            n*) echo "Will not install Hadoop" ;;
            *)  echo "Will not install Hadoop" ;;

function installHadoop() {
    _hadoop_archive=$( echo "$_hadoop_binary" | awk -F '/' '{print $NF}' )
    _hadoop_dir=$( echo "${_hadoop_archive%.*}" )
    _hadoop_dir=$( echo "${_hadoop_dir%.*}" )

    downloadThePackage $( echo "${_hadoop_binary}" )
    unpack $( echo "${_hadoop_archive}" )
    moveTheBinaries $( echo "${_hadoop_dir}" ) $( echo "${_hadoop_destination}" )

The checkHadoop function first checks if the hadoop binary is present on the PATH; if not, it will check if the HADOOP_HOME variable is set and, if it is, it will check if the hadoop binary can be found inside the $HADOOP_HOME/bin folder. If both attempts fail, the script will ask you if you want to install the latest version of Hadoop; the default answer is n but if you answer y, the installation will begin.

Once the installation finishes, we will begin installing the additional kernels for the Jupyter Notebooks.


A kernel is a piece of software that translates the commands from the frontend notebook to the backend environment (like Python). For a list of available Jupyter kernels check out the following link: Here are some instructions on how to develop a kernel yourself:

Here's the function that handles the kernel's installation:

function installJupyterKernels() {
    # install the library 
    pip install sparkmagic

    # ipywidgets should work properly
    jupyter nbextension enable --py --sys-prefix widgetsnbextension 

    # install kernels
    # get the location of sparkmagic
    _sparkmagic_location=$(pip show sparkmagic | awk -F ':' '/Location/ {print $2}') 

    _temp_dir=$(pwd) # store current working directory

    cd $_sparkmagic_location # move to the sparkmagic folder
    jupyter-kernelspec install sparkmagic/kernels/sparkkernel
    jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
    jupyter-kernelspec install sparkmagic/kernels/pyspark3kernel


    # enable the ability to change clusters programmatically
    jupyter serverextension enable --py sparkmagic

    # install autowizwidget
    pip install autovizwidget

    cd $_temp_dir

First, we install the sparkmagic package for Python. Quoting directly from

"Sparkmagic is a set of tools for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter Notebooks. The Sparkmagic project includes a set of magics for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment."

The following command enables the Javascript extensions in Jupyter Notebooks so that ipywidgets can work properly; if you have an Anaconda distribution of Python, this package will be installed automatically.

Following this, we install the kernels. We need to switch to the folder where sparkmagic was installed into. The pip show <package> command displays all relevant information about the installed packages; from the output, we only extract the Location using awk.

To install the kernels, we use the jupyter-kernelspec install <kernel> command. For example, the command will install the sparkmagic kernel for the Scala API of Spark:

jupyter-kernelspec install sparkmagic/kernels/sparkkernel 

Once all the kernels are installed, we enable Jupyter to use sparkmagic so that we can change clusters programmatically. Finally, we will install the autovizwidget, an auto-visualization library for pandas dataframes.

This concludes the Livy and sparkmagic installation part.

There's more...

Now that we have everything in place, let's see what this can do. 

First, start Jupyter (note that we do not use the pyspark command):

jupyter notebook

You should now be able to see the following options if you want to add a new notebook:

If you click on PySpark, it will open a notebook and connect to a kernel. 

There are a number of available magics to interact with the notebooks; type %%help to list them all. Here's the list of the most important ones:






Outputs session information from Livy.


%%cleanup -f

Delete all sessions running on the current Livy endpoint. The -f switch forces the cleanup.


%%delete -f -s 0

Deletes the session specified by the -s switch; the -f switch forces the deletion.


%%configure -f

{"executorMemory": "1000M", "executorCores": 4}

Arguably the most useful magic. Allows you to configure your session. Check for the full list of available configuration parameters.


%%sql -o tables -q


Executes an SQL query against the current SparkSession.




All the code in the notebook cell with this magic will be executed locally against the Python environment.

Once you have configured your session, you will get information back from Livy about the active sessions that are currently running:

Let's try to create a simple data frame using the following code:

from pyspark.sql.types import *

# Generate our data 
ListRDD = sc.parallelize([
    (123, 'Skye', 19, 'brown'), 
    (223, 'Rachel', 22, 'green'), 
    (333, 'Albert', 23, 'blue')

# The schema is encoded using StructType 
schema = StructType([
    StructField("id", LongType(), True), 
    StructField("name", StringType(), True),
    StructField("age", LongType(), True),
    StructField("eyeColor", StringType(), True)

# Apply the schema to the RDD and create DataFrame
drivers = spark.createDataFrame(ListRDD, schema)

# Creates a temporary view using the data frame

Once you execute the preceding code in a cell inside the notebook, only then will the SparkSession be created:

If you execute %%sql magic, you will get the following:

See also

  • Check the Livy REST API in case you want to submit jobs programmatically: Also, for a list of configurable parameters available in sparkmagic, go to: