Working in Jupyter is great, as it allows you to develop your code interactively and to document and share your notebooks with colleagues. The problem with running Jupyter against a local Spark instance, however, is that the SparkSession gets created automatically, and by the time the notebook is running, you cannot change much in that session's configuration.
In this recipe, we will learn how to install Livy, a REST service to interact with Spark, and sparkmagic, a package that will allow us to configure sessions interactively as well:
Source: http://bit.ly/2iO3EwC
We assume that you either have installed Spark via binaries or compiled the sources as we have shown you in the previous recipes. In other words, by now, you should have a working Spark environment. You will also need Jupyter: if you do not have it, follow the steps from the previous recipe to install it.
No other prerequisites are required.
To install Livy and sparkmagic, we have created a script that will do this automatically with minimal interaction from you. You can find it in the Chapter01/installLivy.sh file. You should be familiar with most of the functions we're going to use here by now, so we will focus only on those that are different (highlighted in bold in the following code). Here is the high-level view of the script's structure:
#!/bin/bash

# Shell script for installing Spark from binaries
#
# PySpark Cookbook
# Author: Tomasz Drabas, Denny Lee
# Version: 0.1
# Date: 12/2/2017

_livy_binary="http://mirrors.ocf.berkeley.edu/apache/incubator/livy/0.4.0-incubating/livy-0.4.0-incubating-bin.zip"
_livy_archive=$( echo "$_livy_binary" | awk -F '/' '{print $NF}' )
_livy_dir=$( echo "${_livy_archive%.*}" )
_livy_destination="/opt/livy"
_hadoop_destination="/opt/hadoop"

...

checkOS
printHeader
createTempDir
downloadThePackage $( echo "${_livy_binary}" )
unpack $( echo "${_livy_archive}" )
moveTheBinaries $( echo "${_livy_dir}" ) $( echo "${_livy_destination}" )

# create log directory inside the folder
mkdir -p "$_livy_destination/logs"

checkHadoop
installJupyterKernels
setSparkEnvironmentVariables
cleanUp
As with all other scripts we have presented so far, we will begin by setting some global variables.
Livy requires some configuration files from Hadoop. Thus, as part of this script, we allow you to install Hadoop should it not be present on your machine. That is why we now allow you to pass arguments to the downloadThePackage, unpack, and moveTheBinaries functions.
Note
The changes to the functions are fairly self-explanatory, so for the sake of space, we will not be pasting the code here. You are more than welcome, though, to peruse the relevant portions of the installLivy.sh script.
Installing Livy boils down to downloading the package, unpacking it, and moving it to its final destination (in our case, this is /opt/livy).
Checking if Hadoop is installed is the next thing on our to-do list. To run Livy with local sessions, we require two environment variables: SPARK_HOME and HADOOP_CONF_DIR. SPARK_HOME is definitely set by now, but if you do not have Hadoop installed, you most likely will not have the latter variable set:
function checkHadoop() {
  if type -p hadoop; then
    echo "Hadoop executable found in PATH"
    _hadoop=hadoop
  elif [[ -n "$HADOOP_HOME" ]] && [[ -x "$HADOOP_HOME/bin/hadoop" ]]; then
    echo "Found Hadoop executable in HADOOP_HOME"
    _hadoop="$HADOOP_HOME/bin/hadoop"
  else
    echo "No Hadoop found. You should install Hadoop first. You can still continue but some functionality might not be available."
    echo
    echo -n "Do you want to install the latest version of Hadoop? [y/n]: "
    read _install_hadoop

    case "$_install_hadoop" in
      y*) installHadoop ;;
      n*) echo "Will not install Hadoop" ;;
      *)  echo "Will not install Hadoop" ;;
    esac
  fi
}

function installHadoop() {
  _hadoop_binary="http://mirrors.ocf.berkeley.edu/apache/hadoop/common/hadoop-2.9.0/hadoop-2.9.0.tar.gz"
  _hadoop_archive=$( echo "$_hadoop_binary" | awk -F '/' '{print $NF}' )
  _hadoop_dir=$( echo "${_hadoop_archive%.*}" )
  _hadoop_dir=$( echo "${_hadoop_dir%.*}" )

  downloadThePackage $( echo "${_hadoop_binary}" )
  unpack $( echo "${_hadoop_archive}" )
  moveTheBinaries $( echo "${_hadoop_dir}" ) $( echo "${_hadoop_destination}" )
}
The checkHadoop function first checks whether the hadoop binary is present on the PATH. If it is not, it checks whether the HADOOP_HOME variable is set and, if so, whether the hadoop binary can be found inside the $HADOOP_HOME/bin folder. If both attempts fail, the script asks you whether you want to install the latest version of Hadoop; the default answer is n, but if you answer y, the installation will begin.
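If you already have Hadoop installed outside of this script, you can set the two variables Livy needs yourself. Here is a minimal sketch; both paths are assumptions based on the /opt install destinations used in this recipe, so adjust them to your actual layout:

```shell
# Both paths below are assumptions matching this recipe's install
# destinations; point them at your actual Spark and Hadoop locations.
export SPARK_HOME="/opt/spark"
export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop"
```

Adding these two lines to your ~/.bash_profile (or ~/.bashrc) makes them persist across terminal sessions.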
Once the installation finishes, we will begin installing the additional kernels for the Jupyter Notebooks.
Note
A kernel is a piece of software that translates the commands from the frontend notebook to the backend environment (like Python). For a list of available Jupyter kernels check out the following link: https://github.com/jupyter/jupyter/wiki/Jupyter-kernels. Here are some instructions on how to develop a kernel yourself: http://jupyter-client.readthedocs.io/en/latest/kernels.html.
Here's the function that handles the installation of the kernels:
function installJupyterKernels() {
  # install the library
  pip install sparkmagic
  echo

  # ipywidgets should work properly
  jupyter nbextension enable --py --sys-prefix widgetsnbextension
  echo

  # install kernels
  # get the location of sparkmagic
  _sparkmagic_location=$(pip show sparkmagic | awk -F ':' '/Location/ {print $2}')

  _temp_dir=$(pwd)           # store current working directory
  cd $_sparkmagic_location   # move to the sparkmagic folder

  jupyter-kernelspec install sparkmagic/kernels/sparkkernel
  jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
  jupyter-kernelspec install sparkmagic/kernels/pyspark3kernel
  echo

  # enable the ability to change clusters programmatically
  jupyter serverextension enable --py sparkmagic
  echo

  # install autovizwidget
  pip install autovizwidget

  cd $_temp_dir
}
First, we install the sparkmagic package for Python. Quoting directly from https://github.com/jupyter-incubator/sparkmagic:
"Sparkmagic is a set of tools for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter Notebooks. The Sparkmagic project includes a set of magics for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment."
The following command enables the JavaScript extensions in Jupyter Notebooks so that ipywidgets can work properly; if you have an Anaconda distribution of Python, this package will be installed automatically.
Following this, we install the kernels. We need to switch to the folder where sparkmagic was installed. The pip show <package> command displays all relevant information about an installed package; from its output, we extract only the Location using awk.
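To see what that extraction does, you can run the same pipeline against a package that is guaranteed to be present, pip itself; sparkmagic would behave identically once installed (substituting pip for sparkmagic is the only assumption here):

```shell
# Print the site-packages directory that pip reports in its
# "Location:" line, exactly as the script does for sparkmagic.
python3 -m pip show pip | awk -F ': ' '/^Location/ {print $2}'
```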
To install the kernels, we use the jupyter-kernelspec install <kernel> command. For example, the following command will install the sparkmagic kernel for the Scala API of Spark:
jupyter-kernelspec install sparkmagic/kernels/sparkkernel
Once all the kernels are installed, we enable Jupyter to use sparkmagic so that we can change clusters programmatically. Finally, we install autovizwidget, an auto-visualization library for pandas DataFrames.
This concludes the Livy and sparkmagic installation part.
Now that we have everything in place, let's see what this can do.
First, start Jupyter (note that we do not use the pyspark command):
jupyter notebook
You should now be able to see the following options if you want to add a new notebook:
If you click on PySpark, it will open a notebook and connect to a kernel.
There are a number of available magics to interact with the notebooks; type %%help to list them all. Here's the list of the most important ones:
| Magic | Example | Explanation |
| --- | --- | --- |
| `%%info` | `%%info` | Outputs session information from Livy. |
| `%%cleanup` | `%%cleanup -f` | Deletes all sessions running on the current Livy endpoint. The `-f` flag forces the cleanup. |
| `%%delete` | `%%delete -f -s 0` | Deletes the session specified by the `-s` flag; the `-f` flag forces the deletion. |
| `%%configure` | `%%configure -f {"executorCores": 2}` | Arguably the most useful magic. Allows you to configure your session. Check http://bit.ly/2kSKlXr for the full list of available configuration parameters. |
| `%%sql` | `%%sql SHOW TABLES` | Executes an SQL query against the current SparkSession. |
| `%%local` | `%%local a = 1` | All the code in the notebook cell with this magic will be executed locally against the Python environment. |
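For instance, a typical first cell sets the session's resources through %%configure before any Spark code runs. The sketch below shows the shape of such a cell and checks its JSON body for validity; the parameter names are assumptions taken from Livy's session-creation API, so verify them against the configuration link given for %%configure:

```shell
# In a notebook cell you would write (sketch; parameter names assumed
# from Livy's POST /sessions API):
#
#   %%configure -f
#   {"executorMemory": "1000M", "executorCores": 2}
#
# The same JSON body, checked for validity outside the notebook:
config='{"executorMemory": "1000M", "executorCores": 2}'
echo "$config" | python3 -m json.tool
```

The -f flag tells sparkmagic to drop the current session and start a new one with the new configuration.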
Once you have configured your session, you will get information back from Livy about the active sessions that are currently running:
Let's try to create a simple data frame using the following code:
from pyspark.sql.types import *

# Generate our data
ListRDD = sc.parallelize([
    (123, 'Skye', 19, 'brown'),
    (223, 'Rachel', 22, 'green'),
    (333, 'Albert', 23, 'blue')
])

# The schema is encoded using StructType
schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
    StructField("age", LongType(), True),
    StructField("eyeColor", StringType(), True)
])

# Apply the schema to the RDD and create DataFrame
drivers = spark.createDataFrame(ListRDD, schema)

# Creates a temporary view using the data frame
drivers.createOrReplaceTempView("drivers")
Once you execute the preceding code in a cell inside the notebook, only then will the SparkSession be created:
If you execute the %%sql magic, you will get the following:
- Check the Livy REST API in case you want to submit jobs programmatically: https://livy.incubator.apache.org/docs/latest/rest-api.html. Also, for a list of configurable parameters available in sparkmagic, go to: https://github.com/jupyter-incubator/sparkmagic/blob/master/examples/Pyspark%20Kernel.ipynb.
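As a small taste of that REST API, the sketch below builds the JSON payload for creating an interactive PySpark session and shows (commented out, since it needs a running server) the corresponding curl calls; the /sessions endpoint and the default port 8998 come from the Livy documentation linked above:

```shell
# payload describing a new interactive PySpark session
payload='{"kind": "pyspark"}'

# create the session (uncomment when a Livy server is running;
# localhost:8998 is Livy's default address):
# curl -s -X POST -H 'Content-Type: application/json' \
#      -d "$payload" http://localhost:8998/sessions

# list the sessions currently known to the server:
# curl -s http://localhost:8998/sessions

# sanity-check the payload locally:
echo "$payload" | python3 -m json.tool
```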