
Setting up Jupyter Notebook with Spark


In this section, we will look at how to set up a Jupyter Notebook with Spark. For those of you who are not yet familiar with the notebook environment, it is important to understand its benefits compared with a traditional development environment. Please note that Jupyter Notebook is just one of the many options available to users.

What is a Jupyter Notebook?

A Jupyter Notebook is an interactive computational environment that combines code execution, rich media and text, and data visualization using numerous visualization libraries. The notebook itself is a small web application that you can use to create documents, adding explanatory text before sharing them with your peers or colleagues. Jupyter notebooks are used at Google, Microsoft, IBM, NASA, and Bloomberg, among many other leading companies.

Setting up a Jupyter Notebook

Following are the steps to set up a Jupyter Notebook:

  • Pre-requisites - You will need Python 2.7 or Python >= 3.3 to install Jupyter Notebook.
  • Install Anaconda - Anaconda is recommended, as it installs Python, Jupyter Notebook, and other commonly used packages for scientific computing and data science. You can download Anaconda from the following link: https://www.continuum.io/downloads.

Figure 11.6: Installing Anaconda-1

You can click the link to access the installer and download it to your Linux system:

Figure 11.7: Installing Anaconda-2

Once you have downloaded Anaconda, you can go ahead and install it.
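For reference, running the installer from a terminal looks something like the following sketch; the installer filename is an assumption based on the Anaconda3 4.3.0 installer used later in this section, and it will vary with the version and platform you download:

# Run the installer script you downloaded (the filename is illustrative)
bash Anaconda3-4.3.0-Linux-x86_64.sh

# Reload your shell profile so the path added by the installer takes effect
source ~/.bashrc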

Figure 11.8: Installing Anaconda-3

The installer will ask you questions about the install location, walk you through the license agreement, and then ask you to confirm the installation and whether it should add the path to the bashrc file. You can then start the notebook using the following command:

jupyter notebook

However, please bear in mind that by default the notebook server runs locally at 127.0.0.1:8888. If this is what you are looking for, great. If, however, you would like to open it up to the public, you will need to secure your notebook server.

Securing the notebook server

The notebook server can be protected with a simple single password by configuring the NotebookApp.password setting in the following file: jupyter_notebook_config.py.

This file should be located in the ~/.jupyter directory in your home directory. If you have just installed Anaconda, you might not have this directory yet. You can create it by executing the following command:

jupyter notebook --generate-config

Running this command will create the ~/.jupyter directory along with a default configuration file:

Figure 11.9: Securing Jupyter for public access

Preparing a hashed password

You can use Jupyter to create a hashed password or prepare it manually.

Using Jupyter (only with version 5.0 and later)

You can issue the following command to create a hashed password:

jupyter notebook password

This will save the hashed password in a file called jupyter_notebook_config.json in your ~/.jupyter directory.
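The contents of that file are simply the hashed password stored under the NotebookApp section; it should look broadly like the following (the hash below is only an illustrative value, reused from the example later in this section):

{
  "NotebookApp": {
    "password": "sha1:cd7ef63fc00a:2816fd7ed6a47ac9aeaa2477c1587fd18ab1ecdc"
  }
}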

Manually creating a hashed password

You can use Python to manually create the hashed password:

Figure 11.10: Manually creating a hashed password
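If you prefer to see this as code rather than a screenshot, a minimal sketch of the manual approach is shown below; it uses the passwd() helper from the notebook.auth module, which prompts for the password and returns the salted hash to paste into your configuration:

from notebook.auth import passwd

# Prompts for the password twice and returns a string such as 'sha1:...'
hashed = passwd()
print(hashed)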

You can use the hashed password generated by either of these methods in your jupyter_notebook_config.py, replacing the parameter value for c.NotebookApp.password.

c.NotebookApp.password = u'sha1:cd7ef63fc00a:2816fd7ed6a47ac9aeaa2477c1587fd18ab1ecdc'

Figure 11.11: Using the generated hashed password

By default, the notebook runs on port 8888; you'll see the option to change the port as well.

Since we want to allow public access to the notebook, we have to allow all IPs to access the notebook using any of the configured network interfaces of the public server. This can be done by making the following changes:

Figure 11.12: Configuring Notebook server to listen on all interfaces
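As a sketch, the relevant lines in jupyter_notebook_config.py would look something like the following; the password hash is the illustrative value used earlier, and binding to all interfaces with '*' while disabling the automatic browser launch is one common way to set up a headless public server (adjust these settings to your own environment):

c = get_config()

# Listen on all configured network interfaces instead of only 127.0.0.1
c.NotebookApp.ip = '*'

# The hashed password generated earlier (illustrative value)
c.NotebookApp.password = u'sha1:cd7ef63fc00a:2816fd7ed6a47ac9aeaa2477c1587fd18ab1ecdc'

# Default port; change it here if 8888 is not suitable
c.NotebookApp.port = 8888

# A headless server has no local browser to open
c.NotebookApp.open_browser = False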

You can now run Jupyter, and access it from any computer with access to the notebook server:

Figure 11.13: Jupyter interface
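For instance, assuming the server is reachable at a hypothetical address such as notebook-server.example.com, you would start the server there and then point a browser on any client machine at port 8888:

jupyter notebook
# then, from another machine, browse to:
# http://notebook-server.example.com:8888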

Setting up PySpark on Jupyter

The next step is to integrate PySpark with the Jupyter Notebook. You have to perform the following steps to set up PySpark:

  1. Update your bashrc file and set the following variables:

            # added by Anaconda3 4.3.0 installer
            export PATH="/root/anaconda3/bin:$PATH"
            export PYSPARK_PYTHON=/usr/bin/python
            export SPARK_HOME=/spark/spark-2.0.2/
            export PATH=$PATH:/spark/spark-2.0.2/bin
            export PYSPARK_DRIVER_PYTHON=jupyter
            export PYSPARK_DRIVER_PYTHON_OPTS=notebook
  2. Configure PySpark Kernel: Create a file /usr/local/share/jupyter/kernels/pyspark/kernel.json with the following parameters:

            {
              "display_name": "PySpark",
              "language": "python",
              "argv": [ "/root/anaconda3/bin/python", "-m", "ipykernel",
                        "-f", "{connection_file}" ],
              "env": {
                "SPARK_HOME": "/spark/spark-2.0.2/",
                "PYSPARK_PYTHON": "/root/anaconda3/bin/python",
                "PYTHONPATH": "/spark/spark-2.0.2/python/:/spark/spark-2.0.2/python/lib/py4j-0.10.3-src.zip",
                "PYTHONSTARTUP": "/spark/spark-2.0.2/python/pyspark/shell.py",
                "PYSPARK_SUBMIT_ARGS": "--master spark://sparkmaster:7077 pyspark-shell"
              }
            }
  3. Open the notebook: Now, when you open the notebook with the jupyter notebook command, you will find an additional kernel installed. You can create new notebooks with the new kernel:

    Figure 11.14: New Kernel
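To confirm the kernel is wired up correctly, you can create a notebook with the PySpark kernel and run a cell along the lines of the following minimal sketch; because the kernel's PYTHONSTARTUP points at pyspark/shell.py, the spark session and sc context should already be defined when the kernel starts:

# Run in a notebook cell using the PySpark kernel
print(spark.version)        # should report 2.0.2 for this setup

df = spark.range(10)        # a simple DataFrame containing 0..9
print(df.count())           # should print 10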