Apache Spark Deep Learning Cookbook

By: Ahmed Sherif, Amrith Ravindra

Overview of this book

Organizations these days need to integrate popular big data tools such as Apache Spark with highly efficient deep learning libraries if they’re looking to gain faster and more powerful insights from their data. With this book, you’ll discover over 80 recipes to help you train fast, enterprise-grade deep learning models on Apache Spark. Each recipe addresses a specific problem, and offers a proven, best-practice solution to difficulties encountered while implementing various deep learning algorithms in a distributed environment. The book follows a systematic approach, featuring a balance of theory and tips with best practice solutions to assist you with training different types of neural networks such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). You’ll also have access to code written in TensorFlow and Keras that you can run on Spark to solve a variety of deep learning problems in computer vision and natural language processing (NLP), or tweak to tackle other problems encountered in deep learning. By the end of this book, you'll have the skills you need to train and deploy state-of-the-art deep learning models on Apache Spark.

Starting and configuring a Spark cluster


For most chapters, one of the first things we will do is initialize and configure our Spark cluster.

Getting ready

Import the following before initializing the cluster.

  • from pyspark.sql import SparkSession

How to do it...

This section walks through the steps to initialize and configure a Spark cluster.

  1. Import SparkSession using the following script:
from pyspark.sql import SparkSession
  2. Configure SparkSession with a variable named spark using the following script:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("GenericAppName") \
    .config("spark.executor.memory", "6gb") \
    .getOrCreate()
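
To confirm that the session came up with the intended settings, a quick check such as the following can be run (a minimal sketch, assuming the spark variable created in the previous step):

print(spark.version)                               # Spark version in use
print(spark.conf.get("spark.executor.memory"))     # prints 6gb, as configured above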

How it works...

This section explains how SparkSession works as the entry point for developing within Spark.

  1. Starting with Spark 2.0, it is no longer necessary to create a SparkConf and SparkContext to begin development in Spark; building a SparkSession handles that initialization for us. Additionally, it is important to note that SparkSession is part of the sql module from pyspark.
  2. We can assign properties to our SparkSession (each of these can be verified with the short check after this list):
    1. master: assigns the Spark master URL to run on our local machine with the maximum available number of cores
    2. appName: assigns a name for the application
    3. config: assigns 6gb to the spark.executor.memory property
    4. getOrCreate: ensures that a SparkSession is created if one is not available and retrieves an existing one if it is available
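
The following sketch (assuming the spark session created earlier) reads each of these properties back from the running session and shows that a second call to getOrCreate() reuses the existing session rather than creating a new one:

# Read back the properties assigned through the builder
print(spark.sparkContext.master)                   # local[*]
print(spark.sparkContext.appName)                  # GenericAppName
print(spark.conf.get("spark.executor.memory"))     # 6gb

# getOrCreate() returns the existing session if one is already running
same_session = SparkSession.builder.getOrCreate()
print(same_session is spark)                       # True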

There's more...

For development purposes, while we are building an application on smaller datasets, we can just use master("local"). If we were to deploy in a production environment, we would want to specify master("local[*]") to ensure we are using the maximum number of cores available and getting optimal performance; a small sketch of switching between the two follows.
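
One way to keep this switch in one place is to parameterize the master URL; in the sketch below, dev_mode and master_url are illustrative names rather than part of the recipe:

from pyspark.sql import SparkSession

# Toggle between a single-core development session and all available cores
dev_mode = True
master_url = "local" if dev_mode else "local[*]"

spark = SparkSession.builder \
    .master(master_url) \
    .appName("GenericAppName") \
    .config("spark.executor.memory", "6gb") \
    .getOrCreate()
# Note: if a SparkSession is already running, getOrCreate() returns it
# unchanged, so stop the existing session first to apply a new master URL.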

See also

To learn more about SparkSession.builder, visit the following website:

https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/SparkSession.Builder.html