For most chapters, one of the first things we will do is initialize and configure our Spark cluster. This section walks through those steps.
- Import `SparkSession` using the following script:

```python
from pyspark.sql import SparkSession
```
- Configure `SparkSession` with a variable named `spark` using the following script:

```python
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("GenericAppName") \
    .config("spark.executor.memory", "6gb") \
    .getOrCreate()
```
This section explains how the `SparkSession` works as an entry point to develop within Spark.
- Starting with Spark 2.0, it is no longer necessary to create a `SparkConf` and `SparkContext` to begin development in Spark; importing `SparkSession` will handle initializing a cluster. Additionally, it is important to note that `SparkSession` is part of the `sql` module from `pyspark`.
- We can assign properties to our `SparkSession`:
    - `master`: assigns the Spark master URL to run on our `local` machine with the maximum available number of cores
    - `appName`: assigns a name for the application
    - `config`: assigns `6gb` to `spark.executor.memory`
    - `getOrCreate`: ensures that a `SparkSession` is created if one is not available, and retrieves an existing one if it is available
For development purposes, while we are building an application on smaller datasets, we can simply use `master("local")`, which runs Spark on a single core. If we were deploying to a production environment, we would want to specify `master("local[*]")` to ensure we use the maximum number of cores available and get optimal performance.
To learn more about `SparkSession.builder`, visit the following website:

https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/SparkSession.Builder.html