There is actually not much you need to do to configure a local instance of Spark. The beauty of Spark is that all you need to do to get started is to follow either of the previous two recipes (installing from sources or from binaries) and you can begin using it. In this recipe, however, we will walk you through the most useful SparkSession
configuration options.
In order to follow this recipe, a working Spark environment is required. This means that you will have to have gone through the previous three recipes and successfully installed and tested your environment, or have a working Spark environment already set up.
No other prerequisites are necessary.
To configure your session in a Spark version lower than 2.0, you would normally have to create a SparkConf object, set all your options to the right values, and then build the SparkContext (SQLContext if you wanted to use DataFrames, and HiveContext if you wanted access to Hive tables). Starting from Spark 2.0, you just need to create a SparkSession, as in the following snippet:
spark = SparkSession.builder \
    .master("local[2]") \
    .appName("Your-app-name") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
To create a SparkSession, we will use the Builder class (accessed via the .builder property of the SparkSession class). You can specify some basic properties of the SparkSession here:
- The .master(...) method allows you to specify the driver node (in our preceding example, we would be running a local session with two cores)
- The .appName(...) method gives you the means to specify a friendly name for your app
- The .config(...) method allows you to refine your session's behavior further; the list of the most important SparkSession parameters is outlined in the following table
- The .getOrCreate() method returns either a new SparkSession if one has not been created yet, or returns a pointer to an already existing SparkSession
The following table gives an example list of the most useful configuration parameters for a local instance of Spark:
Note
Some of these parameters are also applicable if you are working in a cluster environment with multiple worker nodes. In the next recipe, we will explain how to set up and administer a multi-node Spark cluster deployed over YARN.
There are some environment variables that also allow you to further fine-tune your Spark environment. Specifically, we are talking about the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS variables. We have already covered these in the Installing Spark from sources recipe.
- Check the full list of all available configuration options here: https://spark.apache.org/docs/latest/configuration.html