Now that we have Spark running in our shell, we can learn about programming in greater detail. A Spark application consists of a driver program, which is responsible for distributing operations among the cluster members. The driver program also distributes fragments of the data structures across the cluster, and then applies operations to them in a distributed way.
Driver programs access Spark through a SparkContext object, which represents the connection to the cluster. In the shell, it is always available through the sc variable. To see what type sc is:
scala> sc
res1: org.apache.spark.SparkContext = org.apache.spark.SparkContext@e4b54d3
To run operations, driver programs manage a number of worker processes, called executors, running on the cluster nodes. For example, if we run a simple count() operation on a file in a cluster, the work of the count() operation is distributed among all the cluster members, each counting the portion of the file assigned to it by the driver program.
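As a minimal sketch of this, assuming a file named README.md sits in the directory where the shell was launched (both the file name and the resulting count are only illustrative), we can load it as an RDD and count its lines:

scala> val lines = sc.textFile("README.md")   // creates an RDD of the file's lines
scala> lines.count()                          // counted in parallel, one partition per executor
res2: Long = 104

Here textFile() creates an RDD whose partitions can be spread across the executors, and count() is evaluated in parallel: each executor counts the lines in its own partitions, and the driver sums the partial results.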
In our examples, as we only have one machine where we run the Spark shell, Spark performs all of this work locally, on that single machine.
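One quick way to confirm this (assuming the shell was started without an explicit master URL, so it defaults to local mode) is to ask the SparkContext which master it is connected to:

scala> sc.master
res3: String = local[*]

The local[*] value means Spark is running in local mode, using as many worker threads as there are cores on the machine; on a real cluster this would instead be the URL of the cluster manager.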