Spark, being an MPP-style environment, generally does not provide shared state, because code executes in parallel on remote cluster nodes. Separate copies of data and variables are used during the map() and reduce() phases, and providing a read-write variable shared across multiple executing tasks would be grossly inefficient. Spark, however, provides two types of shared variables:
Broadcast variables - Read-only variables cached on each machine
Accumulators - Variables that can only be added to, through associative and commutative operations
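These two semantics can be sketched outside Spark. The following is a minimal, hypothetical illustration (plain Python multiprocessing, not Spark's API): each worker process receives its own copy of the read-only `lookup` table, mirroring a broadcast variable, and the per-partition partial counts are merged with an associative, commutative operation (`+`), mirroring how accumulator updates combine regardless of task order. The names `lookup`, `count_known`, and the sample partitions are invented for this sketch.

```python
from multiprocessing import Pool

# Read-only data, copied to every worker process: a stand-in for a
# broadcast variable. Tasks read it but never mutate shared state.
lookup = {"a": 1, "b": 2}

def count_known(words):
    # Each task computes a local partial result from its own copy.
    return sum(1 for w in words if w in lookup)

if __name__ == "__main__":
    partitions = [["a", "x"], ["b", "a"], ["y"]]
    with Pool(2) as pool:
        partials = pool.map(count_known, partitions)
    # Associative, commutative merge of partial results: the stand-in
    # for an accumulator. The order of merging does not matter.
    total = sum(partials)
    print(total)  # 3
```

Because the merge is associative and commutative, the driver can combine partial results in whatever order tasks happen to finish, which is exactly why Spark restricts accumulators to such operations.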
Large-scale data movement is often a major factor degrading performance in MPP environments, and hence every care is taken to reduce data movement while working in a clustered environment. One way to reduce data movement is to cache frequently accessed data objects on the worker machines, which is essentially what Spark's broadcast variables are about: they keep read-only variables cached on each machine rather than shipping a copy with every task.
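The savings from caching can be made concrete with a back-of-the-envelope model. This sketch uses invented, illustrative numbers (a 100 MB table, 1,000 tasks, 10 nodes), not measurements: without a broadcast variable a copy of the table travels with every task, while with one it crosses the network roughly once per node.

```python
# Hypothetical cost model, illustrative only: compare network traffic for
# shipping a lookup table with every task vs. broadcasting it once per node.
def mb_shipped(table_mb, tasks, nodes, broadcast):
    # Broadcast: one copy per node; otherwise one copy per task closure.
    copies = nodes if broadcast else tasks
    return table_mb * copies

per_task = mb_shipped(table_mb=100, tasks=1000, nodes=10, broadcast=False)
per_node = mb_shipped(table_mb=100, tasks=1000, nodes=10, broadcast=True)
print(per_task, per_node)  # 100000 1000
```

Under these assumed numbers, broadcasting cuts the data shipped by two orders of magnitude, which is why frequently reused read-only data is a natural broadcast candidate.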