Serialization plays an important role in distributed computing. There are two persistence (storage) levels that store RDDs in serialized form:
MEMORY_ONLY_SER: This stores RDDs as serialized objects, creating one byte array per partition.
MEMORY_AND_DISK_SER: This is similar to MEMORY_ONLY_SER, but it spills partitions that do not fit in memory to disk.
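The "one byte array per partition" behavior can be illustrated without Spark. The following is a minimal pure-Scala sketch (the partition data and object name are made up for illustration) that serializes each partition's records into a single byte array using Java serialization, the default serializer Spark applies for these storage levels:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object SerializedPartitions {
  // Serialize one partition's records into a single byte array,
  // mimicking what MEMORY_ONLY_SER does per partition.
  def serializePartition(records: Seq[String]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(records)
    out.close()
    bytes.toByteArray
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical dataset split into two partitions
    val partitions = Seq(Seq("spark", "scala"), Seq("rdd", "dataset"))
    val serialized = partitions.map(serializePartition)
    // One byte array per partition
    println(s"${serialized.size} byte arrays, sizes: ${serialized.map(_.length).mkString(", ")}")
  }
}
```

Storing a partition as one compact byte array is what shrinks the memory footprint, at the price of the serialization work shown above.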
The following are the steps to add appropriate persistence levels:
- Start the Spark shell:
$ spark-shell
- Import the StorageLevel object, which enumerates the persistence levels, along with its associated implicits:
scala> import org.apache.spark.storage.StorageLevel._
- Create a dataset:
scala> val words = spark.read.textFile("words")
- Persist the dataset:
scala> words.persist(MEMORY_ONLY_SER)
Though serialization reduces the memory footprint substantially, it costs extra CPU cycles, because the data must be deserialized every time it is read back.
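That deserialization cost can be seen in a small sketch: every read of a serialized partition must first turn its byte array back into objects. This is pure Scala with Java serialization, not Spark itself, and the names are illustrative:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object DeserializationCost {
  def serialize(records: Seq[String]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(records)
    out.close()
    bytes.toByteArray
  }

  // Every access to a serialized partition pays this CPU cost first
  def deserialize(bytes: Array[Byte]): Seq[String] = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    in.readObject().asInstanceOf[Seq[String]]
  }

  def main(args: Array[String]): Unit = {
    val cachedPartition = serialize(Seq.fill(1000)("word"))
    // Reading the "cached" partition requires deserializing it each time
    val records = deserialize(cachedPartition)
    println(records.size)
  }
}
```

With the non-serialized levels (MEMORY_ONLY, MEMORY_AND_DISK) this step is skipped, which is exactly the memory-versus-CPU tradeoff described above.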