Serialization plays an important part in distributed computing. There are two persistence (storage) levels that support serializing RDDs:

MEMORY_ONLY_SER: This stores RDDs as serialized objects. It will create one byte array per partition.

MEMORY_AND_DISK_SER: This is similar to MEMORY_ONLY_SER, but it spills partitions that do not fit in memory to disk.
The following are the steps to add appropriate persistence levels:
Start the Spark shell:
$ spark-shell
Import StorageLevel and the implicits associated with it:
scala> import org.apache.spark.storage.StorageLevel._
Create an RDD:
scala> val words = sc.textFile("words")
Persist the RDD:
scala> words.persist(MEMORY_ONLY_SER)
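As a quick sanity check of the steps above (a sketch, continuing the same spark-shell session), you can inspect the level that was applied and release the cache when it is no longer needed:

```scala
// Confirm the storage level set by persist().
scala> words.getStorageLevel

// Remove the RDD from the cache when done with it.
scala> words.unpersist()
```

Note that persist() only marks the RDD for caching; the data is actually materialized the first time an action (such as count or collect) runs on it.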
Though serialization reduces the memory footprint substantially, it adds extra CPU cycles for deserialization every time the data is read back.
By default, Spark uses Java serialization. Since Java serialization is slow, the better approach is to use the Kryo library. Kryo is much faster and...
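Switching to Kryo is a configuration change rather than a code change. As a minimal sketch (spark.serializer and the KryoSerializer class are standard Spark settings), it can be enabled when starting the shell:

```shell
# Start the Spark shell with Kryo as the serializer for shuffled
# and cached data, instead of the default Java serialization.
$ spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
```

With this setting in place, serialized persistence levels such as MEMORY_ONLY_SER use Kryo to encode each partition.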