Spark Cookbook
Serialization plays an important role in distributed computing. There are two persistence (storage) levels that support serializing RDDs:
MEMORY_ONLY_SER: This stores RDDs as serialized objects, creating one byte array per partition. Partitions that do not fit in memory are recomputed when needed.
MEMORY_AND_DISK_SER: This is similar to MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk instead of being recomputed (see the sketch after this list).
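Both levels keep the data in serialized form; they differ only in what happens to partitions that do not fit in memory. A minimal sketch in the Spark shell, using nothing beyond the StorageLevel class itself, makes the difference visible through the flags on the two constants:
scala> import org.apache.spark.storage.StorageLevel
scala> StorageLevel.MEMORY_ONLY_SER.deserialized    // false: data is kept in serialized form
scala> StorageLevel.MEMORY_ONLY_SER.useDisk         // false: partitions that do not fit are recomputed
scala> StorageLevel.MEMORY_AND_DISK_SER.useDisk     // true: partitions that do not fit spill to disk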
The following are the steps to apply these persistence levels:
Start the Spark shell:
$ spark-shell
Import the storage levels defined on the StorageLevel companion object:
scala> import org.apache.spark.storage.StorageLevel._
Create an RDD:
scala> val words = sc.textFile("words")
Persist the RDD:
scala> words.persist(MEMORY_ONLY_SER)
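A quick way to confirm that the level took effect, assuming the words RDD from the previous step, is to ask the RDD for its storage level. Note that a storage level cannot be changed once assigned, so to spill to disk instead, persist a fresh RDD with MEMORY_AND_DISK_SER:
scala> words.getStorageLevel                 // reports the serialized, memory-only level
scala> val words2 = sc.textFile("words")     // a fresh RDD, since a level cannot be reassigned
scala> words2.persist(MEMORY_AND_DISK_SER)   // serialized in memory, spilling to disk when needed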
Though serialization substantially reduces the memory footprint, it adds extra CPU cycles to serialize and deserialize the data.
By default, Spark uses Java serialization. Since Java serialization is slow, a better approach is to use the Kryo library. Kryo is much faster and...
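As a sketch of how Kryo can be switched on (this configuration code is not part of the recipe), the spark.serializer property is set on a SparkConf before the SparkContext is created, and registerKryoClasses pre-registers application classes so Kryo does not have to write full class names; Person here is a purely hypothetical example class:
import org.apache.spark.SparkConf
case class Person(name: String, age: Int)      // hypothetical class, used only for illustration
val conf = new SparkConf()
  .setAppName("kryo-example")                  // hypothetical application name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Person])) // registering classes keeps the serialized output compact
When working in the Spark shell, the same property can be passed at launch, for example: spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer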