Book Image

Spark Cookbook

By : Rishi Yadav
Book Image

Spark Cookbook

By: Rishi Yadav

Overview of this book

Table of Contents (19 chapters)
Spark Cookbook
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Using serialization to improve performance


Serialization plays an important part in distributed computing. There are two persistence (storage) levels, which support serializing RDDs:

  • MEMORY_ONLY_SER: This stores RDDs as serialized objects. It will create one byte array per partition

  • MEMORY_AND_DISK_SER: This is similar to the MEMORY_ONLY_SER, but it spills partitions that do not fit in the memory to disk

The following are the steps to add appropriate persistence levels:

  1. Start the Spark shell:

    $ spark-shell
    
  2. Import the StorageLevel and implicits associated with it:

    scala> import org.apache.spark.storage.StorageLevel._
    
  3. Create an RDD:

    scala> val words = sc.textFile("words")
    
  4. Persist the RDD:

    scala> words.persist(MEMORY_ONLY_SER)
    

Though serialization reduces the memory footprint substantially, it adds extra CPU cycles due to deserialization.

By default, Spark uses Java's serialization. Since the Java serialization is slow, the better approach is to use Kryo library. Kryo is much faster and...