Book Image

Spark Cookbook

By : Rishi Yadav
Book Image

Spark Cookbook

By: Rishi Yadav

Overview of this book

Table of Contents (19 chapters)
Spark Cookbook
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Optimizing garbage collection


JVM garbage collection can be a challenge if you have a lot of short lived RDDs. JVM needs to go over all the objects to find the ones it needs to garbage collect. The cost of the garbage collection is proportional to the number of objects the GC needs to go through. Therefore, using fewer objects and the data structures that use fewer objects (simpler data structures, such as arrays) helps.

Serialization also shines here as a byte array needs only one object to be garbage collected.

By default, Spark uses 60 percent of the executor memory to cache RDDs and the rest 40 percent for regular objects. Sometimes, you may not need 60 percent for RDDs and can reduce this limit so that more space is available for object creation (less need for GC).

How to do it…

You can set the memory allocated for RDD cache to 40 percent by starting the Spark shell and setting the memory fraction:

$ spark-shell --conf spark.storage.memoryFraction=0.4