
Performance tuning


Most of you will have heard the old adage "Good, fast, cheap - pick any two". That adage is still true, though the scales have shifted slightly with the open source model, where the software is free but requires a relevant skillset to make the best use of it. That skillset comes at a cost, and performance tuning is one area where such a specialized skillset is a must-have. When you talk about performance tuning, the underlying assumption is that your system is already working and fully functional.

Figure 11.1: Good - Fast and Cheap

You are simply not happy with the response times. However, it does not have to be that way: there are certain key decisions you can take early on that help you build a relatively optimized system from the start.

So what are the key areas for consideration? Each distributed application has to work with four major computing resources:

  • Network I/O
  • Disk I/O
  • CPU
  • Memory

For an application to perform at its optimum level, it has to make the best use of all of these resources. We'll look at the areas that allow you to improve computational efficiency across them. The key topics of concern are:

  • Data serialization
  • Memory tuning

Data serialization

Serialization is the process of converting an object into a sequence of bytes which can then be:

  • Persisted to disk
  • Saved to a database
  • Sent over the network

The reverse process, converting bytes back into an object, is called deserialization. As you can imagine, serialization and deserialization are fairly common operations, especially during caching, persistence, and shuffle operations in Spark. The speed of your application will depend on the serialization mechanism you choose. Those of you with a Java background will know that Java provides a Serializable interface; Java serialization is the default serialization mechanism in Spark, but it is not the fastest mechanism around. The main reasons Java serialization is slow are:

  • Java Serialization uses excessive temporary object allocation.
  • Java Serialization makes use of Reflection to get/set field values.

So while Java serialization is flexible, it is slow; hence, you should consider Kryo serialization, which is an alternative serialization mechanism available in Spark. You need to set the new serializer in the Spark configuration.

Let us start by setting up the serializer:

conf.set( "spark.serializer",
"org.apache.spark.serializer.KyroSerializer")

For any network-intensive application, it is recommended that you use the Kryo serializer. Since Spark 2.0, the framework has used Kryo internally when shuffling RDDs of simple types, arrays of simple types, and strings. However, you will still need to register your custom classes with Kryo, which can be done using the registerKryoClasses method:

conf.registerKryoClasses(Array(classOf[MyCustomClass]))
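
Putting these pieces together, a minimal sketch of a Kryo-enabled configuration might look like the following. MyCustomClass, its fields, and the application name are hypothetical placeholders used only for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical domain class we want Kryo to serialize efficiently
case class MyCustomClass(id: Long, name: String)

val conf = new SparkConf()
  .setAppName("kryo-example") // hypothetical application name
  // Switch from the default Java serialization to Kryo
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Register the custom classes used by the application
  .registerKryoClasses(Array(classOf[MyCustomClass]))

val spark = SparkSession.builder().config(conf).getOrCreate()

Once this configuration is applied, RDD caching and shuffling of MyCustomClass instances will go through Kryo rather than Java serialization.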

The following are the Spark-related Kryo serialization properties that you can set in the configuration object (http://bit.ly/2khlxCv):

  • spark.kryo.classesToRegister (default: none): If you use Kryo serialization, gives a comma-separated list of custom class names to register with Kryo. See the tuning guide for more details.
  • spark.kryo.referenceTracking (default: true): Whether to track references to the same object when serializing data with Kryo. This is necessary if your object graphs have loops, and useful for efficiency if they contain multiple copies of the same object. It can be disabled to improve performance if you know this is not the case.
  • spark.kryo.registrationRequired (default: false): Whether to require registration with Kryo. If set to true, Kryo will throw an exception if an unregistered class is serialized. If set to false (the default), Kryo will write unregistered class names along with each object. Writing class names can cause significant performance overhead, so enabling this option can enforce that a user has not omitted classes from registration.
  • spark.kryo.registrator (default: none): If you use Kryo serialization, gives a comma-separated list of classes that register your custom classes with Kryo. This property is useful if you need to register your classes in a custom way, for example, to specify a custom field serializer; otherwise, spark.kryo.classesToRegister is simpler. It should be set to classes that extend KryoRegistrator.
  • spark.kryo.unsafe (default: false): Whether to use the unsafe-based Kryo serializer, which can be substantially faster by using unsafe-based I/O.
  • spark.kryoserializer.buffer.max (default: 64m): Maximum allowable size of the Kryo serialization buffer. This must be larger than any object you attempt to serialize. Increase this if you get a "buffer limit exceeded" exception inside Kryo.
  • spark.kryoserializer.buffer (default: 64k): Initial size of Kryo's serialization buffer. Note that there will be one buffer per core on each worker. This buffer will grow up to spark.kryoserializer.buffer.max if needed.
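
If you need full control over registration (the spark.kryo.registrator property above), a minimal sketch of a custom registrator might look like this; MyKryoRegistrator and MyCustomClass are hypothetical names used only for illustration:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical registrator that registers our classes with Kryo directly,
// which also lets you attach custom field serializers if required
class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyCustomClass])
  }
}

// Point Spark at the registrator instead of using spark.kryo.classesToRegister
conf.set("spark.kryo.registrator", classOf[MyKryoRegistrator].getName)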

Memory tuning

Data serialization is key during all persistence and shuffle operations, but since Spark is an in-memory engine, you can expect memory tuning to play a key part in your application's performance. Spark has improved since its earlier versions in how it handles contention for memory, which is a scarce resource (certainly scarcer than disk), between the various elements of the framework. The three major types of contention are:

  • Execution and storage
  • Tasks running in parallel
  • Operators within the same task

We'll look at each of these in turn, and then cover the memory management configuration options and some key memory tuning tips.

Execution and storage

Memory required for the execution of tasks, for example shuffles, joins, sorts, and aggregations, is called execution memory, whereas memory used for caching datasets and propagating internal data across the cluster is known as storage memory. Since Spark 1.6 and the advent of unified memory management, memory is shared between execution and storage, which means that if execution requires more memory, it can borrow it from storage. Both execution and storage can grow into this shared region, but while the need for more execution memory can evict cached blocks from storage using an LRU mechanism, the need for more storage will not evict execution memory.
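
The boundaries of this shared region are controlled by two settings documented on the configuration page. As a rough sketch, the values shown below are the Spark 2.x defaults, not tuning recommendations:

// Fraction of the JVM heap (minus a reserved overhead) shared by
// execution and storage
conf.set("spark.memory.fraction", "0.6")

// Portion of that shared region within which cached blocks are
// protected from eviction by execution
conf.set("spark.memory.storageFraction", "0.5")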

Tasks running in parallel

Since Spark 1.0, memory is assigned to tasks dynamically: if N tasks are running, the available memory is divided among them. If you are running a very memory-intensive task, say Task1, it will consume all of the available memory. If another task, say Task2, is then scheduled, Task1 will need to spill some of its data to disk to ensure that Task2 gets a fair share of the resources.

Operators within the same task

If a task contains a number of operators then, since Spark 1.6, they use cooperative spilling to share memory resources. For example, if you are running a query that needs to aggregate data before sorting it, the aggregate operator initially gets all the available memory; when the sort operator runs, it asks the aggregate operator to share some memory, which might result in the aggregate spilling some of its pages to disk.

Memory management configuration options

Spark provides a number of memory management configuration options documented on the Apache Spark configuration page at http://bit.ly/2kgDDtk.

Memory tuning key tips

The following are some tips to help you understand memory management within Spark:

  1. The best way to understand your storage requirements is to either look at the Spark web UI to see how much memory a cached object consumes, or, for specific objects, to use the estimate method of SizeEstimator (http://bit.ly/sizeestimator), as shown in the sketch after this list.
  2. If you have less than 32 GB of RAM, you can conserve memory by making your pointers 4 bytes instead of 8 using the JVM flag -XX:+UseCompressedOops. This enables the use of compressed 32-bit ordinary object pointers (OOPs) in a 64-bit JVM without sacrificing the heap size advantage. The flag can be set in the spark-env.sh file.
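
As a quick sketch of the first tip, the in-memory footprint of an object can be estimated from a Spark shell or application as follows; sampleRecords is a hypothetical collection used purely for illustration:

import org.apache.spark.util.SizeEstimator

// Hypothetical collection whose in-memory footprint we want to estimate
val sampleRecords = (1 to 100000).map(i => (i, s"record-$i")).toArray

// Estimated size in bytes of the whole object graph
val estimatedBytes = SizeEstimator.estimate(sampleRecords)
println(s"Approximate in-memory size: ${estimatedBytes / (1024 * 1024)} MB")

For the second tip, if you would rather not edit spark-env.sh, the same JVM flag can also be passed through spark.executor.extraJavaOptions (and its driver equivalent) in your Spark configuration.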

More memory tips can be found at http://bit.ly/2lkV4s6. If you would like to spend some more time understanding memory management within Spark, I urge you to have a look at the following talk: http://bit.ly/2lkZorEx.