Apache Spark V1.5 introduced code generation for expression evaluation. In order to understand this, let's start with an example. Let's have a look at the following expression:
val i = 23 val j = 5 var z = i*x+j*y
Imagine x
and y
are data coming from a row in a table. Now, consider that this expression is applied for every row in a table of, let's say, one billion rows. Now the Java Virtual Machine has to execute (interpret) this expression one billion times, which is a huge overhead. So what Tungsten actually does is transform this expression into byte-code and have it shipped to the executor thread.
As you might know, every class executed on the JVM is byte-code. This is an intermediate abstraction layer to the actual machine code specific for each different micro-processor architecture. This was one of the major selling points of Java decades ago. So the basic workflow is:
- Java source code gets compiled into Java byte-code.
- Java byte-code gets interpreted by the JVM.
- The JVM...