Spark's Python API is more limited than its Java and Scala APIs, but it supports most of the core functionality.
The hallmarks of a MapReduce system are its two primary operations: map and reduce. You have seen the map function used in the previous chapters. The map function takes a function that operates on each individual element of the input RDD and produces a new output element. For example, to produce a new RDD in which one has been added to every number, you would use rdd.map(lambda x: x+1). It is important to understand that map and the other Spark functions do not modify the existing elements; instead, they return a new RDD containing the new elements. The reduce function takes a function that operates on pairs of elements to combine all of the data; the result is returned to the calling program. For example, to sum all of the elements, you would use rdd.reduce(lambda x, y: x+y).
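As a minimal sketch of how these two operations fit together (assuming a SparkContext named sc is already available, for example from the pyspark shell), the following adds one to every number and then sums the results:

```python
# Assumes a SparkContext `sc` already exists (e.g. in the pyspark shell).
rdd = sc.parallelize([1, 2, 3, 4])

# map returns a *new* RDD; the original rdd is left unchanged.
incremented = rdd.map(lambda x: x + 1)   # elements: 2, 3, 4, 5

# reduce combines pairs of elements and returns the result to the driver.
total = incremented.reduce(lambda x, y: x + y)
print(total)  # 14
```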
The flatMap function is a useful utility that allows you to write a function which returns an Iterable object of values for each input element; flatMap then flattens those results into a single RDD.
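For instance, a small sketch (again assuming a SparkContext named sc) that splits lines of text into individual words:

```python
# Assumes a SparkContext `sc` already exists.
lines = sc.parallelize(["hello world", "hi"])

# The function passed to flatMap returns an iterable per input element;
# flatMap flattens all of those iterables into one RDD of words.
words = lines.flatMap(lambda line: line.split(" "))
print(words.collect())  # ['hello', 'world', 'hi']
```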