One of Spark's major attractions is its ability to scale computations massively, which is exactly what machine learning algorithms need. The caveat is that not all machine learning algorithms can be parallelized effectively; each algorithm poses its own challenges for parallelization, whether task parallelism or data parallelism. That said, Spark is becoming the de facto platform for building machine learning algorithms and applications. Spark 2.0.0 has come a long way since version 1.1.0, with more algorithms and interesting APIs. For the latest information, refer to the Spark site at https://spark.apache.org/docs/latest/ml-guide.html, which is the authoritative source.
In this chapter, we will first cover the machine learning interfaces and their organization, including the new ML pipeline API, which became mainstream in 2.0.0. Then, we will delve into the following machine learning algorithms:
Basic...