Spark MLlib is a general-purpose machine learning library that brings all the benefits of Spark (distributed computing, scalability, and fault tolerance) along with easy interoperability among different Spark modules and other libraries. Machine learning is not a new concept, and it was certainly not developed by Spark alone; what makes Spark MLlib stand out is its ease of use and the general way in which any ML workflow can be built using pipelines. The pipeline concept itself has long been used by the scikit-learn library, and Apache Spark has done a brilliant job of applying the same concept in a distributed mode. Broadly, Spark's machine learning module ships with:
- Common machine learning algorithms.
- Tools to load, extract, transform, and select features.
- The ability to chain multiple operations using pipelines.
- The ability to save and load algorithms, models, and pipelines.
- The capability of performing linear algebra and statistical operations.
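To make the idea of chaining operations in a pipeline more concrete, here is a minimal sketch in plain Python. It is an assumption-laden toy, not the `pyspark.ml` API: it runs on ordinary Python lists rather than distributed DataFrames, and although the class names borrow Spark's terminology (transformer, estimator, pipeline, pipeline model), every class below is hypothetical and written only for illustration. In particular, `VocabularyCounter` and `VocabularyModel` are made-up stages.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

Row = Dict[str, object]  # toy stand-in for a DataFrame row


class Transformer:
    """A stage that maps rows to rows without learning anything."""

    def transform(self, rows: List[Row]) -> List[Row]:
        raise NotImplementedError


class Estimator:
    """A stage that learns from rows and returns a fitted Transformer."""

    def fit(self, rows: List[Row]) -> Transformer:
        raise NotImplementedError


class Tokenizer(Transformer):
    """Adds a 'words' column by lowercasing and splitting the 'text' column."""

    def transform(self, rows):
        return [{**r, "words": str(r["text"]).lower().split()} for r in rows]


@dataclass
class VocabularyModel(Transformer):
    """Fitted stage (hypothetical): counts each vocabulary word per row."""

    vocabulary: List[str]

    def transform(self, rows):
        return [
            {**r, "counts": [r["words"].count(w) for w in self.vocabulary]}
            for r in rows
        ]


class VocabularyCounter(Estimator):
    """Unfitted stage (hypothetical): learns the vocabulary from 'words'."""

    def fit(self, rows):
        vocabulary = sorted({w for r in rows for w in r["words"]})
        return VocabularyModel(vocabulary)


@dataclass
class PipelineModel(Transformer):
    """A fitted pipeline: applies every fitted stage in order."""

    stages: List[Transformer]

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


@dataclass
class Pipeline(Estimator):
    """An ordered chain of stages that is itself an estimator."""

    stages: List[Union[Transformer, Estimator]]

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)   # learn from the data seen so far
            rows = stage.transform(rows)  # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


training = [{"text": "Spark is fast"}, {"text": "Spark scales"}]
model = Pipeline([Tokenizer(), VocabularyCounter()]).fit(training)
result = model.transform([{"text": "spark spark is distributed"}])
```

Here `model.vocabulary`-style state lives in the second fitted stage, and `result[0]["counts"]` holds one count per learned vocabulary word. The design point the sketch tries to capture is that fitting a pipeline yields a single reusable object, so the same sequence of steps is applied to training data and to new data alike.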
Over the years...