Spark MLlib's goal is to make practical ML scalable and easy. Like Spark Core, MLlib provides APIs in three languages (Python, Scala, and Java), with example code that eases the learning curve for users coming from different backgrounds. The pipeline API in MLlib provides a uniform set of high-level APIs, built on top of DataFrames, that help users create and tune practical ML pipelines. This API lives in a new package named spark.ml.
MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline or workflow. Let's see the key terms introduced by the pipeline API:
DataFrame: The ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. For example, a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
Transformer: A transformer is an algorithm which can transform one DataFrame into another DataFrame...