Spark ML is an API built on top of the DataFrame API of Spark SQL for constructing machine learning pipelines. It is inspired by the scikit-learn project and makes it easier to combine multiple algorithms into a single pipeline. ML pipelines use the following concepts:
DataFrame: A DataFrame holds data in rows and named columns, much like an RDBMS table. Its columns can contain text, feature vectors, true labels, and predictions.
Transformer: A Transformer is an algorithm that transforms one DataFrame into another. An ML model is an example of a Transformer: it transforms a DataFrame with features into a DataFrame with predictions.
Estimator: An Estimator is an algorithm that produces a Transformer by fitting on a DataFrame. A learning algorithm is an example of an Estimator: it trains on a DataFrame and produces a model.
Pipeline: As the name indicates, a Pipeline creates a workflow by chaining multiple Transformers and Estimators together.
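The relationship between these four concepts can be sketched in plain Python. This is a toy model of the contract only, not the Spark API: a "DataFrame" is stood in for by a list of row dictionaries, and `Tokenizer`, `KeywordClassifier`, and `KeywordModel` are hypothetical classes written for this illustration. Only the method names `fit` and `transform` mirror the ones Spark ML uses.

```python
from dataclasses import dataclass
from typing import Dict, List

# Stand-in for a DataFrame: a list of rows, each row a dict of column -> value.
# (Assumption for illustration; a real Spark DataFrame is distributed and typed.)
Row = Dict[str, object]
Frame = List[Row]


class Transformer:
    """Turns one frame into another frame."""
    def transform(self, frame: Frame) -> Frame:
        raise NotImplementedError


class Estimator:
    """Fits on a frame and returns a Transformer (the fitted model)."""
    def fit(self, frame: Frame) -> Transformer:
        raise NotImplementedError


class Tokenizer(Transformer):
    """Stateless Transformer: adds a 'words' column derived from 'text'."""
    def transform(self, frame: Frame) -> Frame:
        return [{**row, "words": str(row["text"]).lower().split()} for row in frame]


@dataclass
class KeywordModel(Transformer):
    """Fitted model: predicts 1.0 when a row contains any learned keyword."""
    keywords: frozenset

    def transform(self, frame: Frame) -> Frame:
        return [
            {**row, "prediction": 1.0 if self.keywords & set(row["words"]) else 0.0}
            for row in frame
        ]


class KeywordClassifier(Estimator):
    """Hypothetical Estimator: learns words seen only in positively labeled rows."""
    def fit(self, frame: Frame) -> Transformer:
        positive, negative = set(), set()
        for row in frame:
            (positive if row["label"] == 1.0 else negative).update(row["words"])
        return KeywordModel(frozenset(positive - negative))


class PipelineModel(Transformer):
    """Fitted pipeline: applies each fitted stage in order."""
    def __init__(self, stages):
        self.stages = list(stages)

    def transform(self, frame: Frame) -> Frame:
        for stage in self.stages:
            frame = stage.transform(frame)
        return frame


class Pipeline(Estimator):
    """Chains Transformers and Estimators; fitting yields a PipelineModel."""
    def __init__(self, stages):
        self.stages = list(stages)

    def fit(self, frame: Frame) -> Transformer:
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(frame)      # Estimator -> Transformer
            frame = stage.transform(frame)    # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


training = [
    {"text": "spark ml pipelines", "label": 1.0},
    {"text": "cooking pasta tonight", "label": 0.0},
    {"text": "spark streaming tonight", "label": 1.0},
]
model = Pipeline([Tokenizer(), KeywordClassifier()]).fit(training)
predictions = model.transform([{"text": "Learning Spark"}, {"text": "pasta recipes"}])
```

Note that the pipeline itself is fitted like any other Estimator and the result is used like any other Transformer. Spark ML follows the same shape: its `Pipeline` is an Estimator, and fitting it returns a `PipelineModel`, which is a Transformer.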