Introducing Pipelines
The Pipeline class helps to sequence, or streamline, the execution of separate blocks that lead to an estimated model; it chains multiple Transformers and Estimators to form a sequential execution workflow.
Pipelines are useful because they remove the need to explicitly create multiple transformed datasets as the data moves through the different parts of the overall data transformation and model estimation process. Instead, Pipelines abstract the distinct intermediate stages by automating the data flow through the workflow. This raises the level of abstraction of the system, which makes the code more readable and maintainable and makes it easier to debug.
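To make the data flow concrete, here is a minimal pure-Python sketch of the idea. It is not Spark's implementation, and every class name in it (AddOne, MaxScaler, ToyPipeline, and so on) is hypothetical and chosen only for illustration. It assumes a Transformer is any object with a transform() method and an Estimator is any object with a fit() method that returns a fitted Transformer.

```python
# Conceptual sketch only: plain Python lists stand in for DataFrames,
# and all class names are hypothetical (none of them are Spark APIs).

class AddOne:
    """Toy Transformer: only knows how to transform data."""
    def transform(self, rows):
        return [r + 1 for r in rows]


class MaxScaler:
    """Toy Estimator: fit() learns the maximum and returns a fitted Transformer."""
    def fit(self, rows):
        return MaxScalerModel(max(rows))


class MaxScalerModel:
    """Fitted Transformer produced by MaxScaler.fit()."""
    def __init__(self, peak):
        self.peak = peak

    def transform(self, rows):
        return [r / self.peak for r in rows]


class ToyPipeline:
    """Chains stages; the intermediate datasets never leave fit()."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                # Estimator: fit on the current data to get a Transformer
                stage = stage.fit(rows)
            # Push the data through to feed the next stage
            rows = stage.transform(rows)
            fitted.append(stage)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    """Result of fitting: a chain made up of Transformers only."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


model = ToyPipeline([AddOne(), MaxScaler()]).fit([1, 3, 7])
result = model.transform([1, 3, 7])
print(result)
```

The caller never handles the intermediate list produced by AddOne; the pipeline passes it to the next stage internally, which is the convenience described above.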
In this recipe, we will streamline the execution of a generalized linear regression model.
Getting ready
To execute this recipe, you will need a working Spark environment, and you should have already loaded the data into the forest DataFrame.
No other prerequisites are required.
How to do it...
The following code provides a streamlined version...