In this section, we will discuss methods of organizing dataset preprocessing into workflows, and then use Apache Spark pipelines to represent and implement these workflows. We will then review solutions for automating data preprocessing.
After this section, we will be able to use Spark pipelines to represent and implement dataset preprocessing workflows, and understand some of the automation solutions made available by Apache Spark.
Our data preparation work, from Data cleaning to Identity matching to Data re-organization to Feature extraction, was organized to reflect a step-by-step process of preparing datasets for machine learning. In other words, all of this data preparation work can be organized into a workflow.
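To make the workflow idea concrete, here is a minimal plain-Python sketch of chaining the steps above into a single repeatable sequence of stages. The stage functions, field names, and sample records are hypothetical illustrations, not actual Spark code; Spark's `pyspark.ml.Pipeline` applies the same stage-chaining idea to DataFrames at scale.

```python
def clean(rows):
    # Data cleaning: drop records with missing values.
    return [r for r in rows if all(v is not None for v in r.values())]

def match_identities(rows):
    # Identity matching: normalize the join key (here, lowercase names).
    return [{**r, "name": r["name"].lower()} for r in rows]

def extract_features(rows):
    # Feature extraction: derive a numeric feature from a raw field.
    return [{**r, "name_length": len(r["name"])} for r in rows]

def run_pipeline(rows, stages):
    # Apply each stage in order, just as a Spark Pipeline runs its stages.
    for stage in stages:
        rows = stage(rows)
    return rows

data = [{"name": "Alice"}, {"name": None}, {"name": "BOB"}]
result = run_pipeline(data, [clean, match_identities, extract_features])
# → [{'name': 'alice', 'name_length': 5}, {'name': 'bob', 'name_length': 3}]
```

Because the stages are expressed as an ordered list, the same workflow can be re-run unchanged on new data, which is exactly the repeatability property discussed next.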
Organizing data preparation into workflows helps achieve repeatability and also makes automation possible, which is often the most valuable benefit for machine learning professionals, as ML professionals and data...