Machine learning professionals and data scientists often spend 70% to 80% of their time preparing data for their machine learning projects. Data preparation can be very hard work, but it is necessary and extremely important because it affects everything that follows. Therefore, in this chapter, we will cover all the necessary parts of data preparation for machine learning, which typically run from data access and data cleaning, through dataset joining, to feature development, so that our datasets are ready for developing ML models on Spark. Specifically, we will discuss the following six data preparation tasks and then end the chapter with a discussion of repeatability and automation:
Accessing and loading datasets
Publicly available datasets for ML
Loading datasets into Spark easily
Exploring and visualizing data with Spark
Data cleaning
Dealing with missing cases and incompleteness
Data cleaning on Spark
Data cleaning made easy
Identity matching
Dealing...