In real-world use cases, raw data rarely comes in the form of features and labels needed to train a model, so substantial pre-processing is common. Spark, in conjunction with its machine learning library, provides a comprehensive set of tools and algorithms for this purpose. These pre-processing algorithms fall into three categories:
Feature extraction
Feature transformation
Feature selection
Feature extraction is the process of deriving features from raw data. The HashingTF algorithm used in the preceding use case is a good example: it converts the terms of text data into feature vectors. Feature transformation is the process of converting features from one format to another. Feature selection is the process of choosing a subset of features from a larger set. Covering all of these is beyond the scope of this chapter, but the next section is going to discuss an Estimator,...
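To make the feature-extraction idea concrete, the following is a minimal plain-Python sketch of the hashing trick that HashingTF is based on: each term is hashed to an index in a fixed-size vector and that slot's count is incremented. This is illustrative only and does not use the actual Spark API; the function name and vector size are assumptions for the example.

```python
import zlib

def hashing_tf(terms, num_features=20):
    """Illustrative sketch of the hashing trick behind Spark's HashingTF:
    map each term to a bucket index via a hash function and count
    occurrences, producing a fixed-length term-frequency vector."""
    vector = [0] * num_features
    for term in terms:
        # crc32 is used here because Python's built-in hash() is
        # randomized per process; Spark uses a different hash internally.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vector[index] += 1
    return vector

counts = hashing_tf("spark makes feature extraction easy".split())
print(len(counts))   # fixed vector length regardless of vocabulary size
print(sum(counts))   # total term count: 5
```

Note that collisions are possible: two distinct terms can hash to the same bucket, which is the price paid for a fixed-size vector without maintaining a vocabulary.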