The spark.ml package uses DataFrames for ML workflows. Depending on the use case, one might need to extract features from a raw DataFrame, transform the DataFrame into the format an ML algorithm requires, or select just a few columns to use as feature vectors. Each of these operations has its own set of APIs, which can be grouped into the following categories.
When the data in a raw DataFrame is not already in the form an ML algorithm expects, we use feature extractors to derive those features. Common feature extractors are:
- CountVectorizer: A CountVectorizer converts a collection of text documents into vectors of word counts, one vector per document.
CountVectorizer works in two different ways, depending on how its dictionary (vocabulary) gets populated. Let's first assume that the user has no prior information about the type of data that will populate the dataset...
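To make the idea concrete, here is a minimal plain-Python sketch of count vectorization when no dictionary is known in advance: the vocabulary is first learned from the documents themselves, and each document is then turned into a vector of counts over that vocabulary. This illustrates the concept only and is not Spark's implementation. The helper names `fit_vocabulary` and `transform` are hypothetical, the sample documents are made up, and the alphabetical tie-break is an assumption added to keep the output deterministic.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size=None):
    """Learn a vocabulary from tokenized documents (hypothetical helper).

    Terms are ordered by how often they occur across the whole corpus,
    most frequent first. Ties are broken alphabetically, which is an
    assumption made here only to keep the result deterministic.
    """
    corpus_counts = Counter(token for doc in docs for token in doc)
    ordered = sorted(corpus_counts, key=lambda t: (-corpus_counts[t], t))
    # vocab_size=None keeps every term; otherwise keep only the top terms.
    return ordered[:vocab_size]


def transform(doc, vocabulary):
    """Turn one tokenized document into a count vector over the vocabulary."""
    counts = Counter(doc)
    # Terms that are not in the vocabulary are ignored; missing terms count as 0.
    return [counts[term] for term in vocabulary]


# Made-up sample data: two already-tokenized documents.
docs = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]

vocabulary = fit_vocabulary(docs)
vectors = [transform(doc, vocabulary) for doc in docs]

print(vocabulary)  # ['a', 'b', 'c']
print(vectors)     # [[1, 1, 1], [2, 2, 1]]
```

When the dictionary is known up front, the fitting step can be skipped: the same `transform` helper can be called directly with a predefined vocabulary list.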