Data scientists often need to work with unstructured data such as free-flow text. Companies receive customer feedback and recommendations (among other things), which can be a gold mine for predicting a customer's next move or gauging their sentiment toward a brand.
In this recipe, we will learn how to extract features from text.
To execute this recipe, you will need a working Spark environment.
No other prerequisites are required.
A general process for extracting data from text and converting it into something a machine learning model can use starts with the free-flow text itself. The first step is to take each sentence and split it into words, most often on the space character. Next, all the stop words (very common words, such as "the" or "and", that carry little meaning on their own) are removed. Finally, either counting how often each distinct word occurs or applying the hashing trick, which maps each word to an index in a fixed-size vector, gives us a numerical representation of the free-flow text.
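To make these steps concrete before turning to Spark, here is a minimal plain-Python sketch of the same process. It is an illustration only, not Spark's implementation: the function names, the tiny stop-word list, the sample sentence, and the choice of 16 buckets are all assumptions made for this example. CRC32 stands in for the hash function simply because it returns the same value on every run, unlike Python's built-in hash() for strings.

```python
import zlib
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "it", "and", "of", "to", "was"}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def count_words(tokens):
    """Count how often each distinct word occurs."""
    return Counter(tokens)


def hash_features(tokens, num_features=16):
    """Hashing trick: map each token to a bucket and count hits per bucket."""
    vector = [0] * num_features
    for token in tokens:
        # CRC32 is deterministic across runs; the modulo picks the bucket.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


# Hypothetical piece of customer feedback.
sentence = "The delivery was fast and the support team was helpful"

tokens = remove_stop_words(tokenize(sentence))
counts = count_words(tokens)
hashed = hash_features(tokens)

print(tokens)
print(counts)
print(hashed)
```

Word counts need a vocabulary that grows with the data, whereas the hashed vector has a fixed length chosen up front. The price of that fixed length is that two different words can land in the same bucket.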
Here's how to achieve this with Spark's ML module:
some_text = spark.createDataFrame...