Sometimes, it is necessary to cluster text documents into buckets based on their content.
In this recipe, we will walk through an example of assigning a topic to a set of short paragraphs extracted from Wikipedia.
To execute this recipe, you will need a working Spark environment.
No other prerequisites are required.
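The code in this recipe refers to a `spark` session object. In the `pyspark` shell and in most notebook setups that object already exists. In a standalone script, a minimal sketch for obtaining one might look like the following. The helper name `get_spark`, the application name, and the `local[*]` master are assumptions made for illustration and are not part of the recipe:

```python
def get_spark(app_name="topic-mining", master="local[*]"):
    """Return a SparkSession, reusing the active one if it exists.

    Hypothetical helper: the name and default arguments are illustrative.
    """
    # Imported lazily so that defining the helper does not itself
    # require PySpark to be importable.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(master)        # run locally, using all available cores
        .appName(app_name)     # label shown in the Spark UI
        .getOrCreate()         # reuse an existing session when present
    )
```

Calling `spark = get_spark()` once at the top of a script would then make `spark.createDataFrame(...)` available, assuming PySpark and a compatible Java runtime are installed.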
To cluster the documents, we first need to extract features from the articles. Note that the article text below is abbreviated to save space; refer to the GitHub repository for the full code:
articles = spark.createDataFrame([ (''' The Andromeda Galaxy, named after the mythological Princess Andromeda, also known as Messier 31, M31, or NGC 224, is a spiral galaxy approximately 780 kiloparsecs (2.5 million light-years) from Earth, and the nearest major galaxy to the Milky Way. Its name stems from the area of the sky in which it appears, the constellation of Andromeda...