Topic modeling refers to the process of identifying hidden patterns in text data. The goal is to uncover the hidden thematic structure in a collection of documents. This helps us organize the documents so that they are easier to use for analysis. Topic modeling is an active area of research in NLP. You can learn more about it at http://www.cs.columbia.edu/~blei/topicmodeling.html. We will use a library called gensim in this recipe. Make sure that you install it before you proceed. The installation steps are given at https://radimrehurek.com/gensim/install.html.
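Before reaching for the library, it can help to see the kind of input a topic model works from. The following is a minimal sketch, not the recipe's own code: it uses only the Python standard library to turn a few documents into bag-of-words vectors, which is roughly what gensim's corpora.Dictionary and its doc2bow method produce. The sample sentences, the tiny stop-word list, and the helper names tokenize, build_vocabulary, and to_bag_of_words are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny hypothetical corpus; this is not the recipe's input file.
documents = [
    "The stock market fell as investors sold shares",
    "Investors bought shares when the market recovered",
    "The team won the match with a late goal",
]

# Small illustrative stop-word list (an assumption; NLTK ships a fuller one).
STOP_WORDS = {"the", "a", "as", "when", "with"}


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]


def build_vocabulary(token_lists):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bag_of_words(tokens, vocab):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(vocab[t] for t in tokens)
    return sorted(counts.items())


token_lists = [tokenize(doc) for doc in documents]
vocab = build_vocabulary(token_lists)
bags = [to_bag_of_words(tokens, vocab) for tokens in token_lists]

for doc, bag in zip(documents, bags):
    print(doc, "->", bag)
```

The first two documents share the ids for "market", "investors", and "shares", while the third shares none with them. A topic model exploits exactly this kind of word co-occurrence to group documents by theme.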
Create a new Python file and import the following packages:
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora
from nltk.corpus import stopwords
Define a function to load the input data. We will use the data_topic_modeling.txt text file that is already provided to you:
# Load input...