Topic modeling is the process of identifying hidden patterns in text data. The goal is to uncover the hidden thematic structure in a collection of documents.
- Import the following packages:
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora
from nltk.corpus import stopwords
- Load the input data:
def load_words(in_file):
    element = []
    with open(in_file, 'r') as f:
        for line in f:
            # Strip only the trailing newline, so a last line without one stays intact
            element.append(line.rstrip('\n'))
    return element
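As a quick check of the loader, the sketch below writes a small temporary file and reads it back. The sample text and the use of `tempfile` are illustrative assumptions, not part of the recipe. This variant strips the newline with `rstrip('\n')` rather than slicing off the last character, so a final line without a trailing newline is kept whole.

```python
import os
import tempfile


def load_words(in_file):
    """Return the lines of in_file as a list, without trailing newlines."""
    element = []
    with open(in_file, 'r') as f:
        for line in f:
            element.append(line.rstrip('\n'))
    return element


# Hypothetical sample data; note that the last line has no trailing newline
sample = "The first document.\nThe second document."
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

lines = load_words(path)
os.remove(path)  # clean up the temporary file
print(lines)
```

Each element of the returned list is one document, which is the unit the later preprocessing steps operate on.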
- Define a class to preprocess the text:
class Preprocedure(object):
    def __init__(self):
        # Create a regular expression tokenizer that matches runs of word characters
        self.tokenizer = RegexpTokenizer(r'\w+')
- Get the list of English stop words so that they can be excluded from the analysis:
        self.english_stop_words = stopwords.words('english')
- Create a Snowball stemmer:
        self.snowball_stemmer = SnowballStemmer('english')
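To see what the tokenizer pattern and the stop word list do to a sentence, here is a minimal standard-library sketch. It rests on a few assumptions: that matching `\w+` with `re.findall` is a fair stand-in for the regular expression tokenizer, that text is lowercased before filtering, and that a tiny hand-written set (`DEMO_STOP_WORDS`, a hypothetical name) can stand in for NLTK's full English list. The helper names `tokenize` and `remove_stop_words` are also hypothetical. Stemming is left out because it needs NLTK.

```python
import re

# Hypothetical, deliberately tiny stand-in for stopwords.words('english')
DEMO_STOP_WORDS = {'the', 'is', 'a', 'of', 'in', 'to'}


def tokenize(text):
    """Split lowercased text into runs of word characters."""
    return re.findall(r'\w+', text.lower())


def remove_stop_words(tokens, stop_words=DEMO_STOP_WORDS):
    """Drop tokens that appear in the stop word set."""
    return [tok for tok in tokens if tok not in stop_words]


text = "The goal is to uncover the hidden structure in a collection of documents."
tokens = tokenize(text)
filtered = remove_stop_words(tokens)
print(tokens)
print(filtered)
# ['goal', 'uncover', 'hidden', 'structure', 'collection', 'documents']
```

The backslash in the pattern matters: without it, `w+` matches only runs of the literal letter `w`, so almost every word would be lost.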
- Define a function to perform...