The WordListCorpusReader class is one of the simplest CorpusReader classes. It provides access to a file containing a list of words, one word per line. In fact, you already used it with the stopwords corpus in Chapter 1, Tokenizing Text and WordNet Basics, in the Filtering stopwords in a tokenized sentence and Discovering word collocations recipes.
We need to start by creating a wordlist file. This could be a single-column CSV file, or just a plain text file with one word per line. Let's create a file named wordlist that looks like this:
nltk
corpus
corpora
wordnet
Now we can instantiate a WordListCorpusReader that will produce a list of words from our file. It takes two arguments: the path of the directory containing the files, and a list of filenames. If you open the Python console in the same directory as the files, then '.' can be used as the directory path. Otherwise, you must use a directory path such as nltk_data/corpora...
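The steps so far can be sketched as a short script. This is a minimal illustration rather than the recipe's exact code: it assumes NLTK is installed, and it writes the wordlist file into a temporary directory (the corpus_dir name is made up for this example) so that it runs from anywhere instead of depending on the current directory.

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader

# Assumption for this sketch: use a throwaway directory instead of '.',
# so the example does not depend on where the console was started.
corpus_dir = tempfile.mkdtemp()

# Write the wordlist file, one word per line.
with open(os.path.join(corpus_dir, 'wordlist'), 'w', encoding='utf-8') as f:
    f.write('nltk\ncorpus\ncorpora\nwordnet\n')

# First argument: the directory containing the files.
# Second argument: a list of filenames inside that directory.
reader = WordListCorpusReader(corpus_dir, ['wordlist'])

words = reader.words()      # every word from the file, in order
fileids = reader.fileids()  # the filenames the reader knows about

print(words)
print(fileids)
```

If the file sits in the directory where the console was opened, passing '.' in place of corpus_dir should behave the same way.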