Removing short words can also help strip noise words from the content. The following recipe removes words of a certain length or shorter. It also demonstrates the opposite: selecting the words that are not considered short (those longer than the specified short-word length).
Identifying and removing rare words
How to do it
We can leverage the frequency distribution from NLTK to identify the short words efficiently. We could scan every word in the source, but it is more efficient to check the lengths of the keys in the resulting distribution, because each distinct word appears there only once, making it a significantly smaller set of data:
- The script in the 07/08_short_words.py file exemplifies this process. It starts by loading...
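The idea described above can be illustrated with a minimal sketch. This is not the content of the script itself; the function name `split_by_length`, the `max_short_len` parameter, and the sample sentence are assumptions made for illustration. It relies on NLTK's `FreqDist` being a subclass of `collections.Counter`, so a plain `Counter` is used as a stand-in if NLTK is not installed.

```python
from collections import Counter

try:
    # FreqDist is a Counter subclass, so its keys are the distinct words.
    from nltk.probability import FreqDist
except ImportError:
    # Stand-in (assumption): Counter behaves the same for this purpose.
    FreqDist = Counter


def split_by_length(words, max_short_len=3):
    """Hypothetical helper: return (short_words, remaining_words)."""
    freq_dist = FreqDist(words)
    # Scan only the distinct keys of the distribution, not every token.
    short_words = {w for w in freq_dist if len(w) <= max_short_len}
    # Keep the tokens that are longer than the short-word length.
    remaining_words = [w for w in words if w not in short_words]
    return short_words, remaining_words


# Illustrative input; a real run would use the tokenized source content.
sample_words = "the quick brown fox is on a very tall hill and the fox is happy".split()
short_words, remaining = split_by_length(sample_words, max_short_len=3)
print(sorted(short_words))
print(remaining)
```

Building the set of short words from the distribution keys means each distinct word has its length checked only once, regardless of how often it occurs in the source.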