You can use regular expression matching to tag words. For example, you can match numbers with \d to assign the tag CD (which refers to a cardinal number). Or you could match on known word patterns, such as the suffix "ing". There's a lot of flexibility here, but be careful of over-specifying, since language is naturally inexact and there are always exceptions to the rule.
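As a rough illustration of the idea, and of how easily a suffix rule picks up exceptions, here is a minimal sketch that uses only Python's re module. The helper name guess_tag is hypothetical and is not part of NLTK or tag_util.py.

```python
import re

# Hypothetical helper for illustration only; not part of NLTK or tag_util.py.
def guess_tag(word):
    """Return a tag for word based on two simple patterns, or None."""
    if re.match(r'^\d+$', word):    # one or more digits and nothing else
        return 'CD'
    if re.match(r'.*ing$', word):   # any word ending in "ing"
        return 'VBG'
    return None                     # no pattern matched

for word in ['42', 'wondering', 'king', 'cat']:
    print(word, guess_tag(word))
```

Note that "king" is tagged VBG even though it is a noun, which is exactly the kind of exception the suffix rule cannot see.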
For this recipe to make sense, you should be familiar with regular expression syntax and Python's re module.
The RegexpTagger class expects a list of 2-tuples, where the first element of each tuple is a regular expression and the second element is the tag. The patterns shown in the following code can be found in tag_util.py:
patterns = [
    (r'^\d+$', 'CD'),
    (r'.*ing$', 'VBG'),  # gerunds, e.g. wondering
    (r'.*ment$', 'NN'),  # e.g. wonderment
    (r'.*ful$', 'JJ'),   # e.g. wonderful
]
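To get a feel for how an ordered list of (pattern, tag) pairs behaves, the following plain-Python sketch tries each pattern in turn and keeps the first one that matches. It assumes first-match-wins semantics and a tag of None for unmatched words; the function tag_words is hypothetical and is not NLTK's implementation.

```python
import re

patterns = [
    (r'^\d+$', 'CD'),
    (r'.*ing$', 'VBG'),  # gerunds, e.g. wondering
    (r'.*ment$', 'NN'),  # e.g. wonderment
    (r'.*ful$', 'JJ'),   # e.g. wonderful
]

# Hypothetical stand-in for illustration; not the RegexpTagger class itself.
def tag_words(words, patterns):
    """Pair each word with the tag of the first pattern that matches it."""
    compiled = [(re.compile(regex), tag) for regex, tag in patterns]
    tagged = []
    for word in words:
        tag = None  # assumed default when no pattern matches
        for regex, candidate in compiled:
            if regex.match(word):
                tag = candidate
                break  # earlier patterns take priority over later ones
        tagged.append((word, tag))
    return tagged

print(tag_words(['10', 'wondering', 'wonderment', 'wonderful', 'the'], patterns))
```

Because the first match wins, the order of the list matters: a broad pattern placed early would shadow any more specific patterns after it.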
Once you've constructed this list of patterns, you can pass it into RegexpTagger...