Application-specific types of preprocessing
The preprocessing techniques covered in the previous sections apply to many types of text across many applications. Some applications call for additional preprocessing steps, which we will cover in the next sections.
Substituting class labels for words and numbers
Sometimes data includes specific words or tokens that have equivalent semantics. For example, a text corpus might include the names of US states, but for the purposes of the application, we only care that some state was mentioned – we don’t care which one. In that case, we can substitute a class token for the specific state name. Consider the interaction in Figure 5.10:
Figure 5.10 – Class token substitution
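As a minimal sketch of class token substitution, the following Python snippet replaces state names with a single class token using a regular expression. The `STATE_NAMES` list (a small illustrative subset, not all 50 states) and the `substitute_state_names` function are hypothetical names introduced here for illustration. The regular-expression approach is one possible way to do the substitution, not necessarily the one used to produce the figure.

```python
import re

# Illustrative subset only (an assumption); a real application would list all 50 states.
STATE_NAMES = ["Texas", "California", "New York", "West Virginia", "Virginia"]

# One alternation over all names, bounded by \b so that names are matched
# only as whole words, and case-insensitive to tolerate inconsistent casing.
STATE_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in STATE_NAMES) + r")\b",
    flags=re.IGNORECASE,
)


def substitute_state_names(text, class_token="<state_name>"):
    """Return text with every listed state name replaced by class_token."""
    return STATE_PATTERN.sub(class_token, text)


print(substitute_state_names("I moved from Texas to West Virginia."))
# I moved from <state_name> to <state_name>.
```

After this step, downstream processing sees the same token regardless of which state was mentioned. If the application later needs the specific state, the original text would have to be kept alongside the substituted version.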
If we substitute the class token, <state_name>, for Texas, all of the other state names will be easier to recognize, because instead of having to learn 50 states, the system...