Before we proceed to detailed algorithms, let's look at a generic text-processing pipeline depicted in Figure 9-1. In text analysis, the input is usually presented as a stream of characters (in an encoding that depends on the specific language).
Lexical analysis has to do with breaking this stream into a sequence of words (or lexemes, in linguistic terms). It is often also called tokenization (with the words called tokens). ANother Tool for Language Recognition (ANTLR) (http://www.antlr.org/) and Flex (http://flex.sourceforge.net) are probably the most famous such tools in the open source community. One of the classical examples of ambiguity is lexical ambiguity. For example, in the phrase "I saw a bat," the word bat can mean either an animal or a baseball bat. We usually need context to figure this out, which is what we will discuss next:
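To make the idea of tokenization concrete, here is a minimal sketch (not the approach of ANTLR or Flex, which generate lexers from formal token grammars): a single regular expression that splits a character stream into word and punctuation tokens.

```python
import re

# A token is either a run of word characters or a single
# non-whitespace punctuation character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Break a stream of characters into a sequence of tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("I saw a bat."))  # → ['I', 'saw', 'a', 'bat', '.']
```

Note that even this toy tokenizer must make decisions a real lexer faces, such as treating the sentence-final period as a separate token rather than part of the word "bat".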
Syntactic analysis, or parsing, traditionally deals with matching the structure of the text with grammar rules. This is relatively...