This recipe consumes the URL stream, downloading the document content and deriving a clean stream of terms that are suitable for later analysis. A clean term is defined as a word that:
Is not a stop word
Is a valid dictionary word
Is not a number or URL
Is a lemma
A lemma is the canonical form of a word; for example, run, runs, ran, and running are forms of the same lexeme, with "run" as the lemma. A lexeme, in this context, is the set of all forms that share the same meaning, and the lemma is the particular form chosen by convention to represent that lexeme.
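The four criteria for a clean term can be sketched as a single predicate. This is a minimal illustration, not the recipe's implementation: the names `STOP_WORDS`, `DICTIONARY`, `LEMMAS`, and `is_clean_term` are hypothetical, and the tiny in-memory word lists stand in for whatever stop-word list, dictionary, and lemmatizer the recipe actually uses.

```python
import re

# Hypothetical stand-ins for a real stop-word list, dictionary, and lemmatizer.
STOP_WORDS = {"the", "a", "is", "and", "of"}
DICTIONARY = {"run", "runs", "ran", "running", "storm", "stream"}
LEMMAS = {"runs": "run", "ran": "run", "running": "run"}  # inflected form -> lemma

# Assumed, simplified patterns for recognising URLs and numbers.
URL_PATTERN = re.compile(r"^(https?://|www\.)", re.IGNORECASE)
NUMBER_PATTERN = re.compile(r"^[+-]?\d+([.,]\d+)*$")


def is_clean_term(word):
    """Return True if the word satisfies all four clean-term criteria."""
    w = word.lower()
    if w in STOP_WORDS:                                   # not a stop word
        return False
    if URL_PATTERN.match(w) or NUMBER_PATTERN.match(w):   # not a number or URL
        return False
    if w not in DICTIONARY:                               # a valid dictionary word
        return False
    # A word is a lemma if it has no entry mapping it to a different form.
    return LEMMAS.get(w, w) == w
```

With these sample lists, "run" passes every check, while "running" is rejected because it is an inflected form rather than the lemma.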
The lemma is important for this recipe because it enables us to group terms that have the same meaning. This grouping matters wherever the frequency of occurrence is analyzed: without it, the count for a single meaning would be split across its different word forms.
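To illustrate why grouping by lemma affects frequency counts, the sketch below maps each word to its lemma before counting. The `LEMMAS` table and the `lemma_frequencies` function are hypothetical names used only for this example; a real pipeline would obtain lemmas from a lemmatizer rather than a hand-written map.

```python
from collections import Counter

# Hypothetical lookup from inflected form to lemma; unknown words map to themselves.
LEMMAS = {"runs": "run", "ran": "run", "running": "run"}


def lemma_frequencies(words):
    """Count occurrences per lemma rather than per surface form."""
    return Counter(LEMMAS.get(w.lower(), w.lower()) for w in words)


counts = lemma_frequencies(["run", "Runs", "ran", "running", "stream"])
```

Here all four forms of the lexeme are counted under "run", so its frequency is 4 instead of four separate counts of 1.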