Natural Language Processing with Java and LingPipe Cookbook

Modifying tokenizer factories


In this recipe, we will describe a tokenizer that modifies the tokens in the token stream. We will extend the ModifyTokenTokenizerFactory class to return text that is rotated by 13 places in the English alphabet, also known as rot-13. Rot-13 is a very simple substitution cipher that replaces each letter with the letter 13 places after it, wrapping around at the end of the alphabet. For example, the letter a is replaced by the letter n, and the letter z is replaced by the letter m. Because the English alphabet has 26 letters, this is a reciprocal cipher: applying it twice recovers the original text.
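The letter rotation itself can be sketched in a few lines of plain Java. This is only an illustration of the cipher, not the recipe's source code: the class name Rot13Sketch and the method rot13 are hypothetical, and the sketch assumes that only the ASCII letters a-z and A-Z are rotated while every other character passes through unchanged.

```java
public class Rot13Sketch {

    /**
     * Rotates ASCII letters by 13 places, wrapping around the alphabet.
     * Assumption: non-letters (digits, punctuation, spaces) are left as-is.
     */
    public static String rot13(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 'a' && c <= 'z') {
                sb.append((char) ('a' + (c - 'a' + 13) % 26));
            } else if (c >= 'A' && c <= 'Z') {
                sb.append((char) ('A' + (c - 'A' + 13) % 26));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rot13("a"));           // prints n
        System.out.println(rot13("z"));           // prints m
        System.out.println(rot13(rot13("move"))); // prints move
    }
}
```

The last call shows the reciprocal property: rotating by 13 twice is a rotation by 26, which maps every letter back to itself.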

How to do it...

We will invoke the Rot13TokenizerFactory class from the command line:

java -cp "lingpipe-cookbook.1.0.jar:lib/lingpipe-4.1.0.jar" com.lingpipe.cookbook.chapter2.Rot13TokenizerFactory

type a sentence below to see the tokens and white spaces:
Move along, nothing to see here.
Token:'zbir'
Token:'nybat'
Token:','
Token:'abguvat'
Token:'gb'
Token:'frr'
Token:'urer'
Token:'.'
Modified Output: zbir...
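
To make the token list above easier to follow, here is a minimal, self-contained sketch of the general idea of modifying each token in a stream. It does not use LingPipe: a regular expression stands in for the real tokenizer factory, and the names ModifyTokensSketch, modifyTokens, and rot13 are hypothetical. Since the sample output is all lower case, the sketch also assumes that the text is lower-cased before rotation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ModifyTokensSketch {

    // Stand-in tokenizer (an assumption, not LingPipe's): a token is a run of
    // letters or digits, or any single non-whitespace symbol.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+|\\S");

    /** Tokenizes the lower-cased text and applies the modifier to every token. */
    public static List<String> modifyTokens(String text, UnaryOperator<String> modifier) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text.toLowerCase(Locale.ROOT));
        while (matcher.find()) {
            tokens.add(modifier.apply(matcher.group()));
        }
        return tokens;
    }

    /** Rotates lower-case ASCII letters by 13 places; other characters are unchanged. */
    public static String rot13(String token) {
        char[] chars = token.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] >= 'a' && chars[i] <= 'z') {
                chars[i] = (char) ('a' + (chars[i] - 'a' + 13) % 26);
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        String input = "Move along, nothing to see here.";
        for (String token : modifyTokens(input, ModifyTokensSketch::rot13)) {
            System.out.println("Token:'" + token + "'");
        }
    }
}
```

Passing the modifier as a function keeps tokenization and token rewriting separate, which is the same division of labor the recipe gets by wrapping an existing tokenizer factory and overriding only the per-token step.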