Natural Language Processing with Java and LingPipe Cookbook

Finding words for languages without white spaces


Languages such as Chinese do not mark word boundaries with white space. For example, 木卫三是围绕木星运转的一颗卫星,公转周期约为7天, a sentence from Wikipedia, translates roughly as "Ganymede is running around Jupiter's moons, orbital period of about seven days" according to the machine translation service at https://translate.google.com. Notice the absence of white spaces.

Finding tokens in this sort of data requires a very different approach, one based on character language models and our spell-checking class. This recipe finds words by treating untokenized text as misspelled text, where each correction inserts a space to delimit tokens. Of course, there is nothing misspelled about Chinese, Japanese, Vietnamese, and other orthographies that do not delimit words; we have simply encoded the task in our spelling-correction class.
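To make the idea of "correction by inserting spaces" concrete, here is a minimal, self-contained sketch. It is not the LingPipe implementation: in place of a character language model and the spelling-correction class, it assumes a small hand-made table of word log probabilities and uses dynamic programming to pick the best-scoring set of space insertions. The class name, the table values, and the unknown-character penalty are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for the recipe's idea: choose where to insert spaces so that
// the resulting token sequence scores best under a model.
// ASSUMPTION: a hand-made word table replaces the character language model
// and the spelling-correction class described in the text.
class SpaceInsertionSketch {

    // Hypothetical log probabilities; illustrative values, not trained.
    static final Map<String, Double> LOG_PROB = new HashMap<>();
    static {
        LOG_PROB.put("the", -2.0);
        LOG_PROB.put("rain", -5.0);
        LOG_PROB.put("in", -3.0);
        LOG_PROB.put("spain", -6.0);
        LOG_PROB.put("pain", -6.0);
    }

    // Cost per character of a span that is not in the table (assumed value).
    static final double UNKNOWN_CHAR_PENALTY = -20.0;
    static final int MAX_WORD_LENGTH = 10;

    // Score of treating one span of characters as a single token.
    static double score(String candidate) {
        Double logProb = LOG_PROB.get(candidate);
        return logProb != null ? logProb : UNKNOWN_CHAR_PENALTY * candidate.length();
    }

    // Returns the highest-scoring split of text into tokens.
    static List<String> segment(String text) {
        int n = text.length();
        double[] best = new double[n + 1]; // best[i]: best score for text[0, i)
        int[] back = new int[n + 1];       // back[i]: start of the last token in that split
        for (int end = 1; end <= n; end++) {
            best[end] = Double.NEGATIVE_INFINITY;
            for (int start = Math.max(0, end - MAX_WORD_LENGTH); start < end; start++) {
                double candidate = best[start] + score(text.substring(start, end));
                if (candidate > best[end]) {
                    best[end] = candidate;
                    back[end] = start;
                }
            }
        }
        // Walk the back pointers from the end of the text to recover the tokens.
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = back[end]) {
            tokens.add(text.substring(back[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        // prints "the rain in spain"
        System.out.println(String.join(" ", segment("theraininspain")));
    }
}
```

The sketch only ever considers one kind of edit, the insertion of a space, which mirrors how the recipe restricts spelling correction. A character language model would score the spaced-out text directly instead of looking words up in a table, so it can propose tokens it has never seen as whole words.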

Getting ready

We will approximate orthographies that do not delimit words with English text that has had its white space removed. This is sufficient to understand the recipe...
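One way to prepare such data, sketched here under the assumption that ordinary white space-delimited English is the starting point, is to strip the white space and remember where the boundaries were so that a segmenter's output can be checked later. The class and method names below are hypothetical and do not come from the book's code.

```java
import java.util.ArrayList;
import java.util.List;

// Builds test input by stripping white space from English text.
// ASSUMPTION: class and method names are illustrative only.
class DeWhiteSpacer {

    // Removes every white space character, e.g. "the rain" becomes "therain".
    static String removeWhiteSpace(String text) {
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!Character.isWhitespace(c)) {
                kept.append(c);
            }
        }
        return kept.toString();
    }

    // Offsets in the de-white-spaced string at which a token boundary used to be.
    static List<Integer> boundaryOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        int keptCount = 0;               // characters kept so far
        boolean pendingBoundary = false; // white space seen since the last kept character
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                pendingBoundary = keptCount > 0; // leading white space is not a boundary
            } else {
                if (pendingBoundary) {
                    offsets.add(keptCount);
                }
                pendingBoundary = false;
                keptCount++;
            }
        }
        return offsets; // trailing white space never produces a boundary
    }

    public static void main(String[] args) {
        String original = "the rain in spain";
        System.out.println(removeWhiteSpace(original)); // prints "theraininspain"
        System.out.println(boundaryOffsets(original));  // prints "[3, 7, 9]"
    }
}
```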