Natural Language Processing with Java and LingPipe Cookbook

Cross-document coreference


Cross-document coreference (XDoc) takes the ID space of an individual document and makes it global to a larger universe. This universe typically includes other processed documents and databases of known entities. While the annotation is trivial (all that one needs to do is swap the document-scope IDs for the universe-scope IDs), the calculation of XDoc can be quite difficult.
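
To make the "trivial" annotation step concrete, here is a minimal sketch of swapping document-scope IDs for universe-scope IDs once a mapping has already been computed. It is not the recipe's code and does not use any LingPipe API: the `XDocIdSwap` and `Mention` classes, the `docId:localId` key format, and the ID values are all hypothetical and chosen only for illustration. The hard part, computing the mapping, is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class XDocIdSwap {

    /** A mention with its entity ID (hypothetical type, for illustration only). */
    static final class Mention {
        final String text;
        final String entityId;

        Mention(String text, String entityId) {
            this.text = text;
            this.entityId = entityId;
        }
    }

    /**
     * Returns copies of the mentions with document-scope IDs replaced by
     * universe-scope IDs. Keys in localToGlobal are assumed to have the
     * form "docId:localId".
     */
    static List<Mention> toUniverseScope(String docId,
                                         List<Mention> mentions,
                                         Map<String, String> localToGlobal) {
        List<Mention> result = new ArrayList<>();
        for (Mention mention : mentions) {
            String key = docId + ":" + mention.entityId;
            // With no mapping, fall back to the document-qualified key so the
            // ID is still unique across documents.
            String globalId = localToGlobal.getOrDefault(key, key);
            result.add(new Mention(mention.text, globalId));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical output of an XDoc resolver: the same person appears
        // under local ID 1 in document 1 and local ID 0 in document 2.
        Map<String, String> localToGlobal = new HashMap<>();
        localToGlobal.put("1:0", "1000");
        localToGlobal.put("1:1", "1001");
        localToGlobal.put("2:0", "1001");

        List<Mention> doc1 = Arrays.asList(
                new Mention("Breck Baldwin", "0"),
                new Mention("Krishna Dayanidhi", "1"));
        List<Mention> doc2 = Arrays.asList(
                new Mention("Krishna Dayanidhi", "0"));

        for (Mention m : toUniverseScope("1", doc1, localToGlobal)) {
            System.out.println("doc 1: " + m.text + " -> " + m.entityId);
        }
        for (Mention m : toUniverseScope("2", doc2, localToGlobal)) {
            System.out.println("doc 2: " + m.text + " -> " + m.entityId);
        }
    }
}
```

In this sketch, both mentions of Krishna Dayanidhi end up with the same universe-scope ID even though their document-scope IDs differ, which is the whole point of the swap.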

This recipe shows how to use a lightweight implementation of XDoc that was developed over years of deploying such systems. We will provide a code overview for those who might want to extend or modify the code, but there is a lot going on, and the recipe is quite dense.

The input is XML, and each file can contain multiple documents:

<doc id="1">
<title/>
<content>
Breck Baldwin and Krishna Dayanidhi wrote a book about LingPipe. 
</content>
</doc>

<doc id="2">
<title/>
<content>
Krishna Dayanidhi is a developer. Breck...