Cross-document coreference (XDoc) takes the ID space of an individual document and makes it global to a larger universe. This universe typically includes other processed documents and databases of known entities. The annotation itself is trivial: all one needs to do is swap the document-scope IDs for the universe-scope IDs. Computing XDoc, however, can be quite difficult.
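To make the "swap" concrete, here is a minimal Java sketch of the annotation step only. The class `XDocIdRemapper`, its methods, and the numeric IDs are hypothetical names invented for illustration; they are not part of LingPipe, and the hard part (deciding which document-scope entities should share a universe-scope ID) is assumed to have been done already.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not a LingPipe class: stores the mapping from
// (document id, document-scope entity id) to a universe-scope entity id.
final class XDocIdRemapper {

    // Key is docId + "#" + localId; value is the universe-scope id.
    private final Map<String, Long> globalIds = new HashMap<>();

    private static String key(String docId, int localId) {
        return docId + "#" + localId;
    }

    // Records the result of the (difficult) resolution step, assumed done elsewhere.
    void link(String docId, int localId, long globalId) {
        globalIds.put(key(docId, localId), globalId);
    }

    // The (trivial) annotation step: swap a document-scope id for its universe-scope id.
    long toGlobal(String docId, int localId) {
        Long id = globalIds.get(key(docId, localId));
        if (id == null) {
            throw new IllegalArgumentException(
                "No universe-scope id for entity " + localId + " in doc " + docId);
        }
        return id;
    }

    public static void main(String[] args) {
        XDocIdRemapper remapper = new XDocIdRemapper();
        // Made-up ids: two mentions of the same person in different
        // documents are linked to one universe-scope id.
        remapper.link("1", 1, 1001L);
        remapper.link("2", 0, 1001L);
        System.out.println(remapper.toGlobal("1", 1) == remapper.toGlobal("2", 0));
    }
}
```

Once the links exist, entity 1 of one document and entity 0 of another resolve to the same universe-scope ID, which is all the annotation amounts to. Producing those links is where the difficulty lies.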
This recipe shows how to use a lightweight implementation of XDoc developed over years of deploying such systems. We will provide a code overview for those who might want to extend or modify the code, but there is a lot going on, and the recipe is quite dense.
The input is in an XML format in which each file can contain multiple documents:
<doc id="1">
<title/>
<content>
Breck Baldwin and Krishna Dayanidhi wrote a book about LingPipe.
</content>
</doc>
<doc id="2">
<title/>
<content>
Krishna Dayanidhi is a developer. Breck...