This chapter shows how to work with spans of text that typically cover one or more words or tokens. The LingPipe API represents this unit of text as a chunk, and chunkers produce chunkings. The following is some text with character offsets indicated:
LingPipe is an API. It is written in Java.
012345678901234567890123456789012345678901
          1         2         3         4
Chunking the preceding text into sentences gives the following output, where end is the offset of each sentence's last character:
Sentence start=0, end=18
Sentence start=20, end=41
Adding in a chunking for named entities adds entities for LingPipe and Java:
Organization start=0, end=7
Organization start=37, end=40
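The offsets above can be checked against the text with a few lines of Java. This is a minimal sketch that does not use LingPipe: SimpleChunk and ChunkOffsetsDemo are hypothetical names introduced only for illustration, and end is treated as the offset of the last character, as in the listings above. LingPipe's own Chunk interface may define end differently, so consult its Javadoc before relying on this convention.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a chunk; this is NOT LingPipe's Chunk interface.
final class SimpleChunk {
    final String type;
    final int start; // offset of the first character
    final int end;   // offset of the last character (inclusive, as in the listings)

    SimpleChunk(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }

    // Returns the characters this chunk covers in the given text.
    // String.substring takes an exclusive end index, hence the + 1.
    String textIn(String text) {
        return text.substring(start, end + 1);
    }
}

class ChunkOffsetsDemo {
    static final String TEXT = "LingPipe is an API. It is written in Java.";

    // The sentence and named-entity chunks from the listings, in order.
    static List<SimpleChunk> exampleChunks() {
        List<SimpleChunk> chunks = new ArrayList<>();
        chunks.add(new SimpleChunk("Sentence", 0, 18));
        chunks.add(new SimpleChunk("Sentence", 20, 41));
        chunks.add(new SimpleChunk("Organization", 0, 7));
        chunks.add(new SimpleChunk("Organization", 37, 40));
        return chunks;
    }

    public static void main(String[] args) {
        // Print each chunk together with the text it covers.
        for (SimpleChunk c : exampleChunks()) {
            System.out.println(c.type + " start=" + c.start + ", end=" + c.end
                    + " -> \"" + c.textIn(TEXT) + "\"");
        }
    }
}
```

Running main prints each chunk next to the substring it covers, which is a quick way to confirm that a set of offsets lines up with the intended words.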
We can also define the named-entity chunkings by their offsets within the sentences that contain them. This makes no difference for LingPipe, because its sentence starts at offset 0, but Java, whose sentence starts at offset 20, becomes:
Organization start=17, end=20
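The conversion from document offsets to sentence-relative offsets is a subtraction of the sentence's start offset. The following is a small sketch under that assumption; Span and SentenceRelativeOffsets are hypothetical names introduced here, not LingPipe classes, and end is again the offset of the last character.

```java
// Hypothetical typed span with inclusive end offset; not part of LingPipe.
final class Span {
    final String type;
    final int start;
    final int end;

    Span(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }

    @Override
    public String toString() {
        return type + " start=" + start + ", end=" + end;
    }
}

class SentenceRelativeOffsets {
    // Re-expresses an entity's offsets relative to the start of its sentence.
    static Span relativeTo(Span sentence, Span entity) {
        if (entity.start < sentence.start || entity.end > sentence.end) {
            throw new IllegalArgumentException("entity is not inside the sentence");
        }
        return new Span(entity.type,
                entity.start - sentence.start,
                entity.end - sentence.start);
    }

    public static void main(String[] args) {
        Span first = new Span("Sentence", 0, 18);
        Span second = new Span("Sentence", 20, 41);

        // LingPipe sits in the first sentence, which starts at 0, so nothing changes.
        System.out.println(relativeTo(first, new Span("Organization", 0, 7)));

        // Java sits in the second sentence, which starts at 20, so both offsets shift by 20.
        System.out.println(relativeTo(second, new Span("Organization", 37, 40)));
    }
}
```

Keeping the sentence span alongside the shifted entity makes the mapping reversible: adding the sentence's start offset back recovers the document offsets.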
This is the basic idea of chunks. There are lots of ways to make them.