Tokenizing text
You can see that the first step in this pipeline is tokenizing – what exactly is this?
Tokenization is the task of splitting a text into meaningful segments, called tokens. These segments could be words, punctuation, numbers, or other special characters that are the building blocks of a sentence. In spaCy, the input to the tokenizer is a Unicode text, and the output is a Doc object [19].
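To make this concrete, here is a minimal sketch of tokenization in spaCy. It assumes only that spaCy is installed; a blank English pipeline contains just the tokenizer, so no pretrained model is required for this step:

import spacy

# A blank English pipeline holds only the tokenizer, which is
# all we need to turn Unicode text into a Doc object.
nlp = spacy.blank("en")

doc = nlp("Let us go to the park.")
print(type(doc))  # <class 'spacy.tokens.doc.Doc'>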
Different languages will have different tokenization rules. Let's look at an example of how tokenization might work in English. The sentence Let us go to the park. is quite straightforward and would be broken up as follows, with the appropriate numerical indices:
0 | 1 | 2 | 3 | 4 | 5 | 6 |
Let | us | go | to | the | park | . |
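We can reproduce this index-to-token mapping by iterating over the Doc; the sketch below again assumes only a blank English pipeline:

import spacy

nlp = spacy.blank("en")  # the tokenizer alone is enough here
doc = nlp("Let us go to the park.")

# Each token carries its position in the Doc via token.i.
for token in doc:
    print(token.i, token.text)

# 0 Let
# 1 us
# 2 go
# 3 to
# 4 the
# 5 park
# 6 .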
This looks awfully like the result of simply running text.split(' ') – so when does tokenizing involve more effort?
If the previous sentence were Let's go to the park. instead, the tokenizer would have to be smart enough to split Let's into Let and 's. This means that there are some special rules to follow...
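The following sketch contrasts the two approaches on that sentence; the spaCy output shown is what the English rules are expected to produce, given the splitting of Let's described above:

import spacy

text = "Let's go to the park."

# Naive whitespace splitting keeps "Let's" and "park." glued together.
print(text.split(" "))   # ["Let's", 'go', 'to', 'the', 'park.']

# spaCy's English tokenizer applies special rules for contractions
# and punctuation, separating "Let's" into "Let" and "'s" and
# splitting off the trailing period.
nlp = spacy.blank("en")
print([token.text for token in nlp(text)])
# ['Let', "'s", 'go', 'to', 'the', 'park', '.']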