We dealt with word tokenization previously, but NLTK can perform it as well, along with sentence tokenization. Sentence tokenization is quite tricky, because English uses the period for abbreviations and other purposes, not only to end a sentence. Thankfully, NLTK's sentence tokenizer is an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module, which helps in splitting text into sentences.
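To see why periods make this tricky, here is a minimal sketch in plain Python. It is not NLTK's Punkt algorithm: the function names naive_split and abbreviation_aware_split, the sample sentence, and the hand-picked abbreviation list are all assumptions made for illustration. Punkt learns abbreviations from data rather than relying on a fixed list.

```python
import re

# Hypothetical, hand-picked abbreviation list for illustration only;
# a real tokenizer such as Punkt learns these from text instead.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof"}


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def abbreviation_aware_split(text):
    """Like naive_split, but rejoin pieces that end in a known abbreviation."""
    sentences = []
    carry = ""
    for piece in naive_split(text):
        candidate = f"{carry} {piece}" if carry else piece
        last_word = candidate.split()[-1].rstrip(".").lower()
        if candidate.endswith(".") and last_word in ABBREVIATIONS:
            carry = candidate  # not a real boundary; keep accumulating
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences


sample = "Dr. Smith saw the film. He liked it."
print(naive_split(sample))
# ['Dr.', 'Smith saw the film.', 'He liked it.']
print(abbreviation_aware_split(sample))
# ['Dr. Smith saw the film.', 'He liked it.']
```

The naive version breaks the text after "Dr.", which is exactly the kind of mistake a sentence tokenizer has to avoid. A fixed list like the one above will still miss abbreviations it has never seen, which is one reason to prefer a trained tokenizer.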
Let's look at word tokenization using this code:
>>> import nltk
>>> # Loading the Forbes data
>>> data = open('./Data/madmax_review/forbes.txt', 'r').read()
>>> word_data = nltk.word_tokenize(data)
>>> word_data[:15]
['Pundits', 'and', 'critics', 'like', 'to', 'blame', 'the', 'twin', 'successes', 'of', 'Jaws', 'and', 'Star', 'Wars', 'for']
Now, let's perform sentence tokenization on the Forbes article:
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(data)[:5]
['Pundits and critics like to blame the twin successes of Jaws and Star Wars for turning Hollywood into something of...