The Gigaword dataset has already been cleaned, normalized, and tokenized with the StanfordNLP tokenizer, and all of the text has been converted to lowercase, as seen in the preceding examples. The main task in this step is to create a vocabulary. A word-based tokenizer is the most common choice in summarization; however, we will use a subword tokenizer in this chapter. A subword tokenizer limits the size of the vocabulary while minimizing the number of unknown words. Chapter 3, Named Entity Recognition (NER) with BiLSTMs, CRFs, and Viterbi Decoding, described the different types of tokenizers, specifically in the part on BERT. Consequently, models such as BERT and GPT-2 use some variant of a subword tokenizer. The tfds package provides a way for us to create a subword tokenizer, initialized from a corpus of text. Since generating the vocabulary requires running it over all of the training data...
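A minimal sketch of that vocabulary-building step is shown below. It uses the TFDS SubwordTextEncoder's build_from_corpus method; the target_vocab_size of 2**15 and the output filename gigaword_vocab are illustrative choices, not values taken from the book:

```python
import tensorflow_datasets as tfds

# Load Gigaword via TFDS; as_supervised=True yields (document, summary) pairs.
data, info = tfds.load('gigaword', with_info=True, as_supervised=True)
train = data['train']

# Build a subword vocabulary from the training articles. In recent TFDS
# releases the encoder lives under tfds.deprecated.text (older releases
# expose it as tfds.features.text.SubwordTextEncoder).
tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (article.numpy() for article, summary in train),  # iterates the full split
    target_vocab_size=2**15)  # ~32k subwords; an illustrative size

print(tokenizer.vocab_size)

# Building the vocabulary over millions of articles is slow, so persist it
# and reload with SubwordTextEncoder.load_from_file('gigaword_vocab') later.
tokenizer.save_to_file('gigaword_vocab')
```

Saving the encoder to a file means the expensive corpus pass only needs to run once; subsequent training runs can simply reload the stored vocabulary.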