GPU-Accelerated Computing with Python 3 and CUDA
This section explains how to obtain the example text dataset and prepare it by tokenizing the individual samples.
To build our own language model, we need a text dataset that covers a wide range of general topics. Training on a very large dataset typically requires a considerable amount of GPU computation, which is beyond the educational scope of this chapter, so we will work with a smaller, more manageable dataset. We will use Hugging Face (HG), an open source platform for working with LLMs and datasets, to download the Large Movie Review (IMDb) dataset and a prebuilt tokenizer. The dataset contains a total of 100,000 movie reviews, which suits our text generation task well. We can download it from the HG data hub in just a few lines:
from datasets import load_dataset
dataset = load_dataset("stanfordnlp/imdb")
The load_dataset function allows us to easily download and load a wide variety of text datasets. stanfordnlp...
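To make the idea of tokenizing individual samples concrete, here is a minimal sketch of what a tokenizer does: it splits text into tokens and maps each token to an integer ID. This is a toy word-level tokenizer written only for illustration. It is not the prebuilt tokenizer used in this chapter, and prebuilt tokenizers commonly work on subword units rather than whole words. The function names `build_vocab` and `encode` and the two sample sentences are hypothetical and do not come from the IMDb dataset.

```python
import re

# Toy word-level tokenizer for illustration only (hypothetical helper names).
# It is a stand-in for, not a copy of, the prebuilt tokenizer mentioned in the text.

TOKEN_PATTERN = r"\w+|[^\w\s]"  # words, or single punctuation characters


def build_vocab(texts):
    """Assign an integer ID to every distinct token; ID 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for text in texts:
        for token in re.findall(TOKEN_PATTERN, text.lower()):
            # setdefault only inserts the token if it is not already present
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Convert a string into a list of token IDs, mapping unseen tokens to <unk>."""
    tokens = re.findall(TOKEN_PATTERN, text.lower())
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]


# Hypothetical stand-ins for movie reviews
samples = ["A great movie!", "A dull movie."]
vocab = build_vocab(samples)

# "plot" was never seen while building the vocabulary, so it maps to ID 0
ids = encode("A great plot!", vocab)
print(ids)
```

The reserved `<unk>` entry shows why vocabulary coverage matters: any word that was not seen when the vocabulary was built collapses to the same ID, which is one reason subword tokenizers are often preferred in practice.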