Understanding the Corpus object
The Corpus object is the main object that stores corpora in memory in Flair. Each Corpus is a collection of three datasets that behave like lists of Sentence objects. These datasets can be accessed via the following properties:
- The train property, which contains the dataset that will be used for training models.
- The test property, which contains a dataset that is independent of the train dataset. It is held out and used to evaluate the trained model.
- The dev property, which contains the dataset that is used for hyperparameter tuning and for checking progress during training.
These three datasets ideally contain data from the same data source and follow the same probability distribution.
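One common way to end up with three splits that share a source and a distribution is to shuffle a single dataset and cut it into three parts. The following is a minimal sketch of that idea in plain Python, not Flair's own implementation. The function name split_three_ways, its parameters, and the toy strings standing in for Sentence objects are all assumptions made for illustration.

```python
import random


def split_three_ways(items, dev_size, test_size, seed=0):
    """Shuffle one data source and cut it into train, dev and test lists."""
    shuffled = list(items)                 # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    return train, dev, test


# Hypothetical toy data; real corpora would hold Sentence objects instead.
sentences = [f"sentence {i}" for i in range(100)]
train, dev, test = split_three_ways(sentences, dev_size=10, test_size=10)
print(len(train), len(dev), len(test))
```

Because every item is drawn from the same shuffled pool, no split is systematically different from the others, and each item ends up in exactly one split.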
An example corpus object can be obtained by loading one of Flair's prepared datasets:
from flair import datasets
corpus = datasets.UD_ENGLISH()
The corpus summary can be obtained by simply printing out the object:
print(corpus)
This should print out the following:
Corpus: 12543...
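Since the three datasets behave like lists, their sizes and individual items can also be inspected directly through the train, dev and test properties. The sketch below shows the idea with a small helper, split_sizes, which is a hypothetical name and not part of Flair. To keep the example runnable without downloading data, it uses a stand-in object with the same three properties. Passing a loaded Corpus instead should work the same way, assuming its datasets support len() and indexing.

```python
from types import SimpleNamespace


def split_sizes(corpus):
    """Return the number of items in each split of a corpus-like object."""
    return {name: len(getattr(corpus, name)) for name in ("train", "dev", "test")}


# Hypothetical stand-in exposing the same three properties as a Corpus.
toy_corpus = SimpleNamespace(
    train=["first", "second", "third"],
    dev=["fourth"],
    test=["fifth"],
)

sizes = split_sizes(toy_corpus)
print(sizes)

# Individual items are reached by index, as with a list.
first_test_item = toy_corpus.test[0]
print(first_test_item)
```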