Corpora can be in many different formats. In practice, we can use the following file formats. All these file formats are generally used to store features, which we will feed into our machine learning algorithms later. Practical stuff regarding dealing with the following file formats will be incorporated from Chapter 4, Preprocessing onward. Following are the aforementioned file formats:
- .txt: This format is basically given to us as a raw dataset. The gutenberg corpus is one of the example corpora. Some of the real-life applications have parallel corpora. Suppose you want to make Grammarly a kind of grammar correction software, then you will need a parallel corpus.
- .csv: This kind of file format is generally given to us if we are participating in some hackathons or on Kaggle. We use this file format to save our features, which we will...