Preparing natural language datasets
For the CV algorithms in the previous chapter, data preparation focused on the technical format required for the dataset (Image format, RecordIO, or augmented manifest). The images themselves weren't processed.
Things are quite different for NLP algorithms. Text needs to be heavily processed, converted, and saved in the right format. Most learning resources abbreviate these steps or skip them entirely. The data is "automagically" ready for training, leaving readers frustrated and unsure how to prepare their own datasets.
No such thing here! In this section, you'll learn how to prepare NLP datasets in different formats. Once again, get ready to learn a lot!
Let's start with preparing data for BlazingText.
Preparing data for classification with BlazingText
- A plain text file, with one sample per line. ...
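To make the one-sample-per-line layout concrete, here is a minimal sketch in Python. It assumes the fastText-style convention documented for BlazingText's supervised mode: each line starts with `__label__` followed by the label, then the space-separated tokens of the sample. The helper name `to_blazingtext_line`, the simple regex tokenizer, and the two sample reviews are illustrative assumptions, not a prescribed implementation.

```python
import re


def to_blazingtext_line(label, text):
    """Format one sample as '__label__<label> token token ...'."""
    # Assumption: lowercase the text and split punctuation from words
    # with a simple regex. A real pipeline may use a proper tokenizer.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return f"__label__{label} " + " ".join(tokens)


# Hypothetical samples, for illustration only.
samples = [
    ("positive", "Great camera, works perfectly!"),
    ("negative", "Stopped working after two days."),
]

# One formatted sample per line, ready to be written to a text file.
lines = [to_blazingtext_line(label, text) for label, text in samples]
print("\n".join(lines))
```

Writing `"\n".join(lines)` to a plain text file yields one labeled, tokenized sample per line. The tokenizer here is deliberately simplistic; swap in whichever one suits your language and dataset.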