A text corpus is a body of text data made up of a single document or a group of documents, and it can be in any language, such as English, German, or Hindi. Today, much of the textual data comes from social media, such as Facebook and Twitter, as well as from blogging sites and other platforms. Mobile applications have now been added to the list of such sources. The plural of corpus is corpora. A larger corpus generally gives the analysis more data to work with, which tends to make the results more accurate.
A corpus can be broken into units called sentences. Taken together, sentences carry the meaning and context of the corpus. Each sentence is built from words that play different parts of speech. Every sentence is separated from the next by a delimiter, such as a period, which we can use to split the corpus into its sentences. This is called sentence tokenization.
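To make the idea of splitting on a delimiter concrete, here is a minimal sketch in Python that uses only the standard library. The function name split_sentences and the sample corpus are illustrative assumptions, not part of any library. The rule is deliberately naive: it splits wherever a period, exclamation mark, or question mark is followed by whitespace.

```python
import re


def split_sentences(text):
    """Naively split text into sentences.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    """
    # The lookbehind keeps each delimiter attached to the sentence it ends.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings, which appear when the input is empty.
    return [part for part in parts if part]


# Hypothetical sample corpus for illustration.
corpus = (
    "A corpus is a body of text. It can be split into sentences! "
    "Can we then split those further?"
)
sentences = split_sentences(corpus)

for sentence in sentences:
    print(sentence)
```

This rule also splits after abbreviations such as "Dr." because it cannot tell an abbreviation's period from the end of a sentence. Dedicated tokenizers, such as sent_tokenize in the NLTK library, are designed to handle such cases better.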