The previous section analyzed the entity field of a tweet. This field provides useful knowledge about the tweet, because the entities are explicitly curated by the tweet's author. This section focuses instead on unstructured data, that is, the raw text of the tweet. We'll discuss aspects of text analytics such as text preprocessing and normalization, and we'll perform some statistical analysis on the tweets. Before digging into the details, we'll introduce some terminology.
Tokenization is one of the important steps in the preprocessing phase. Given a stream of text (such as a tweet status), tokenization is the process of breaking this text down into individual units called tokens. In the simplest form, these units are words, but we could also work on a more complex tokenization that deals with phrases, symbols, and so on.
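To make the idea concrete, the following is a minimal sketch of two tokenizers written with the Python standard library only. The function names (`simple_tokenize`, `tweet_tokenize`), the `TOKEN_PATTERN` regular expression, and the sample text are assumptions made for illustration; they are not the tokenizer used elsewhere in this chapter. The first function splits on whitespace, while the second uses a regular expression that keeps @-mentions, hashtags, and URLs as single tokens:

```python
import re

# Illustrative pattern (an assumption, not a complete tweet grammar).
# Alternatives are tried left to right, so URLs are tested before plain words.
TOKEN_PATTERN = re.compile(
    r"""
    (?:@\w+)            # @-mentions
    | (?:\#\w+)         # hashtags
    | (?:https?://\S+)  # URLs
    | (?:\w+(?:'\w+)?)  # words, optionally containing one apostrophe
    """,
    re.VERBOSE,
)


def simple_tokenize(text):
    """Split on whitespace; punctuation stays attached to the words."""
    return text.split()


def tweet_tokenize(text):
    """Return mentions, hashtags, URLs, and words as separate tokens."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    # Hypothetical tweet text used only for this example
    tweet = "Hello @user, check out #python at https://example.com now"
    print(simple_tokenize(tweet))
    print(tweet_tokenize(tweet))
```

With whitespace splitting, the mention comes out as `@user,` with the trailing comma attached, whereas the pattern-based version returns `@user` and discards the comma. Even this small difference shows why the choice of tokenizer affects any later counting or analysis of the tokens.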
Tokenization sounds like a trivial task, yet it has been widely studied by the natural language processing community. Chapter 1, Social Media...