Summary
In this chapter, you learned how to preprocess textual data, as well as nominal and ordinal categorical data, using state-of-the-art NLP techniques.
You can now build a classical NLP pipeline with stop word removal, lemmatization and stemming, and n-grams, and count term occurrences using a bag-of-words model. We used SVD to reduce the dimensionality of the resulting feature vectors and to generate a lower-dimensional topic encoding. One important refinement of the count-based bag-of-words model is to weight a document's terms by their relative frequencies: you learned about the TF-IDF function and can use it to compute the importance of a word in a document relative to the whole corpus.
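As a refresher, here is a minimal sketch of such a pipeline using scikit-learn and NLTK. The library choices, the toy documents, and the regex tokenizer are illustrative assumptions rather than the chapter's exact code:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

# Toy corpus for illustration only.
docs = [
    "The cats are chasing the mice",
    "A cat chased a mouse across the garden",
    "Stock markets rallied after the announcement",
]

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase, tokenize with a simple regex, drop stop words, lemmatize the rest.
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(lemmatizer.lemmatize(t) for t in tokens if t not in stop_words)

cleaned = [preprocess(d) for d in docs]

# Bag-of-words over unigrams and bigrams (n-grams), counting term occurrences.
bow = CountVectorizer(ngram_range=(1, 2))
counts = bow.fit_transform(cleaned)

# SVD on the count matrix yields a low-dimensional topic encoding (LSA).
svd = TruncatedSVD(n_components=2)
topics = svd.fit_transform(counts)

# TF-IDF weights each term by its frequency in a document relative to the corpus.
tfidf = TfidfVectorizer(ngram_range=(1, 2))
weights = tfidf.fit_transform(cleaned)

print(topics.shape)   # (3, 2): one 2-dimensional topic vector per document
print(weights.shape)  # (3, n_features): sparse TF-IDF matrix
```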
We then looked at Word2Vec and GloVe, which provide pretrained numeric word embeddings. You can now easily reuse a pretrained word embedding in commercial NLP applications, with significant improvements in accuracy thanks to the semantic embedding of words.
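The sketch below shows one way to reuse pretrained GloVe vectors through gensim's downloader API; the specific model name, the averaging-based document embedding, and gensim itself are assumptions for illustration:

```python
import numpy as np
import gensim.downloader as api

# Downloads ~65 MB of 50-dimensional GloVe vectors on first use.
glove = api.load("glove-wiki-gigaword-50")

print(glove["king"].shape)                  # (50,): dense vector for one word
print(glove.most_similar("king", topn=3))   # semantically close words

def embed(text):
    # A simple document embedding: average the vectors of in-vocabulary words.
    vecs = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(glove.vector_size)

print(embed("the queen ruled the kingdom").shape)  # (50,)
```

Averaging word vectors is only the simplest way to turn word embeddings into document features, but it already captures enough semantics to noticeably improve many downstream classifiers over pure count-based features.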
Finally, we finished the chapter by...