In this chapter, we discussed feature extraction: techniques for creating representations of data that machine learning algorithms can use. First, we created features from categorical explanatory variables using one-hot encoding and scikit-learn's DictVectorizer. We also learned to standardize data so that our estimators can learn from all of the features on an equal footing and converge as quickly as possible.
Second, we extracted features from one of the most common types of data used in machine learning problems: text. We worked through several variations of the bag-of-words model, which discards all syntax and encodes only the frequencies of the tokens in a document. We began by creating basic binary term frequencies with CountVectorizer. We learned to preprocess text by filtering stop words and stemming tokens, and we replaced the term counts in our feature vectors with tf-idf weights that penalize common words and normalize for documents of different lengths. We then...