In this recipe, we show how to handle text data with scikit-learn. Working with text requires careful preprocessing and feature extraction. The resulting feature matrices are usually highly sparse, because each document contains only a small fraction of the words in the whole vocabulary.
We will learn to recognize whether a comment posted during a public discussion is considered insulting to one of the participants. We will use a labeled dataset from Impermium, released during a Kaggle competition.
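To see why text features tend to be sparse, here is a minimal sketch in plain Python. It is independent of scikit-learn and is not the method used later in this recipe. The four comments are invented for illustration and do not come from the dataset, and the `tokenize` helper is a hypothetical, deliberately naive tokenizer. Each comment becomes a row of word counts, and only the nonzero counts are stored.

```python
from collections import Counter
import re

# Toy comments invented for illustration; they are NOT from the dataset.
comments = [
    "You are a complete idiot",
    "Thanks for sharing this article",
    "I disagree with your point about taxes",
    "What an idiot",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Build the vocabulary: one column per distinct word, in sorted order.
vocabulary = sorted({word for c in comments for word in tokenize(c)})
column = {word: j for j, word in enumerate(vocabulary)}

# Sparse representation: for each comment, keep only the nonzero counts
# as a {column_index: count} dictionary.
rows = [
    {column[w]: n for w, n in Counter(tokenize(c)).items()}
    for c in comments
]

# Compare the number of stored entries with the size of the full matrix.
n_cells = len(comments) * len(vocabulary)
n_nonzero = sum(len(r) for r in rows)
density = n_nonzero / n_cells
print(f"{len(comments)} x {len(vocabulary)} matrix, "
      f"{n_nonzero} nonzero entries ({density:.0%} of all cells)")
```

Even with four short comments, most cells of the full count matrix are zero. With thousands of comments and a vocabulary of tens of thousands of words, the fraction of nonzero entries becomes tiny, which is why scikit-learn's text vectorizers such as `CountVectorizer` and `TfidfVectorizer` return SciPy sparse matrices rather than dense arrays.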
Download the Troll dataset from the book's GitHub repository at https://github.com/ipython-books/cookbook-data.
This dataset was obtained from Kaggle, at www.kaggle.com/c/detecting-insults-in-social-commentary.