The Mechanical Turk
Data classification is a supervised learning technique. This means that a model can only predict the labels and categories it has learned from a training dataset. Because that training dataset must itself be properly labeled, obtaining those labels is the main challenge we address in this chapter.
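To make the constraint concrete, here is a minimal sketch in plain Python rather than the pipeline built in this chapter. The `train` and `predict` helpers, the toy headlines, and the two labels are all hypothetical and chosen only for illustration. The classifier simply scores each known label by word overlap, so it can never return a label that was absent from the training set.

```python
from collections import Counter, defaultdict


def train(labeled_articles):
    """Count how often each word appears under each label (hypothetical helper)."""
    word_counts = defaultdict(Counter)
    for text, label in labeled_articles:
        word_counts[label].update(text.lower().split())
    return word_counts


def predict(model, text):
    """Return the training label whose vocabulary overlaps most with the text."""
    words = text.lower().split()
    # Only labels seen during training are candidates; ties go to the
    # alphabetically first label because max() keeps the first maximum.
    return max(sorted(model), key=lambda label: sum(model[label][w] for w in words))


# Toy training set, invented for this example.
training_set = [
    ("central bank raises interest rates", "economy"),
    ("stock markets fall on inflation fears", "economy"),
    ("team wins the championship final", "sports"),
    ("striker scores twice in the final", "sports"),
]

model = train(training_set)
known_labels = set(model)

print(predict(model, "bank cuts interest rates"))
# A politics headline still receives one of the known labels,
# because "politics" was never part of the training data.
print(predict(model, "parliament votes on new election law"))
```

The second headline is about politics, yet the model has no choice but to answer with one of the two labels it was trained on. With hundreds of labels, every one of them needs labeled examples before it can be predicted at all.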
Human intelligence tasks
None of our news articles has been labeled up front, so there is strictly nothing we can learn from them as they stand. The common-sense approach for a data scientist is to label some input records manually and use them as a training dataset. However, because the number of classes may be relatively large, at least in our case (hundreds of labels), the amount of data to label could be significant (thousands of articles) and would require tremendous effort. A first solution is to outsource this laborious task to a "Mechanical Turk", a term that refers to one of the most famous hoaxes in history, where an automated chess player fooled...