In this section, we discuss news mining in R. We start with an example of successful document classification and then show how to collect news articles directly from R.
We first examine a dataset consisting of a term-document matrix for 2,071 press articles containing the word flu in their title. The articles were retrieved from LexisNexis by searching for this term in two newspapers, The New York Times and The Guardian, between January 1980 and May 2013. For copyright reasons, we cannot include the original articles here. The articles were preprocessed in a way similar to what we have seen before, but with another program, RapidMiner 5. In addition to the term-document matrix, the type of flu discussed, seasonal flu versus other (avian and swine flu), is included in the first column of the data frame (the SEASONAL.FLU attribute). When articles discussed both seasonal flu and other strains, they were coded as other (value 0). Terms were coded as present...
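To make the layout described above concrete, here is a minimal sketch of a data frame with the same shape. It is a toy stand-in, not the actual dataset: the term columns (vaccine, bird, pandemic) and all values are invented for illustration, and we assume that seasonal flu is coded 1 and that a term is coded 1 when present and 0 when absent.

```r
# Toy stand-in for the flu data frame; the real data are not reproduced here.
# Assumptions: SEASONAL.FLU is 1 for seasonal flu and 0 for other;
# terms are 1 when present and 0 when absent.
# The term columns (vaccine, bird, pandemic) are hypothetical.
flu_toy <- data.frame(
  SEASONAL.FLU = c(1, 0, 0, 1),
  vaccine      = c(1, 0, 1, 1),
  bird         = c(0, 1, 0, 0),
  pandemic     = c(0, 1, 1, 0)
)

# The class label sits in the first column, the terms in the remaining ones.
labels <- flu_toy[, 1]
terms  <- as.matrix(flu_toy[, -1])

# Number of articles per class.
table(labels)

# Number of articles in which each term occurs.
colSums(terms)
```

Each row stands for one article. Separating the label column from the term columns in this way is a common first step before fitting a classifier.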