In this section, we will start by preprocessing the corpus for analysis and inspecting the result. We will then build the training and testing data frames.
We can see that the joint corpus contains 2,000 documents, as we requested. We can now perform the steps we discussed in the preceding section. To do so, we will build a single function that performs them all at once (we will use this function again later in the chapter):
install.packages("SnowballC")

preprocess = function(corpus, stopwrds = stopwords("english")) {
  library(SnowballC)
  corpus = tm_map(corpus, content_transformer(tolower))
  corpus = tm_map(corpus, removePunctuation)
  corpus = tm_map(corpus, content_transformer(removeNumbers))
  corpus = tm_map(corpus, removeWords, stopwrds)
  corpus = tm_map(corpus, stripWhitespace)
  corpus = tm_map(corpus, stemDocument)
  corpus
}
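To make the effect of each step concrete, here is a minimal sketch that applies the function to a one-document toy corpus (the `toy` object and its text are illustrative, not part of our data):

```r
library(tm)

# A tiny illustrative corpus with case, punctuation, numbers, and stopwords
toy = VCorpus(VectorSource(c("The 3 cats were running quickly!")))

toy = preprocess(toy)

# Inspect the transformed text: lowercased, punctuation and numbers removed,
# stopwords ("the", "were") dropped, remaining words stemmed
content(toy[[1]])
```

Note that stemming truncates words to their root form (for example, "cats" becomes "cat"), so the processed text is no longer grammatical; this is expected and helps group inflected variants of the same term.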
Let's run the function on our corpus:
processed...