Our goal here is to build a classifier that predicts presidential party affiliation, either Democrat or Republican, since 1900. We will turn the word counts per year into features, create a document-term matrix (DTM), weight the features using term frequency-inverse document frequency (tf-idf), and use them in our model. As you can imagine, we will have thousands of features, so we will prepare the data differently from what we covered in prior sections, and we will use the text2vec package for feature creation and modeling.
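To make the tf-idf weighting concrete, here is a minimal base R sketch on a toy DTM. The documents, terms, and counts are invented for illustration, and the formula shown is only one common textbook variant. text2vec may apply smoothing and normalization, so its values can differ from these.

```r
# Toy document-term matrix: rows are documents, columns are term counts.
# These documents and terms are hypothetical, not taken from the sotu data.
dtm <- matrix(
  c(2, 1, 0,
    0, 1, 3,
    1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("economy", "nation", "war"))
)

# Term frequency: each count divided by the total count for its document.
tf <- dtm / rowSums(dtm)

# Inverse document frequency:
# log(number of documents / number of documents containing the term).
doc_freq <- colSums(dtm > 0)
idf <- log(nrow(dtm) / doc_freq)

# tf-idf: scale each column of tf by that term's idf.
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
```

In this toy example, "nation" appears in every document, so its idf is log(1) = 0 and it contributes nothing. A term concentrated in a single document, such as "war", receives the largest weight.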
We'll start by restricting the data to the pertinent period. Then, we'll take a look at a table of the labels:
> sotu_party <- sotu_meta %>%
    dplyr::filter(year > 1899)
> table(sotu_party$party)

Democratic Republican
        61         64
The classes are well balanced.
A few things can help in the modeling process. It is a good idea here to remove numbers, remove capitalization, remove stop words, stem the words, and remove punctuation. The built-in functions...