For word frequency analysis, we want to clean this data by removing stop words, which would otherwise clutter our interpretation. We'll explore the top overall word frequencies, then take a look at President Lincoln's work.
To get rid of stop words in a tidy format, you can use the stop_words data frame provided in the tidytext package. You call that tibble into the environment, then do an anti-join by word:
> library(tidytext)
> data(stop_words)
> sotu_tidy <- sotu_unnest %>%
+   dplyr::anti_join(stop_words, by = "word")
Notice that the number of rows went from 1.97 million observations down to 778,161. Now, you can go ahead and see the top words. I don't do so in the following, but you can assign the result to a data frame if you so choose:
> sotu_tidy %>%
+   dplyr::count(word, sort = TRUE)
# A tibble: 29,558 x 2
   word           n
   <chr>      <int>
 1 government  7573
 2 congress    5759
 3 united      5102
...
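If you do want to keep the counts around for later use (say, for plotting), a minimal sketch looks like the following; the object names sotu_counts and sotu_counts_df are my own, not from the text, and it assumes the sotu_tidy object created above:

```r
library(dplyr)

# Store the sorted word counts in a new object so they can be reused
# (sotu_counts is a hypothetical name for illustration)
sotu_counts <- sotu_tidy %>%
  dplyr::count(word, sort = TRUE)

# count() returns a tibble; convert to a base data frame if preferred
sotu_counts_df <- as.data.frame(sotu_counts)
```

Because count() already returns a tibble, the conversion step is optional; it only matters if downstream code expects a plain data.frame.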