Mastering Data Analysis with R

By: Gergely Daróczi

Further cleanup


A few minor glitches still remain in the word list. We may not want to keep numbers in the package descriptions at all (or we might want to replace all numbers with a placeholder text, such as NUM), and some frequent technical words, such as package, can be ignored as well. Listing both the singular and the plural forms of nouns is also redundant. Let's improve our corpus with some further tweaks, step by step!

Removing the numbers from the package descriptions is fairly straightforward, following the previous examples:

> v <- tm_map(v, removeNumbers)
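
The placeholder alternative mentioned earlier (replacing numbers with NUM instead of dropping them) could be sketched as follows. This is only an illustration: replace_numbers is a hypothetical helper, not a function of the tm package, and the sample sentence is made up.

```r
# Hypothetical helper (not part of tm): swap each run of digits
# for a placeholder string instead of deleting it
replace_numbers <- function(x, placeholder = "NUM") {
  gsub("[[:digit:]]+", placeholder, x)
}

# Made-up sample text standing in for a package description
replace_numbers("Supports R 3 and 64 bit systems")

# Assumed usage on the corpus, with tm loaded:
# v <- tm_map(v, content_transformer(replace_numbers))
```

Wrapping the helper in content_transformer lets tm_map apply a plain character function to the text of each document.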

To remove some frequent domain-specific words that carry little meaning, let's first look at the most common words in the documents. To this end, we first have to build a TermDocumentMatrix object, which can later be passed to the findFreqTerms function to identify the most frequent terms in the corpus:

> tdm <- TermDocumentMatrix(v)
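
As a minimal sketch of how these two functions work together, the following builds a term-document matrix on a tiny corpus and queries it. The three documents and the names docs, toy, toy_tdm, and frequent are made up for illustration, and the frequency threshold of 2 is an arbitrary assumption.

```r
library(tm)

# Hypothetical toy documents standing in for the package descriptions
docs <- c("tools for data analysis",
          "data import tools",
          "plotting data")
toy <- VCorpus(VectorSource(docs))

# Terms become rows and documents become columns
toy_tdm <- TermDocumentMatrix(toy)

# Terms that occur at least twice across the whole toy corpus
frequent <- findFreqTerms(toy_tdm, lowfreq = 2)
frequent
```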

This object is basically a matrix which...