Book Image

Mastering Data analysis with R

By : Gergely Daróczi
Book Image

Mastering Data analysis with R

By: Gergely Daróczi

Overview of this book

Table of Contents (19 chapters)
Mastering Data Analysis with R
Credits
www.PacktPub.com
Preface

Cleaning the corpus


One of the nicest features of the tm package is the variety of bundled transformations to be applied on corpora (corpuses). The tm_map function provides a convenient way of running the transformations on the corpus to filter out all the data that is irrelevant in the actual research. To see the list of available transformation methods, simply call the getTransformations function:

> getTransformations()
[1] "as.PlainTextDocument" "removeNumbers"
[3] "removePunctuation"    "removeWords"
[5] "stemDocument"         "stripWhitespace"     

We should usually start with removing the most frequently used, so called stopwords from the corpus. These are the most common, short function terms, which usually carry less important meanings than the other expressions in the corpus, especially the keywords. The package already includes such lists of words in different languages:

> stopwords("english")
  [1] "i"          "me"         "my"         "myself"     "we"        
  [6] "our...