Mastering Data Analysis with R

By: Gergely Daróczi

Importing the corpus


A corpus is a collection of text documents that you want to analyze. Use the getSources function to list the source types from which the tm package can import a corpus:

> library(tm)
> getSources()
[1] "DataframeSource" "DirSource"       "ReutersSource"   "URISource"
[5] "VectorSource"
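
Each name in this list is also a constructor function in tm. As a minimal sketch of how one of them is used, the following builds a small in-memory corpus from a character vector with VectorSource. The two sample sentences and the variable names texts and vec_corpus are made up for illustration, and VCorpus is used here as one of the corpus constructors that tm provides.

```r
library(tm)

# Made-up sample texts, used only for illustration
texts <- c("R is a language for statistical computing.",
           "The tm package provides a text mining framework.")

# Wrap the character vector in a source, then build a corpus from it
vec_corpus <- VCorpus(VectorSource(texts))

# Number of documents in the corpus
length(vec_corpus)

# Text of the first document
as.character(vec_corpus[[1]])
```

Every element of the vector becomes one document, so this route is convenient when the texts are already loaded into the R session.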

So, we can import text documents from a data.frame, a vector, or directly from a uniform resource identifier with the URISource function. URISource takes a collection of hyperlinks or file paths, although local files are somewhat easier to handle with DirSource, which imports all the text documents found in the referenced directory on our hard drive. Call the getReaders function in the R console to see the supported text file formats:

> getReaders()
[1] "readDOC"                 "readPDF"                
[3] "readPlain"               "readRCV1"               
[5] "readRCV1asPlain"         "readReut21578XML"       
[7] "readReut21578XMLasPlain" ...
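
A source describes where the documents live, and a reader describes how each one is parsed. The following is a minimal sketch of combining the two, assuming a directory of plain-text files: it passes a DirSource to VCorpus and names readPlain as the reader through the readerControl argument. The temporary directory corpus_dir and the files doc1.txt and doc2.txt are hypothetical and are created only to keep the example self-contained.

```r
library(tm)

# Hypothetical setup: write two small text files into a temporary directory
corpus_dir <- file.path(tempdir(), "tm_corpus_demo")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines("R is a language for statistical computing.",
           file.path(corpus_dir, "doc1.txt"))
writeLines("The tm package provides a text mining framework.",
           file.path(corpus_dir, "doc2.txt"))

# Import every .txt file in the directory, reading each one as plain text
dir_corpus <- VCorpus(
  DirSource(corpus_dir, pattern = "\\.txt$"),
  readerControl = list(reader = readPlain, language = "en")
)

# Number of imported documents
length(dir_corpus)

# Text of the first imported document
as.character(dir_corpus[[1]])
```

Swapping readPlain for another entry from the getReaders list would be the way to handle a different file format, although some readers depend on external tools being installed.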