R for Data Science

By: Dan Toomey

While the standard R system provides many features and functions out of the box, one of its strengths is that packages can add further functionality. A package bundles a set of functions (and sometimes sample data) that can be used to solve a particular kind of problem in R. Packages are developed and shared by interested groups for the benefit of all R developers. In this chapter, we will use the following packages:

  • tm: This contains text mining tools

  • XML: This contains XML processing tools
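Before either package can be used, it has to be installed and loaded. The following is a minimal sketch of one way to do that, not a step taken from the book. The helper name `missing_packages` is hypothetical, and the install step assumes an interactive session with access to a CRAN mirror.

```r
# Hypothetical helper (not part of R or the book): return the names in
# `pkgs` that are not installed in the current library paths.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# The two packages named in this chapter
needed <- missing_packages(c("tm", "XML"))

# Install only what is missing; guarded so that a non-interactive script
# never triggers a download on its own (an assumption made for this sketch)
if (interactive() && length(needed) > 0) {
  install.packages(needed)
  needed <- missing_packages(needed)
}

# Attach the packages once both are available
if (length(needed) == 0) {
  library(tm)
  library(XML)
}
```

Checking with `requireNamespace()` first avoids reinstalling a package on every run of the script.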

Text processing

R has built-in functions for manipulating text. They can adjust the text to make it easier to analyze (for example, by reducing words to their stems or removing punctuation) and build a document matrix that shows how words are used throughout a document. Once these steps are done, we can submit our documents to analysis and clustering.
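To make the idea of a document matrix concrete, here is a small sketch that uses only base R functions. It is an illustration, not the tm workflow the chapter follows: the two sample sentences are made up, the variable names are assumptions, and stemming is deliberately left out because it needs more than a one-line base R call.

```r
# Two tiny made-up "documents" (illustrative data, not from the book)
docs <- c(doc1 = "The cats are running; the dog runs!",
          doc2 = "A dog ran. Dogs run daily.")

# Normalize: lower-case, replace punctuation with spaces, collapse whitespace
clean <- tolower(docs)
clean <- gsub("[[:punct:]]", " ", clean)
clean <- trimws(gsub("\\s+", " ", clean))

# Tokenize each document on single spaces
tokens <- strsplit(clean, " ", fixed = TRUE)

# Vocabulary: every distinct word across all documents
vocab <- sort(unique(unlist(tokens)))

# Document matrix: one row per document, one column per word,
# each cell holding how often that word occurs in that document
dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
rownames(dtm) <- names(docs)
colnames(dtm) <- vocab

print(dtm)
```

Because no stemming is applied, "dog" and "dogs" land in separate columns. Merging such variants is exactly the kind of adjustment that word stemming is meant to handle.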


In this example, we will perform the following steps:

  1. We will take an HTML document from the Internet.

  2. We will clean up the document using text processing...