R for Data Science

By: Dan Toomey

Chapter 3. Text Mining

Much of the data available today is text: unstructured, massive in volume, and highly varied. In this chapter, we look at the tools available in R for extracting useful information from text.

This chapter describes different ways of mining text. We will cover the following topics:

  • Examining the text in various ways

    • Converting text to lowercase

    • Removing punctuation

    • Removing numbers

    • Removing URLs

    • Removing stop words

    • Using word stems rather than inflected forms

    • Building a document-term matrix that records word usage

  • XML processing, both orthogonal and of varying degrees

  • Examples
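
As a preview of the text-examination steps listed above, the following is a minimal sketch in base R only. It is not necessarily the approach taken later in the chapter. The sample documents, the short stop word list, and the helper names `clean_text`, `tokenize`, and `count_terms` are all assumptions made for illustration. Stemming is left out because base R has no built-in stemmer; a package such as SnowballC provides one.

```r
# Hypothetical sample documents, invented for illustration
docs <- c(
  "Visit http://example.com for 3 FREE samples!",
  "The samples are free, and the visit is short."
)

# A tiny hand-picked stop word list (an assumption; real lists are much longer)
stop_words <- c("the", "for", "are", "and", "is")

# Apply the cleaning steps in order; URLs go first, while they are still intact
clean_text <- function(x) {
  x <- tolower(x)                        # convert to lowercase
  x <- gsub("http[^[:space:]]*", "", x)  # remove URLs
  x <- gsub("[[:punct:]]", "", x)        # remove punctuation
  x <- gsub("[[:digit:]]", "", x)        # remove numbers
  x
}

# Split one cleaned document into words and drop the stop words
tokenize <- function(x) {
  words <- strsplit(x, "[[:space:]]+")[[1]]
  words <- words[words != ""]
  words[!(words %in% stop_words)]
}

tokens <- lapply(clean_text(docs), tokenize)

# Build a small document-term matrix: one row per document, one column per term
vocab <- sort(unique(unlist(tokens)))
count_terms <- function(w) as.integer(table(factor(w, levels = vocab)))
dtm <- t(vapply(tokens, count_terms, integer(length(vocab))))
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), vocab)

print(dtm)
```

Each cell of `dtm` holds the number of times a term occurs in a document. The order of the cleaning steps matters: stripping punctuation before URLs would leave fragments such as `httpexamplecom` behind as spurious words.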