Book Image

Mastering Text Mining with R

By : KUMAR ASHISH
Book Image

Mastering Text Mining with R

By: KUMAR ASHISH

Overview of this book

Text Mining (or text data mining or text analytics) is the process of extracting useful and high-quality information from text by devising patterns and trends. R provides an extensive ecosystem to mine text through its many frameworks and packages. Starting with basic information about the statistics concepts used in text mining, this book will teach you how to access, cleanse, and process text using the R language and will equip you with the tools and the associated knowledge about different tagging, chunking, and entailment approaches and their usage in natural language processing. Moving on, this book will teach you different dimensionality reduction techniques and their implementation in R. Next, we will cover pattern recognition in text data utilizing classification mechanisms, perform entity recognition, and develop an ontology learning framework. By the end of the book, you will develop a practical application from the concepts learned, and will understand how text mining can be leveraged to analyze the massively available data on social media.
Table of Contents (15 chapters)

Feature extraction


Feature extraction is a very important and valuable step in text mining. A system that can extract features from text has potential to be used in lots of applications. The initial step for feature extraction would be tagging the document; this tagged document is then processed to extract the required entities that are meaningful.

The elements that can be extracted from the text are:

  • Entities: These are some of the pieces of meaningful information that can be found in the document, for example, location, companies, people, and so on

  • Attributes: These are the features of the extracted entities, for example the title of the person, type of organization, and so on

  • Events: These are the activities in which the entities participate, for example, dates

Textual Entailment Human communication is diverse in terms of the usage of different expressions to communicate the same message. This proves to be quite a challenging problem in natural language processing. Consider, for example...