Mastering Text Mining with R

By: KUMAR ASHISH

Overview of this book

Text mining (also called text data mining or text analytics) is the process of extracting useful, high-quality information from text by identifying patterns and trends. R provides an extensive ecosystem for mining text through its many frameworks and packages. Starting with the basic statistical concepts used in text mining, this book will teach you how to access, cleanse, and process text using the R language, and will equip you with the tools and knowledge needed to apply different tagging, chunking, and entailment approaches in natural language processing. Moving on, this book will teach you different dimensionality reduction techniques and their implementation in R. Next, we will cover pattern recognition in text data using classification mechanisms, perform entity recognition, and develop an ontology learning framework. By the end of the book, you will have developed a practical application from the concepts learned, and will understand how text mining can be used to analyze the vast amount of data available on social media.
Table of Contents (15 chapters)

Bias–variance trade-off and learning curve


Non-linear classifiers are usually more powerful than linear classifiers on text classification problems, but that does not make a non-linear classifier the right solution to every text classification problem. No single learning algorithm is optimal across all problems, so algorithm selection is a pivotal part of any modeling exercise. Nor should the complexity of a model be judged solely by whether it is a linear or a non-linear classifier; other aspects of the modeling process, such as feature selection and regularization, also contribute to model complexity.

The error components in a learning model can be categorized broadly as irreducible errors and reducible errors. Irreducible errors are caused by inherent variability in a system; not much can be done about this component...
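
The split into reducible and irreducible error can be written out explicitly. The following is a sketch of the standard decomposition, not a formula taken from the book. It assumes squared-error loss and a data model y = f(x) + ε with zero-mean noise of variance σ². The symbols f, f̂, ε, and σ² are notation introduced here for illustration. The expectation is taken over training sets and over the noise.

```latex
% Assumed setup (not from the source): y = f(x) + \varepsilon,
% E[\varepsilon] = 0, Var(\varepsilon) = \sigma^2, squared-error loss.
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Under these assumptions, the first two terms make up the reducible error, which depends on the choice of model and training procedure. The last term is the irreducible error, which comes from the variability of the system itself and is the same for every model.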