Mastering Data Analysis with R

By: Gergely Daróczi

Analyzing the associations among terms


The previously computed TermDocumentMatrix can also be used to identify associations between the cleaned terms found in the corpus. Here, association simply means the correlation coefficient computed on the joint occurrence of term pairs in the same documents, which can be queried easily with the findAssocs function.
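
To make the idea of correlation on joint occurrence concrete, here is a minimal base R sketch. It assumes the association score is the Pearson correlation between the occurrence vectors of two terms across documents. It does not call the tm package, and the toy_tdm matrix and the term_assocs helper are hypothetical names with made-up counts, used only for illustration:

```r
# Hypothetical toy term-document matrix (made-up counts):
# rows are terms, columns are documents.
toy_tdm <- matrix(
  c(1, 1, 0, 1, 0,   # data
    1, 0, 0, 1, 0,   # big
    0, 1, 1, 0, 1,   # plot
    1, 1, 0, 0, 0),  # set
  nrow = 4, byrow = TRUE,
  dimnames = list(c("data", "big", "plot", "set"), paste0("doc", 1:5))
)

# Illustrative helper (not part of tm): correlate the occurrence vector of
# every other term with that of the target term, then keep the terms whose
# correlation reaches the given limit.
term_assocs <- function(m, term, corlimit) {
  others <- m[rownames(m) != term, , drop = FALSE]
  r <- cor(t(others), m[term, ])[, 1]   # Pearson correlation by default
  r <- r[r >= corlimit]
  round(sort(r, decreasing = TRUE), 2)
}

term_assocs(toy_tdm, "data", 0.1)
```

In this toy example, big and set tend to appear in the same documents as data, so they pass the limit, while plot mostly appears where data does not, gets a negative correlation, and is dropped.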

Let's see which words are associated with data:

> findAssocs(tdm, 'data', 0.1)
             data
set          0.17
analyzing    0.13
longitudinal 0.11
big          0.10

Only four terms reach the 0.1 correlation threshold, and it's not surprising at all that analyzing is among the top associated words. We can probably ignore the set term, but it seems that longitudinal and big data are pretty frequent expressions in package descriptions. So, what other terms are associated with big?

> findAssocs(tdm, 'big', 0.1)
               big
mpi           0.38
pbd           0.33
program       0.32
unidata       0.19
demonstration 0.17
netcdf    ...