To run LDA using Mahout, we will use the 20newsgroups dataset. We will convert the corpus into vectors, run LDA on those vectors, and get the resultant topics.
Let's run this example to see how topic modeling works in Mahout.
We will use the 20newsgroups dataset for this exercise. Download the dataset archive, 20news-bydate.tar.gz, from http://qwone.com/~jason/20Newsgroups/.
Create a directory, 20newsdata, and extract the archive into it:
    mkdir /tmp/20newsdata
    cd /tmp/20newsdata
    tar -xzvf /tmp/20news-bydate.tar.gz
There are two folders under 20newsdata: 20news-bydate-test and 20news-bydate-train. Now, create another directory, 20newsdataall, and merge the training and test data of each group into it. Move to the home directory and execute the following commands:
    mkdir /tmp/20newsdataall
    cp -R /tmp/20newsdata/*/* /tmp/20newsdataall
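To see why a single copy with the */* pattern merges the two folders, here is a minimal sketch that repeats the merge step on a throwaway directory tree. The temporary work directory, the single sci.space group, and the file names 1001 and 2001 are made up for illustration; only the shape of the cp command mirrors the step above.

```shell
#!/bin/sh
# Toy reproduction of the merge step. All paths and file names here are
# hypothetical stand-ins, not the real dataset.
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT   # remove the scratch tree when the script ends

# Build a miniature copy of the layout: one group present in both folders.
mkdir -p "$work/20newsdata/20news-bydate-train/sci.space" \
         "$work/20newsdata/20news-bydate-test/sci.space" \
         "$work/20newsdataall"
echo "train doc" > "$work/20newsdata/20news-bydate-train/sci.space/1001"
echo "test doc"  > "$work/20newsdata/20news-bydate-test/sci.space/2001"

# */* expands to every group directory under both the train and test folders.
# Group directories with the same name are copied into the same destination
# directory, so their files end up side by side.
cp -R "$work"/20newsdata/*/* "$work/20newsdataall"

# Count the files that landed in the merged group directory.
merged_count=$(find "$work/20newsdataall/sci.space" -type f | wc -l | tr -d ' ')
echo "files in merged sci.space: $merged_count"
```

On this toy tree the script should report two files in the merged sci.space directory, one from each folder. The same kind of count on /tmp/20newsdataall is a quick way to confirm the merge before loading the data into HDFS.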
Create a directory in Hadoop and save this data in HDFS:
    hadoop fs -mkdir /usr/hue/20newsdata
    hadoop fs -put /tmp/20newsdataall /usr/hue...