Rapid - Apache Mahout Clustering designs

To run LDA using Mahout, we will use the 20 Newsgroups dataset. We will convert the corpus into vectors, run LDA on those vectors, and inspect the resulting topics.

Let's run this example to see how topic modeling works in Mahout.

Dataset selection

We will use the 20 Newsgroups dataset for this exercise. Download the archive 20news-bydate.tar.gz from http://qwone.com/~jason/20Newsgroups/.

Steps to execute CVB (LDA)

Create a directory named 20newsdata and extract the archive into it:

mkdir /tmp/20newsdata
cd /tmp/20newsdata
tar -xzvf /tmp/20news-bydate.tar.gz

There are now two folders under 20newsdata: 20news-bydate-test and 20news-bydate-train. Next, create another directory, 20newsdataall, and merge the training and test data of each group into it.

Execute the following commands to do so:

mkdir /tmp/20newsdataall
cp -R /tmp/20newsdata/*/* /tmp/20newsdataall
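
To see why a single recursive copy is enough to merge the two splits, the following sketch repeats the merge on a tiny mock tree and counts the result. The temporary directory, the two group names, and the file names are illustrative assumptions that stand in for the real dataset; only the copy pattern itself comes from the steps above.

```shell
#!/bin/sh
# Mock layout only: two small "newsgroups" stand in for the real dataset.
work=$(mktemp -d)
for split in 20news-bydate-train 20news-bydate-test; do
  mkdir -p "$work/20newsdata/$split/alt.atheism" "$work/20newsdata/$split/sci.space"
done
echo "train post" > "$work/20newsdata/20news-bydate-train/alt.atheism/1001"
echo "test post"  > "$work/20newsdata/20news-bydate-test/alt.atheism/2001"
echo "train post" > "$work/20newsdata/20news-bydate-train/sci.space/1002"
echo "test post"  > "$work/20newsdata/20news-bydate-test/sci.space/2002"

# Same merge pattern as in the text, run against the mock tree.
# Group folders with the same name in both splits end up as one folder.
mkdir "$work/20newsdataall"
cp -R "$work"/20newsdata/*/* "$work/20newsdataall"

# Count the merged group folders and the documents inside them.
group_count=$(find "$work/20newsdataall" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' ')
doc_count=$(find "$work/20newsdataall" -type f | wc -l | tr -d ' ')
echo "groups: $group_count, documents: $doc_count"
```

Running the same two find commands against /tmp/20newsdataall is a quick sanity check on the real data: the merged directory should contain one folder per newsgroup, 20 in total.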

Create a directory in HDFS and copy the merged data into it:

hadoop fs -mkdir /usr/hue/20newsdata
hadoop fs -put /tmp/20newsdataall /usr/hue...
