Rapid - Apache Mahout Clustering designs

Running LDA using Mahout


To run LDA using Mahout, we will use the 20newsgroups dataset. We will convert the corpus into vectors, run LDA on those vectors, and inspect the resulting topics.

Let's run this example to view how topic modeling works in Mahout.
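As a rough orientation, the convert, run, and inspect flow described above can be sketched with Mahout's command-line drivers. This is only a sketch under stated assumptions: the working directory, the topic count, the iteration count, and the `run` helper are hypothetical, and flag spellings may differ between Mahout releases. By default the script only prints the commands, so it can be read and adjusted before anything touches a cluster.

```shell
#!/bin/sh
# Sketch of the vectorize -> LDA -> dump flow.
# Every path below is a hypothetical placeholder, not the book's layout.

WORK="${WORK:-/tmp/lda-sketch}"   # hypothetical working directory (HDFS on a cluster)
NUM_TOPICS="${NUM_TOPICS:-20}"    # assumed number of topics
DRY_RUN="${DRY_RUN:-1}"           # 1 = print the commands instead of running them

# Hypothetical helper: echo the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# 1. Convert the raw text files into SequenceFiles.
run mahout seqdirectory -i "$WORK/raw" -o "$WORK/seq" -c UTF-8 -ow

# 2. Convert the SequenceFiles into term-frequency vectors.
run mahout seq2sparse -i "$WORK/seq" -o "$WORK/vectors" -wt tf --namedVector -ow

# 3. Re-key the vectors with integer row IDs, which CVB expects.
run mahout rowid -i "$WORK/vectors/tf-vectors" -o "$WORK/matrix"

# 4. Run CVB (Mahout's LDA implementation).
run mahout cvb -i "$WORK/matrix/matrix" -o "$WORK/lda" \
  -k "$NUM_TOPICS" -x 20 -ow \
  -dict "$WORK/vectors/dictionary.file-0" \
  -dt "$WORK/lda-doc-topics" -mt "$WORK/lda-model"

# 5. Dump the leading terms of each topic to a local text file.
#    The value passed to -sort is an assumption; check `mahout vectordump --help`.
run mahout vectordump -i "$WORK/lda" -d "$WORK/vectors/dictionary.file-0" \
  -dt sequencefile -vs 10 -sort true -o /tmp/lda-topics.txt
```

Setting `DRY_RUN=0` executes the commands instead of printing them, which assumes that `mahout` is on the `PATH` and that the input data is already in place.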

Dataset selection

We will use the 20newsgroups dataset for this exercise. Download the archive 20news-bydate.tar.gz from http://qwone.com/~jason/20Newsgroups/.
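The download can also be done from the command line. The following is a minimal sketch, assuming that `curl` is installed and that the archive is served directly under the page URL given above. The `fetch_dataset` function name and the `fetch` argument are hypothetical, and the target path is chosen to match the `/tmp` location used in the steps that follow.

```shell
#!/bin/sh
# Assumption: the archive sits directly under the dataset page URL.
DATA_URL="http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz"
ARCHIVE="/tmp/20news-bydate.tar.gz"

# Hypothetical helper: download the archive unless a non-empty copy exists.
fetch_dataset() {
  if [ -s "$ARCHIVE" ]; then
    echo "Archive already present: $ARCHIVE"
    return 0
  fi
  # -f fails on HTTP errors, -L follows redirects, -o sets the output file.
  curl -fL -o "$ARCHIVE" "$DATA_URL"
}

# Only download when the script is called with the argument "fetch".
if [ "${1:-}" = "fetch" ]; then
  fetch_dataset
fi
```

Saved as a script, this can be run with the single argument `fetch`. Running it without arguments only defines the function and downloads nothing.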

Steps to execute CVB (LDA)

  1. Create a directory /tmp/20newsdata and extract the archive into it:

    mkdir /tmp/20newsdata
    cd /tmp/20newsdata
    tar -xzvf /tmp/20news-bydate.tar.gz
    
  2. There are now two folders under 20newsdata: 20news-bydate-test and 20news-bydate-train. Next, create another directory, 20newsdataall, and merge the training and test data of each newsgroup into it.

  3. Now, move to the home directory and execute the following command:

    mkdir /tmp/20newsdataall
    cp -R /tmp/20newsdata/*/* /tmp/20newsdataall
    
  4. Create a directory in Hadoop and save this data in HDFS:

    hadoop fs -mkdir /usr/hue/20newsdata
    hadoop fs -put /tmp/20newsdataall /usr/hue...