We can use Latent Dirichlet Allocation (LDA) to cluster a given set of words into topics and a set of documents into combinations of topics. LDA is useful when identifying the meaning of a document or a word based on the context, without solely depending on the number of words or the exact words. LDA is a step away from raw text matching and towards semantic analysis. LDA can be used to identify the intent and to resolve ambiguous words in a system such as a search engine. Some other example use cases of LDA are identifying influential Twitter users for particular topics and Twahpic (http://twahpic.cloudapp.net) application uses LDA to identify topics used on Twitter.
LDA uses the TF vector space model as opposed to the TF-IDF model as it needs to consider the co-occurrence and correlation of words.
Install Apache Mahout in your machine using your Hadoop distribution, or install the latest Apache Mahout version manually.