-
Book Overview & Buying
-
Table Of Contents
-
Feedback & Rating
Mastering Spark for Data Science
By :
With our text content properly cleaned up, we can feed a Word2Vec algorithm and attempt to understand the words in their actual context.
As it says on the tin, the Word2Vec algorithm transforms a word into a vector. The idea is that similar words will be embedded into similar vector spaces and, as such, will look close to one another contextually.
More information about Word2Vec algorithm can be found at https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.
Well integrated into Spark, a Word2Vec model can be trained as follows:
import org.apache.spark.mllib.feature.Word2Vec
val corpusRDD = tweetRDD
.map(_.body.split("\\s").toSeq)
.filter(_.distinct.length >= 4)
val model = new Word2Vec().fit(corpusRDD)Here we extract each tweet as a sequence of words, only keeping records with at least 4 distinct words. Note that the list of all words...