Using word2vec to find word relationships
Word2vec was developed by Tomas Mikolov and his colleagues at Google around 2013. The original idea behind word2vec was to demonstrate that one can trade model complexity for efficiency: a shallow model trained on a much larger corpus can yield better word representations than a complex model trained on less data. Instead of representing a document as a bag of words, word2vec takes the context of each word into account by analyzing n-grams or skip-grams (a set of surrounding tokens, with the token in question potentially skipped). The words and word contexts themselves are represented by an array of floats/doubles, $\bar{w}$. The objective function is to maximize the average log likelihood:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{j=-k,\, j \neq 0}^{k} \log p\left(w_{t+j} \mid w_t\right)$$
Where:

- $T$ is the number of words in the training corpus and $w_t$ is the word at position $t$
- $k$ is the size of the training window, that is, the context around each word
- $p(w_{t+j} \mid w_t)$ is the probability of seeing the word $w_{t+j}$ in the context of $w_t$; in the skip-gram model, each word $w$ is associated with two vectors, $u_w$ and $v_w$, its representations as a word and as a context respectively, and the probability is given by the softmax model:

$$p\left(w_i \mid w_j\right) = \frac{\exp\left(u_{w_i}^{\top} v_{w_j}\right)}{\sum_{l=1}^{V} \exp\left(u_l^{\top} v_{w_j}\right)}$$

where $V$ is the vocabulary size.
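To make the softmax concrete, here is a toy sketch in Scala with hypothetical two-dimensional vectors and a three-word vocabulary (the vector values are made up for illustration):

```scala
object SoftmaxToy extends App {
  // Hypothetical word vectors u_w for a 3-word vocabulary
  val u = Array(Array(0.2, 0.7), Array(0.5, 0.1), Array(0.9, 0.4))
  // Hypothetical context vector v_{w_j} for a given context word w_j
  val v = Array(0.3, 0.6)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // p(w_i | w_j) = exp(u_{w_i} . v_{w_j}) / sum_l exp(u_l . v_{w_j})
  val scores = u.map(ui => math.exp(dot(ui, v)))
  val probs  = scores.map(_ / scores.sum)
  probs.zipWithIndex.foreach { case (p, i) =>
    println(f"p(w$i | w_j) = $p%.3f")
  }
}
```

Note that computing the denominator requires a dot product for every word in the vocabulary, which is what motivates the hierarchical softmax discussed next.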
By choosing the optimal $u_w$ and $v_w$, we get a comprehensive word representation (this is also called map optimization). Similar words are then found based on the cosine similarity metric (dot product) of $u_w$. The Spark implementation uses hierarchical softmax, which reduces the complexity of computing the conditional probability to $O(\log V)$, or log of the vocabulary size $V$, as opposed to $O(V)$, or proportional to $V$. The training...
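As a minimal sketch of how this looks with Spark's RDD-based MLlib API (the corpus path text8 and the query word day are placeholders, not part of the original text):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.Word2Vec

object Word2VecExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("word2vec").setMaster("local[*]"))

    // Each input record is a sequence of tokens; text8 is assumed to be
    // a whitespace-tokenized corpus file
    val input = sc.textFile("text8").map(line => line.split(" ").toSeq)

    val model = new Word2Vec()
      .setVectorSize(100)   // dimensionality of the word vectors
      .setMinCount(5)       // ignore words rarer than this
      .fit(input)

    // findSynonyms ranks words by cosine similarity to the query word
    model.findSynonyms("day", 5).foreach { case (word, cosSim) =>
      println(f"$word%-12s $cosSim%.4f")
    }

    sc.stop()
  }
}
```

The fitted model exposes the learned vectors, so relationships such as analogies can also be explored by arithmetic on the vectors returned by model.transform.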