Most of the text analysis data mining algorithms operate on vector data. We can use a vector space model to represent text data as a set of vectors. For an example, we can build a vector space model by taking the set of all terms that appear in the dataset and by assigning an index to each term in the term set. Number of terms in the term set is the dimensionality of the resulting vectors and each dimension of the vector corresponds to a term. For each document, the vector contains the number of occurrences of each term at the index location assigned to that particular term. This creates vector space model using term frequencies in each document, similar to the result of the computation we perform in the Generating an inverted index using Hadoop MapReduce recipe of Chapter 7, Searching and Indexing.
However, creating vectors using the preceding term count model gives a lot of weight to...