Comparing sparse data using cosine similarity
When a data set has many empty fields, comparing items by Manhattan or Euclidean distance can give skewed results. Cosine similarity instead measures how closely two vectors are aligned in direction, regardless of their magnitudes. For example, the vectors (82, 86) and (86, 82) point in nearly the same direction. In fact, their cosine similarity equals that of (41, 43) and (43, 41), because halving both vectors does not change the angle between them. A cosine similarity of 1 corresponds to vectors that point in exactly the same direction, and 0 corresponds to vectors that are orthogonal to each other.
As long as the angle between two vectors stays the same, their cosine similarity stays the same, whatever their lengths. A distance metric such as the Manhattan or Euclidean distance, by contrast, treats the two pairs of vectors above quite differently: the distance between (82, 86) and (86, 82) is twice the distance between (41, 43) and (43, 41).
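To make the contrast concrete, here is a minimal Python sketch that computes both distance metrics for the two pairs of vectors. The function names manhattan and euclidean and the variable names are illustrative choices, not part of the original text.

```python
import math

def manhattan(a, b):
    """Sum of absolute coordinate differences between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The two pairs of vectors from the text; the second pair is the first halved.
pair_large = ((82, 86), (86, 82))
pair_small = ((41, 43), (43, 41))

print("Manhattan:", manhattan(*pair_large), manhattan(*pair_small))
print("Euclidean:", euclidean(*pair_large), euclidean(*pair_small))
```

Both metrics report the larger pair as twice as far apart as the smaller pair, even though the angle between the vectors is identical in each pair.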
The cosine similarity between the two vectors is the dot product of the two vectors divided by the product of their magnitudes.
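The definition above can be sketched in a few lines of Python. This is an illustrative implementation rather than the original author's code: the name cosine_similarity is hypothetical, and raising an error for a zero-length vector is an assumption, since the cosine is undefined when either magnitude is zero.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    if mag_a == 0 or mag_b == 0:
        # Assumption: treat a zero vector as an error, since it has no direction.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (mag_a * mag_b)

# Both pairs from the text give the same similarity, because scaling
# the vectors does not change the angle between them.
print(cosine_similarity((82, 86), (86, 82)))
print(cosine_similarity((41, 43), (43, 41)))

# Orthogonal vectors have a similarity of 0.
print(cosine_similarity((1, 0), (0, 1)))
```

Both pairs from the text yield the same value, just under 1, while the orthogonal pair yields 0. An empty field encoded as 0 adds nothing to the dot product, which is one reason this measure is often a reasonable fit for sparse data.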