A very useful distance metric between strings is provided by the TfIdfDistance class. It is closely related to the distance metric used by the popular open source search engine Lucene (and by Solr and Elasticsearch, which are built on it), where the strings being compared are the query and the documents in the index. TF-IDF names the core formula: term frequency (TF) times
inverse document frequency (IDF) for terms shared by the query and the document. A useful consequence of this approach is that common terms (for example, "the") that appear in most documents are downweighted, while rare terms are upweighted in the distance comparison. This helps focus the distance on the terms that actually discriminate between documents in the collection.
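To make the weighting concrete, here is a minimal, self-contained sketch of a TF-IDF comparison. It is not LingPipe's TfIdfDistance implementation: the class name TfIdfSketch, the whitespace tokenizer, the raw-count term frequency, the log(N/df) form of IDF, and the cosine normalization are all assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; this is NOT LingPipe's TfIdfDistance.
class TfIdfSketch {

    // Assumed tokenizer: lowercase, then split on whitespace.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Document frequency: the number of documents containing each term.
    static Map<String, Integer> documentFrequencies(List<String> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : corpus) {
            for (String term : termCounts(doc).keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Weight of each term = count * log(N / df). Terms never seen in the
    // corpus get weight 0 (an assumption made for this sketch).
    static Map<String, Double> weights(String text, Map<String, Integer> df, int numDocs) {
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : termCounts(text).entrySet()) {
            Integer n = df.get(e.getKey());
            double idf = (n == null) ? 0.0 : Math.log((double) numDocs / n);
            w.put(e.getKey(), e.getValue() * idf);
        }
        return w;
    }

    // Cosine of the two TF-IDF vectors: only shared terms add to the dot product.
    static double proximity(String a, String b, Map<String, Integer> df, int numDocs) {
        Map<String, Double> wa = weights(a, df, numDocs);
        Map<String, Double> wb = weights(b, df, numDocs);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : wa.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = wb.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : wb.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // no informative terms on one side
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList("the cat sat", "the dog ran", "the bird flew");
        Map<String, Integer> df = documentFrequencies(corpus);
        int n = corpus.size();
        // The two strings share only "the", which occurs in every document.
        System.out.println(proximity("the cat", "the dog", df, n));
        // The two strings share "cat", which occurs in one document.
        System.out.println(proximity("the cat", "cat sat", df, n));
    }
}
```

In this toy corpus, "the" occurs in all three documents, so its IDF is log(3/3) = 0 and it contributes nothing to the comparison. The first pair therefore scores 0 despite the shared word, while the second pair gets a positive score from the rarer term "cat".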
Not only does TfIdfDistance come in handy for search-engine-like applications, it can also be very useful for clustering and for any problem that calls for document similarity without supervised training data. It has a desirable property: scores are...