Book Image

Clojure for Data Science

By : Henry Garner
Book Image

Clojure for Data Science

By: Henry Garner

Overview of this book

Table of Contents (18 chapters)
Clojure for Data Science
Credits
About the Author
Acknowledgments
About the Reviewer
www.PacktPub.com
Preface
Index

Jaccard similarity for large sets with MinHash


The Bloom filter is a probabilistic data structure to determine whether an item is a member of a set. While comparing user or item similarities, what we are usually interested in is the intersection between sets, as opposed to their precise contents. MinHash is a technique that enables a large set to be compressed in such a way that we can still perform the Jaccard similarity on the compressed representations.

Let's see how it works with a reference to two of the most prolific raters in the MovieLens dataset. Users 405 and 655 have rated 727 and 675 movies respectively. In the following code, we extract their ratings and convert them into sets before passing to Incanter's jaccard-index function. Recall that this returns the ratio of movies they've both rated out of all the movies they've rated:

(defn rated-items [user-ratings id]
  (->> (get user-ratings id)
       (map :item)))

(defn ex-7-22 []
  (let [ratings      (load-ratings "ua.base...