Programming MapReduce with Scalding

By: Antonios Chalkiopoulos

Setting a similarity using the Jaccard index


Quite often in machine learning, we have to work with sets of data. Users like posts, buy products, listen to music, or watch movies. In these cases, the data is structured in two columns: 'user and 'item.

To calculate correlations on such data, we need to work with sets. The Jaccard similarity coefficient is a statistic that measures the similarity between sets. It is calculated as the size of the intersection divided by the size of the union of the sample sets.
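
Written as a formula, with A and B denoting the two sets being compared (the symbols are chosen here for illustration), the coefficient is:

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$

The value ranges from 0, when the sets share no elements, to 1, when the sets are identical.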

For example, if two users in the dataset are related to the same two items (item1 and item2), and each user is also related to one distinct item of their own, the Jaccard similarity between items indicates the following:

  • The similarity between item1 and item2 is 100 percent

  • The similarity between a common item and a distinct item is 50 percent

  • The similarity between two distinct items is 0 percent
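
The three percentages above can be checked with a small sketch in plain Scala collections. This is not Scalding pipeline code, and it is not necessarily how the book implements the calculation. The object name JaccardExample, the helper names, and the sample users and items are all assumptions made for illustration. Each item is represented by the set of users related to it, and two items are compared through those sets.

```scala
// Worked example with plain Scala collections (not Scalding pipes).
// All names and sample data below are hypothetical.
object JaccardExample {

  // Jaccard similarity: |A ∩ B| / |A ∪ B|.
  // Two empty sets are treated as 0.0 here, which is a convention, not a rule.
  def jaccard[T](a: Set[T], b: Set[T]): Double = {
    val unionSize = (a | b).size
    if (unionSize == 0) 0.0 else (a & b).size.toDouble / unionSize
  }

  // (user, item) pairs: both users share item1 and item2,
  // and each user also has one distinct item.
  val likes: Seq[(String, String)] = Seq(
    "user1" -> "item1", "user1" -> "item2", "user1" -> "item3",
    "user2" -> "item1", "user2" -> "item2", "user2" -> "item4"
  )

  // For each item, the set of users related to it.
  val usersByItem: Map[String, Set[String]] =
    likes.groupBy(_._2).map { case (item, pairs) => item -> pairs.map(_._1).toSet }

  // Similarity of two items, measured on their user sets.
  def similarity(x: String, y: String): Double =
    jaccard(
      usersByItem.getOrElse(x, Set.empty[String]),
      usersByItem.getOrElse(y, Set.empty[String])
    )

  def main(args: Array[String]): Unit = {
    println(similarity("item1", "item2")) // 1.0: both items have users {user1, user2}
    println(similarity("item1", "item3")) // 0.5: one shared user out of two in the union
    println(similarity("item3", "item4")) // 0.0: no shared users
  }
}
```

Here item1 and item2 are the common items, while item3 and item4 are the distinct ones, so the three calls correspond to the three bullet points in order.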

To begin the implementation, we first need to calculate the item popularity, and then add the popularity back to the...