Spark Cookbook

By: Rishi Yadav

Collaborative filtering using implicit feedback


Sometimes the available feedback is not a set of ratings but a record of behavior, such as audio tracks played or movies watched. At first glance, this data may not look as useful as explicit user ratings, but it is much more comprehensive.

Getting ready

We are going to use the Million Song Dataset Challenge data from http://www.kaggle.com/c/msdchallenge/data. You need to download three files:

  • kaggle_visible_evaluation_triplets.txt

  • kaggle_users.txt

  • kaggle_songs.txt

Now perform the following steps:

  1. Create a songdata folder in HDFS to hold the three files:

    $ hdfs dfs -mkdir songdata
    
  2. Upload the song data to HDFS:

    $ hdfs dfs -put kaggle_visible_evaluation_triplets.txt songdata/
    $ hdfs dfs -put kaggle_users.txt songdata/
    $ hdfs dfs -put kaggle_songs.txt songdata/
    

We still need to do some more preprocessing. ALS in MLlib takes both user and product IDs as integers. The kaggle_songs.txt file has a sequence number next to each song ID; the kaggle_users.txt file does not...
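
To illustrate the kind of preprocessing involved, here is a minimal sketch of one possible way to attach an integer sequence number to each user ID from the command line. It is not necessarily the approach this recipe takes. The file names sample_users.txt and sample_users_indexed.txt and the three user IDs are hypothetical stand-ins, and the sketch assumes one ID per line and a space-separated "ID number" layout in the output.

```shell
# Hypothetical sample standing in for kaggle_users.txt (one user ID per line).
printf '%s\n' user_a user_b user_c > sample_users.txt

# Append a 1-based line number to each ID, giving every user an integer key
# in an "<id> <number>" layout (assumed here, not taken from the recipe).
awk '{ print $0, NR }' sample_users.txt > sample_users_indexed.txt

# Show the result.
cat sample_users_indexed.txt
```

Each line of the output pairs the original string ID with an integer that could serve as the user ID for ALS.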