For our use case, we use a subset of the Million Song Dataset, obtained from the University of California, Irvine (UCI) Machine Learning Repository (Lichman, 2013). There are 515,345 cases: the first 463,715 are training cases and the last 51,630 are used for testing. The first column of the dataset contains the year, and the remaining columns are features derived from the timbre of the song. The dataset is described at http://archive.ics.uci.edu/ml/datasets/YearPredictionMSD. Our goal is to predict the year each song was released.
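Because the training and test cases are defined by row position rather than by a label column, the split reduces to simple row indexing once the data is in a data frame. The following is a minimal sketch on a toy data frame so that it runs without the real file; the names songs, timbre_1, timbre_2, and split_by_position are hypothetical and used only for illustration. With the real data, n_train would be 463715.

```r
# Toy stand-in for the real data: the first column is the year,
# the remaining columns are features (hypothetical names).
set.seed(1)
songs <- data.frame(
  year     = sample(1990:2010, 10, replace = TRUE),
  timbre_1 = rnorm(10),
  timbre_2 = rnorm(10)
)

# Split by row position: the first n_train rows become the training set,
# all remaining rows become the test set.
split_by_position <- function(df, n_train) {
  stopifnot(n_train >= 1, n_train < nrow(df))
  list(
    train = df[seq_len(n_train), , drop = FALSE],
    test  = df[(n_train + 1):nrow(df), , drop = FALSE]
  )
}

# For the real dataset this would be n_train = 463715.
parts <- split_by_position(songs, n_train = 8)
nrow(parts$train)
nrow(parts$test)
```

Splitting by position keeps the split fixed across runs, so no random sampling is involved in deciding which cases are held out.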
First we need to download the data and then unzip it, which we can do using the following code:
download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/00203/YearPredictionMSD.txt.zip", destfile = "YearPredictionMSD.txt.zip")
unzip("YearPredictionMSD.txt.zip")
Now we can read the data into R using fread() from the data.table package. The fread() function is preferable...