Datafu is a Pig UDF library open-sourced by the SNA team at LinkedIn. It contains many useful functions. This recipe will use play counts from the Audioscrobbler dataset and the Quantile UDF from datafu to identify and remove outliers.
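To make the idea of a percentile cutoff concrete before running it on a cluster, here is a minimal sketch in plain Python (not Pig, and not datafu's implementation). It assumes a nearest-rank definition of the quantile, which may differ from how datafu's Quantile UDF computes its estimate, and it assumes that "removing" outliers means dropping records above the threshold. The function names and the sample rows are made up for illustration.

```python
import math


def nearest_rank_quantile(values, q):
    """Return the q-th quantile (0 < q <= 1) of values using the nearest-rank method.

    Assumption: nearest-rank is used here for simplicity; datafu's Quantile
    UDF may use a different estimation method.
    """
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # 1-based rank of the quantile within the sorted values
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def remove_outliers(records, q=0.90):
    """Drop (user_id, artist_id, playcount) records whose playcount exceeds the q-th quantile."""
    threshold = nearest_rank_quantile([r[2] for r in records], q)
    kept = [r for r in records if r[2] <= threshold]
    return threshold, kept


# Hypothetical sample rows in the same (user_id, artist_id, playcount) shape
sample = [
    (1, 10, 1), (1, 11, 2), (2, 10, 3), (2, 12, 4), (3, 13, 5),
    (3, 14, 6), (4, 15, 7), (4, 16, 8), (5, 17, 9), (5, 18, 5000),
]

threshold, kept = remove_outliers(sample)
print(threshold, len(kept))
```

In this sample the single extreme play count of 5000 lies above the ninetieth percentile threshold and is the only record dropped. Capping such values at the threshold, instead of dropping the record, is an equally plausible treatment.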
Download version 0.0.4 of datafu from https://github.com/linkedin/datafu/downloads. Uncompress and untar the files. Add the datafu-0.0.4/dist/datafu-0.0.4.jar file to a location accessible by Pig.
Download the Audioscrobbler dataset from http://www.packtpub.com/support.
Register the datafu JAR file and construct the Quantile UDF:
register /path/to/datafu-0.0.4.jar;
define Quantile datafu.pig.stats.Quantile('.90');
Load the user_artist_data.txt file:
plays = load '/data/audioscrobbler.txt' using PigStorage(' ') as (user_id:long, artist_id:long, playcount:long);
Group all of the data:
plays_grp = group plays ALL;
Generate the ninetieth percentile value to be used as the...