Often, simply querying data won't do everything we need. For instance, the data may not be in a form we can use. In that case, we'll need to transform the data. We can do that easily in Cascalog too.
For this recipe, we'll define a custom operation and use it to split year ranges in the form 2000–2010 into two fields.
We'll use the same dependencies and includes that we did in the Distributed processing with Cascalog and Hadoop recipe. We'll also use the Doctor Who companion data from that recipe.
We'll define a new, custom operation that takes a date range string and splits it into two values. In this dataset, the ranges are split on an en dash (#"\u2013"). If the input isn't a range (that is, it's just a year), then the year is returned for both the start and end of the range.

(defmapop split-range [date-range]
  (let [[from to] (string/split (str date-range) #"\u2013" 2)]
    [from (if (nil? to) from (str (.substring from 0 2) to))])...
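The splitting logic can be tried at the REPL without Cascalog by moving it into an ordinary function. The following is only a minimal sketch: the name split-year-range is hypothetical and not part of the recipe. The operation as written prepends the first two characters of the start year to the end year, which suits abbreviated ranges such as 2005–10 but would double the century for a full end year such as 2010. The sketch assumes both forms may occur and only borrows the leading digits when the end year is shorter than the start year.

```clojure
(require '[clojure.string :as string])

;; Hypothetical helper (not from the recipe): the same splitting logic
;; as the custom operation, written as a plain Clojure function.
(defn split-year-range
  "Splits a year or an en-dash-separated year range into [from to].
  A single year is returned as both the start and the end."
  [date-range]
  (let [[from to] (string/split (str date-range) #"\u2013" 2)]
    [from
     (cond
       ;; No en dash in the input: it is a single year.
       (nil? to) from
       ;; Abbreviated end year (assumption: e.g. "10" for "2010"):
       ;; borrow the missing leading digits from the start year.
       (< (count to) (count from))
       (str (subs from 0 (- (count from) (count to))) to)
       ;; Full end year: use it unchanged.
       :else to)]))

;; Usage, with the en dash written as the \u2013 escape:
(split-year-range "2000\u20132010") ; full range
(split-year-range "2005\u201310")   ; abbreviated end year
(split-year-range "2005")           ; single year
```

Keeping the logic in a plain function like this makes it easy to check edge cases before wrapping it in a Cascalog operation.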