So far, the Cascalog queries we've seen have all returned tables of results. Sometimes, however, we'll want to aggregate those results, either boiling a table down to a single value or producing a table in which each row summarizes a group from the original data.
Cascalog makes this easy too, and it includes a number of aggregate functions. For this recipe, we'll use only one, cascalog.ops/count, but you can easily find more in the API documentation on the Cascalog website (http://nathanmarz.github.com/cascalog/cascalog.ops.html).
We'll use the same dependencies and imports as we did in the Distributed processing with Cascalog and Hadoop recipe, along with the same data that we defined there.