One way to deal with very large datasets is to sample them. This can be especially useful when we're getting started and want to explore a dataset. A good sample can tell us what's in the full dataset and what we'll need to do in order to clean and process it. Surveys and election exit polls rely on samples in the same way.
In this recipe, we'll see a couple of ways of creating samples.
We'll use a basic project.clj file:
(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])
There are two ways to sample from a stream of values. If you want 10 percent of the larger population, you can just take every tenth item. If you want 1,000 out of who knows how many items, the process is a little more complicated.
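The two approaches might be sketched as follows. This is only an illustration under stated assumptions, not necessarily the code this recipe goes on to use: the function names take-every, sample-percent, and sample-amount are hypothetical, and the fixed-size version uses reservoir sampling (Algorithm R), which is one common way to draw a set number of items from a stream of unknown length.

```clojure
;; Hypothetical helper: keep every nth item, e.g. n = 10 for 10 percent.
;; take-nth always starts with the first item of the collection.
(defn take-every
  [n coll]
  (take-nth n coll))

;; Hypothetical variation: keep each item independently with
;; probability k (between 0.0 and 1.0), so the sample is randomized
;; and its size is only approximately k times the population.
(defn sample-percent
  [k coll]
  (filter (fn [_] (< (rand) k)) coll))

;; Hypothetical helper: reservoir sampling (Algorithm R).
;; Returns a vector of at most k items drawn from coll in one pass,
;; without knowing the length of coll in advance.
(defn sample-amount
  [k coll]
  (let [[head tail] (split-at k coll)]
    (first
      (reduce (fn [[reservoir seen] item]
                (let [seen (inc seen)        ; items seen so far, including this one
                      slot (rand-int seen)]  ; uniform in [0, seen)
                  ;; Replace a reservoir entry only when slot falls
                  ;; inside the reservoir, which happens with
                  ;; probability k / seen.
                  [(if (< slot k)
                     (assoc reservoir slot item)
                     reservoir)
                   seen]))
              [(vec head) k]   ; start with the first k items
              tail))))
```

For example, (take-every 10 (range 30)) yields the items 0, 10, and 20, while (sample-amount 1000 coll) holds only 1,000 items in memory at a time, however long coll turns out to be. If coll has fewer than k items, sample-amount simply returns all of them.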