One way to deal with very large data sets is to sample them. This can be especially useful when we're first getting started and we want to explore a dataset. A good sample can tell us what's in the full dataset and what we'll need to do to clean and process it.
In this recipe, we'll see a couple of ways of creating samples.
There are two ways to sample from a stream of values. If we want 10 percent of the larger population, we can just take every tenth item. If we want 1000 out of who-knows-how-many items, the process is a little more complicated.
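The "every tenth item" idea can be sketched with Clojure's built-in take-nth. The helper name sample-every-nth is hypothetical and not part of this recipe. Note that this is systematic rather than random sampling, so it assumes the order of the input is unrelated to the values in it.

```clojure
;; Hypothetical helper: keep the first item and every nth item after it.
(defn sample-every-nth [n coll]
  (take-nth n coll))

;; Keeping every tenth item gives a 10 percent sample.
(def every-tenth (sample-every-nth 10 (range 100)))
;; every-tenth => (0 10 20 30 40 50 60 70 80 90)
```

Because take-nth is lazy, this should also work on a stream whose length isn't known in advance.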
Performing a rough sampling by percentage is pretty simple, as shown in the following code snippet:
(defn sample-percent [k coll]
  (filter (fn [_] (<= (rand) k)) coll))
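Using it is just as simple. Note that k is a fraction between 0 and 1 rather than a percentage, so 0.01 keeps roughly 1 percent of the items: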
user=> (sample-percent 0.01 (range 1000))
(141 146 155 292 598 624 629 640 759 815 852 889)
user=> (count *1)
12
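Because each item is kept or dropped independently, the size of the sample is only approximately k times the size of the input, which is why the run above returned 12 items rather than exactly 10. The following is a minimal sketch of how we might see that variation. It repeats the sample-percent definition so that it stands alone, and the name sample-counts is just for illustration.

```clojure
(defn sample-percent [k coll]
  (filter (fn [_] (<= (rand) k)) coll))

;; Draw five independent 1 percent samples from the same input
;; and record how many items each one kept.
(def sample-counts
  (repeatedly 5 #(count (sample-percent 0.01 (range 1000)))))

;; The counts are random, so they will differ from run to run,
;; but they should cluster around 10.
(println sample-counts)
```

If we need an exact number of items, this kind of percentage sampling isn't the right tool.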