Clojure Data Analysis Cookbook - Second Edition

By: Eric Richard Rochester

Sampling from very large data sets


One way to deal with very large datasets is to sample them. This is especially useful when we're getting started and want to explore a dataset. A good sample can tell us what's in the full dataset and what we'll need to do to clean and process it. The same idea underlies surveys and election exit polls, which draw conclusions about a whole population from a small part of it.

In this recipe, we'll see a couple of ways of creating samples.

Getting ready

We'll use a basic project.clj file:

(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])

How to do it…

There are two ways to sample from a stream of values. If you want 10 percent of the larger population, you can just take every tenth item. If you want 1,000 out of who knows how many items, the process is a little more complicated.
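
As a minimal sketch of the "every tenth item" idea, Clojure's built-in `take-nth` can be applied lazily to a stream. The `every-tenth` helper name and the `(range 1000)` input are illustrative assumptions, not part of the recipe:

```clojure
;; Hypothetical helper: keep the first item and every tenth item after it.
;; take-nth is lazy, so this also works on streams too large for memory.
(defn every-tenth
  [coll]
  (take-nth 10 coll))

;; Illustrative input: 1,000 integers standing in for a real data stream.
(println (count (every-tenth (range 1000))))
```

This kind of sampling is deterministic, so it can be skewed if the input has a pattern that repeats every ten items. The percentage-based approach in this recipe avoids that by choosing items at random.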

Sampling by percentage

  1. Performing a rough sampling by percentage is pretty simple:

    (defn sample-percent
      "Keeps each item of coll with probability k, a fraction between 0.0 and 1.0."
      [k coll]
      (filter (fn [_] (<= (rand) k)) coll))

  2. Using it is also simple:

    user=> (sample-percent...
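
Separately from the REPL call above, the following sketch illustrates why this sampling is "rough": each item is kept independently, so the sample size only approximates the requested fraction. The `population` and `sampled` names, the 0.1 fraction, and the 10,000-item input are assumptions made for illustration:

```clojure
;; Same definition as in step 1, repeated so this snippet runs on its own.
(defn sample-percent
  [k coll]
  (filter (fn [_] (<= (rand) k)) coll))

;; Hypothetical input: the integers 0..9999 stand in for a real data stream.
(def population (range 10000))

;; Realize the lazy sequence once so every later use sees the same sample.
(def sampled (doall (sample-percent 0.1 population)))

;; The count changes from run to run; it should land near 1,000 here,
;; but it will rarely be exactly 1,000.
(println "sample size:" (count sampled))
```

Because `filter` is lazy and preserves order, the sample keeps the items in their original sequence and never needs the whole input in memory at once.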