Mastering Clojure Data Analysis

By: Eric Richard Rochester

Dealing with messy data


The first thing we need to deal with is the qualitative data in the shape and description fields.

The shape field seems like a likely place to start. Let's see how many items have good data for it:

user=> (def data (m/read-data "data/ufo_awesome.tsv"))
user=> (count (remove (comp str/blank? :shape) data))
58870
user=> (count (filter (comp str/blank? :shape) data))
2523
user=> (count data)
61393
user=> (float 2506/61137)
0.04098991
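
The same tally can be produced in a single pass, which keeps the counts and the ratio derived from one another. The following is only a minimal sketch: the `sample` records and the `blank-shape-summary` function are hypothetical names introduced for illustration, and the real records would come from the data file read earlier.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample records standing in for the real UFO data.
(def sample
  [{:shape "light"} {:shape " disk "} {:shape ""} {:shape nil} {:shape "light"}])

(defn blank-shape-summary
  "Counts the records with and without a usable :shape value,
  and reports the share of records where it is missing."
  [records]
  (let [;; str/blank? is true for nil, \"\", and whitespace-only strings
        groups  (group-by (comp str/blank? :shape) records)
        missing (count (get groups true))
        present (count (get groups false))
        total   (+ missing present)]
    {:present       present
     :missing       missing
     :total         total
     ;; guard against dividing by zero on an empty collection
     :missing-ratio (if (zero? total) 0.0 (double (/ missing total)))}))

(blank-shape-summary sample)
```

Grouping on the predicate means the data is walked once instead of once per count, and the ratio is always computed from the counts that were just measured.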

So about 4 percent of the records have no meaningful value in the shape field. Let's see what the most popular values for that field are:

user=> (def shape-freqs
           (frequencies
             (map str/trim
                  (map :shape
                       (remove (comp str/blank? :shape) data)))))
#'user/shape-freqs
user=> (pprint (take 10 (reverse (sort-by second shape-freqs))))
(["light" 12202]
 ["triangle" 6082]
 ["circle" 5271]
 ["disk" 4825]
 ["other" 4593]
 ["unknown" 4490]
 ["sphere" 3637]
 ["fireball...