Book Image

Clojure for Data Science

By : Henry Garner
Book Image

Clojure for Data Science

By: Henry Garner

Overview of this book

Table of Contents (18 chapters)
Clojure for Data Science
Credits
About the Author
Acknowledgments
About the Reviewer
www.PacktPub.com
Preface
Index

Data scrubbing


It is a commonly repeated statistic that at least 80 percent of a data scientist's work is data scrubbing. This is the process of detecting potentially corrupt or incorrect data and either correcting or filtering it out.

Note

Data scrubbing is one of the most important (and time-consuming) aspects of working with data. It's a key step to ensuring that subsequent analysis is performed on data that is valid, accurate, and consistent.

The nil value at the end of the election year column may indicate dirty data that ought to be removed. We've already seen that filtering columns of data can be accomplished with Incanter's i/$ function. For filtering rows of data we can use Incanter's i/query-dataset function.

We let Incanter know which rows we'd like it to filter by passing a Clojure map of column names and predicates. Only rows for which all predicates return true will be retained. For example, to select only the nil values from our dataset:

(-> (load-data :uk)
    (i/query-dataset {"Election Year" {:$eq nil}}))

If you know SQL, you'll notice this is very similar to a WHERE clause. In fact, Incanter also provides the i/$where function, an alias to i/query-dataset that reverses the order of the arguments.

The query is a map of column names to predicates and each predicate is itself a map of operator to operand. Complex queries can be constructed by specifying multiple columns and multiple operators together. Query operators include:

  • :$gt greater than

  • :$lt less than

  • :$gte greater than or equal to

  • :$lte less than or equal to

  • :$eq equal to

  • :$ne not equal to

  • :$in to test for membership of a collection

  • :$nin to test for non-membership of a collection

  • :$fn a predicate function that should return a true response for rows to keep

If none of the built-in operators suffice, the last operator provides the ability to pass a custom function instead.

We'll continue to use Clojure's thread-last macro to make the code intention a little clearer, and return the row as a map of keys and values using the i/to-map function:

(defn ex-1-5 []
  (->> (load-data :uk)
       (i/$where {"Election Year" {:$eq nil}})
       (i/to-map)))

;; {:ILEU nil, :TUSC nil, :Vote nil ... :IVH nil, :FFR nil}

Looking at the results carefully, it's apparent that all (but one) of the columns in this row are nil. In fact, a bit of further exploration confirms that the non-nil row is a summary total and ought to be removed from the data. We can remove the problematic row by updating the predicate map to use the :$ne operator, returning only rows where the election year is not equal to nil:

(->> (load-data :uk)
      (i/$where {"Election Year" {:$ne nil}}))

The preceding function is one we'll almost always want to make sure we call in advance of using the data. One way of doing this is to add another implementation of our load-data multimethod, which also includes this filtering step:

(defmethod load-data :uk-scrubbed [_]
  (->> (load-data :uk)
       (i/$where {"Election Year" {:$ne nil}})))

Now with any code we write, can choose whether to refer to the :uk or :uk-scrubbed datasets.

By always loading the source file and performing our scrubbing on top, we're preserving an audit trail of the transformations we've applied. This makes it clear to us—and future readers of our code—what adjustments have been made to the source. It also means that, should we need to re-run our analysis with new source data, we may be able to just load the new file in place of the existing file.