Clojure for Data Science

By: Garner

Overview of this book

The term “data science” has been widely used to define this new profession that is expected to interpret vast datasets and translate them to improved decision-making and performance. Clojure is a powerful language that combines the interactivity of a scripting language with the speed of a compiled language. Together with its rich ecosystem of native libraries and an extremely simple and consistent functional approach to data manipulation, which maps closely to mathematical formulae, it is an ideal, practical, and flexible language to meet a data scientist’s diverse needs. Taking you on a journey from simple summary statistics to sophisticated machine learning algorithms, this book shows how the Clojure programming language can be used to derive insights from data. Data scientists often forge a novel path, and you’ll see how to make use of Clojure’s Java interoperability capabilities to access libraries such as Mahout and MLlib for which Clojure wrappers don’t yet exist. Even seasoned Clojure developers will develop a deeper appreciation for their language’s flexibility! You’ll learn how to apply statistical thinking to your own data and use Clojure to explore, analyze, and visualize it in a technically and statistically robust way. You can also use Incanter for local data processing and ClojureScript to present interactive visualisations, and understand how distributed platforms such as Hadoop and Spark’s MapReduce and GraphX’s BSP solve the challenges of data analysis at scale, and how to explain algorithms using those programming models. Above all, by following the explanations in this book, you’ll learn not just how to be effective using the current state-of-the-art methods in data science, but why such methods work, so that you can continue to be productive as the field evolves.

Data scrubbing

It is a commonly repeated statistic that at least 80 percent of a data scientist's work is data scrubbing. This is the process of detecting potentially corrupt or incorrect data and either correcting or filtering it out.

Note

Data scrubbing is one of the most important (and time-consuming) aspects of working with data. It's a key step to ensuring that subsequent analysis is performed on data that is valid, accurate, and consistent.

The nil value at the end of the election year column may indicate dirty data that ought to be removed. We've already seen that filtering columns of data can be accomplished with Incanter's i/$ function. For filtering rows of data we can use Incanter's i/query-dataset function.

We let Incanter know which rows we'd like it to filter by passing a Clojure map of column names and predicates. Only rows for which all predicates return true will be retained. For example, to select only the nil values from our dataset:

(-> (load-data :uk)
    (i/query-dataset {"Election Year" {:$eq nil}}))

If you know SQL, you'll notice this is very similar to a WHERE clause. In fact, Incanter also provides the i/$where function, an alias to i/query-dataset that reverses the order of the arguments.
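
The difference in argument order decides which threading macro reads naturally. The following is a minimal sketch in plain Clojure, not Incanter itself: query-rows and where-rows are hypothetical stand-ins that take the data first and last respectively, the rows are made-up maps, and the "query" is an ordinary predicate function rather than a query map.

```clojure
;; Hypothetical stand-ins for illustration only; these are NOT Incanter functions.
(defn query-rows
  "Takes the data first, mirroring the argument order of i/query-dataset."
  [rows pred]
  (filter pred rows))

(defn where-rows
  "Takes the data last, mirroring the argument order of i/$where."
  [pred rows]
  (query-rows rows pred))

;; Made-up rows, not taken from the election dataset.
(def rows
  [{"Election Year" 2010 "Votes" 100}
   {"Election Year" nil  "Votes" 100}])

(defn missing-year? [row]
  (nil? (get row "Election Year")))

;; -> threads the value in as the FIRST argument.
(def via-thread-first
  (-> rows
      (query-rows missing-year?)))

;; ->> threads the value in as the LAST argument.
(def via-thread-last
  (->> rows
       (where-rows missing-year?)))
```

Both forms select the same row. A function that takes its data last composes with ->> alongside sequence functions such as map and filter, which also take their collection last.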

The query is a map of column names to predicates and each predicate is itself a map of operator to operand. Complex queries can be constructed by specifying multiple columns and multiple operators together. Query operators include:

  • :$gt greater than
  • :$lt less than
  • :$gte greater than or equal to
  • :$lte less than or equal to
  • :$eq equal to
  • :$ne not equal to
  • :$in to test for membership of a collection
  • :$nin to test for non-membership of a collection
  • :$fn a custom predicate function that is passed the column value and should return a truthy value for rows to keep

If none of the built-in operators suffice, the last operator provides the ability to pass a custom function instead.
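
To make the shape of a query concrete, here is a minimal model of the format in plain Clojure. It illustrates the semantics described above and is not Incanter's implementation. The names operators, nil-safe, matches?, and where are hypothetical, the rows are made-up values, and treating a nil value as failing every ordering comparison is an assumption of this sketch.

```clojure
;; Illustrative model only; NOT Incanter's implementation.
(defn nil-safe
  "Wraps an ordering function so that a nil value never matches
  (an assumption of this sketch, which also avoids a NullPointerException)."
  [compare-fn]
  (fn [value operand]
    (and (some? value) (compare-fn value operand))))

(def operators
  {:$gt  (nil-safe >)
   :$lt  (nil-safe <)
   :$gte (nil-safe >=)
   :$lte (nil-safe <=)
   :$eq  =
   :$ne  not=
   :$in  (fn [value coll] (contains? (set coll) value))
   :$nin (fn [value coll] (not (contains? (set coll) value)))
   :$fn  (fn [value f] (f value))})

(defn matches?
  "True when the row satisfies every operator of every column in the query."
  [query row]
  (every? (fn [[column predicate]]
            (every? (fn [[op operand]]
                      ((operators op) (get row column) operand))
                    predicate))
          query))

(defn where
  "Keeps only the rows that match the query map."
  [query rows]
  (filter (partial matches? query) rows))

;; Made-up rows for illustration.
(def rows
  [{"Election Year" 2005 "Votes" 70000}
   {"Election Year" 2010 "Votes" 50000}
   {"Election Year" 2010 "Votes" 80000}
   {"Election Year" nil  "Votes" nil}])

;; Multiple columns: both predicates must hold.
(def recent-and-large
  (where {"Election Year" {:$eq 2010}
          "Votes"         {:$gte 60000}}
         rows))

;; Multiple operators on one column: an open interval.
(def mid-range
  (where {"Votes" {:$gt 50000 :$lt 80000}} rows))

;; A custom predicate via :$fn.
(def very-large
  (where {"Votes" {:$fn (fn [v] (and v (> v 75000)))}} rows))
```

Because every predicate must hold, adding a column or an operator to the map can only narrow the result.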

We'll continue to use Clojure's thread-last macro to make the code intention a little clearer, and return the row as a map of keys and values using the i/to-map function:

(defn ex-1-5 []
  (->> (load-data :uk)
       (i/$where {"Election Year" {:$eq nil}})
       (i/to-map)))

;; {:ILEU nil, :TUSC nil, :Vote nil ... :IVH nil, :FFR nil}

Looking at the results carefully, it's apparent that all but one of the columns in this row are nil. In fact, a bit of further exploration confirms that the one non-nil value is a summary total, so the row ought to be removed from the data. We can remove the problematic row by updating the predicate map to use the :$ne operator, returning only rows where the election year is not equal to nil:

(->> (load-data :uk)
     (i/$where {"Election Year" {:$ne nil}}))
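
When discarding rows, it can be worth checking that the kept and discarded rows together account for every row in the source. The following is a small sketch of that check in plain Clojure. The rows are made-up values, with the last one playing the part of a summary total, and missing-year? is a hypothetical helper rather than part of Incanter.

```clojure
;; Made-up rows; the last stands in for a summary-total row with no election year.
(def rows
  [{"Election Year" 2010 "Votes" 100}
   {"Election Year" 2010 "Votes" 250}
   {"Election Year" nil  "Votes" 350}])

(defn missing-year? [row]
  (nil? (get row "Election Year")))

;; remove and filter are complements, so together they partition the rows.
(def kept      (remove missing-year? rows))
(def discarded (filter missing-year? rows))

;; Sanity check: nothing was lost or duplicated by the scrubbing step.
(def row-counts-agree?
  (= (count rows)
     (+ (count kept) (count discarded))))
```

Inspecting discarded before throwing it away is how we would confirm that the rows being dropped really are totals or otherwise invalid, rather than genuine observations.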

This filtering step is one we'll almost always want to apply before using the data. One way of ensuring this is to add another implementation of our load-data multimethod, which also includes the filtering step:

(defmethod load-data :uk-scrubbed [_]
  (->> (load-data :uk)
       (i/$where {"Election Year" {:$ne nil}})))

Now, with any code we write, we can choose whether to refer to the :uk or :uk-scrubbed dataset.
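
The layering pattern can be sketched without Incanter. In the following, load-rows is a hypothetical multimethod that dispatches on its argument, raw-rows is made-up in-memory data standing in for reading the source file, and the :raw and :scrubbed keys are illustrative names.

```clojure
;; Made-up data standing in for the contents of a source file.
(def raw-rows
  [{"Election Year" 2010 "Votes" 100}
   {"Election Year" 2010 "Votes" 250}
   {"Election Year" nil  "Votes" 350}])

;; Hypothetical multimethod: dispatches on the dataset key itself.
(defmulti load-rows identity)

;; The raw implementation returns the source untouched.
(defmethod load-rows :raw [_]
  raw-rows)

;; The scrubbed implementation is defined on top of the raw one,
;; so the source data is never modified in place.
(defmethod load-rows :scrubbed [_]
  (->> (load-rows :raw)
       (remove (fn [row] (nil? (get row "Election Year"))))))
```

Callers pick a dataset by key, for example (load-rows :scrubbed), and further cleaning steps can be added as new methods that build on an existing one.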

By always loading the source file and performing our scrubbing on top, we're preserving an audit trail of the transformations we've applied. This makes it clear to us—and future readers of our code—what adjustments have been made to the source. It also means that, should we need to re-run our analysis with new source data, we may be able to just load the new file in place of the existing file.