Clojure for Data Science

By: Garner

Overview of this book

The term “data science” is widely used to describe a new profession that is expected to interpret vast datasets and translate them into improved decision-making and performance. Clojure is a powerful language that combines the interactivity of a scripting language with the speed of a compiled language. Together with its rich ecosystem of native libraries and an extremely simple and consistent functional approach to data manipulation, which maps closely to mathematical formulae, it is an ideal, practical, and flexible language to meet a data scientist’s diverse needs. Taking you on a journey from simple summary statistics to sophisticated machine learning algorithms, this book shows how the Clojure programming language can be used to derive insights from data. Data scientists often forge a novel path, and you’ll see how to make use of Clojure’s Java interoperability capabilities to access libraries such as Mahout and MLlib, for which Clojure wrappers don’t yet exist. Even seasoned Clojure developers will develop a deeper appreciation for their language’s flexibility! You’ll learn how to apply statistical thinking to your own data and use Clojure to explore, analyze, and visualize it in a technically and statistically robust way. You can also use Incanter for local data processing and ClojureScript to present interactive visualisations. You’ll understand how distributed platforms such as Hadoop and Spark’s MapReduce and GraphX’s BSP solve the challenges of data analysis at scale, and how to explain algorithms using those programming models. Above all, by following the explanations in this book, you’ll learn not just how to be effective using the current state-of-the-art methods in data science, but why such methods work, so that you can continue to be productive as the field evolves.

Inspecting the data

Throughout this chapter, and for many other chapters in this book, we'll be using the Incanter library (http://incanter.org/) to load, manipulate, and display data.

Incanter is a modular suite of Clojure libraries that provides statistical computing and visualization capabilities. Modeled after the extremely popular R environment for data analysis, it brings together the power of Clojure, an interactive REPL, and a set of powerful abstractions for working with data.

Each module of Incanter focuses on a specific area of functionality. For example, incanter-stats contains a suite of related functions for analyzing data and producing summary statistics, while incanter-charts provides a large number of visualization capabilities. incanter-core provides the most fundamental and generally useful functions for transforming data.

Each module can be included separately in your own code. For access to stats, charts, and Excel features, you could include the following in your project.clj:

  :dependencies [[incanter/incanter-core "1.5.5"]
                 [incanter/incanter-stats "1.5.5"]
                 [incanter/incanter-charts "1.5.5"]
                 [incanter/incanter-excel "1.5.5"]
                 ...]

If you don't mind including more libraries than you need, you can simply include the full Incanter distribution instead:

:dependencies [[incanter/incanter "1.5.5"]
               ...]

At Incanter's core is the concept of a dataset: a structure of rows and columns. If you have experience with relational databases, you can think of a dataset as a table. Each column in a dataset is named, and each row in the dataset has the same number of columns as every other. There are several ways to load data into an Incanter dataset, and which one we use will depend on how our data is stored:

  • If our data is a text file (a CSV or tab-delimited file), we can use the read-dataset function from incanter-io
  • If our data is an Excel file (for example, an .xls or .xlsx file), we can use the read-xls function from incanter-excel
  • For any other data source (an external database, website, and so on), as long as we can get our data into a Clojure data structure we can create a dataset with the dataset function in incanter-core

This chapter makes use of Excel data sources, so we'll be using read-xls. The function takes one required argument (the file to load) and an optional keyword argument specifying the sheet number or name. All of our examples have only one sheet, so we'll just provide the file argument as a string. The example code for this chapter begins with the following namespace declaration:

(ns cljds.ch1.data
  (:require [clojure.java.io :as io]
            [incanter.core :as i]
            [incanter.excel :as xls]))

In general, we will not reproduce the namespace declarations from the example code. This is both for brevity and because the required namespaces can usually be inferred from the symbol used to reference them. For example, throughout this book we will always refer to clojure.java.io as io, incanter.core as i, and incanter.excel as xls wherever they are used.

We'll be loading several data sources throughout this chapter, so we've created a multimethod called load-data in the cljds.ch1.data namespace:

(defmulti load-data identity)

(defmethod load-data :uk [_]
  (-> (io/resource "UK2010.xls")
      (str)
      (xls/read-xls)))

In the preceding code, we define the load-data multimethod that dispatches on the identity of the first argument. We also define the implementation that will be called if the first argument is :uk. Thus, a call to (load-data :uk) will return an Incanter dataset containing the UK data. Later in the chapter, we'll define additional load-data implementations for other datasets.
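To see the dispatch mechanism in isolation, here is a minimal sketch that does not read any files. The name describe-source, the :example key, and the returned strings are hypothetical and are not part of the book's code; defmulti, defmethod, and the :default dispatch value are standard Clojure.

```clojure
;; Hypothetical multimethod, for illustration only.
;; `identity` is the dispatch function, so the argument itself
;; becomes the dispatch value.
(defmulti describe-source identity)

(defmethod describe-source :uk [_]
  "UK data")

;; Supporting another source later is just another defmethod.
(defmethod describe-source :example [_]
  "example data")

;; :default catches any dispatch value that has no method of its own.
(defmethod describe-source :default [k]
  (str "no loader defined for " k))

(describe-source :uk)   ; => "UK data"
(describe-source :fr)   ; => "no loader defined for :fr"
```

Without the :default method, calling the multimethod with an unknown key would throw an IllegalArgumentException, which may be the preferable behaviour when a missing dataset should fail loudly.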

The first row of the UK2010.xls spreadsheet contains column names. Incanter's read-xls function will preserve these as the column names of the returned dataset. Let's begin our exploration of the data by inspecting them now—the col-names function in incanter.core returns the column names as a vector. In the following code (and throughout the book, where we use functions from the incanter.core namespace) we require it as i:

(defn ex-1-1 []
  (i/col-names (load-data :uk)))

As described in Running the examples earlier, functions beginning with ex- can be run on the command line with Leiningen like this:

lein run -e 1.1

The output of the preceding command should be the following Clojure vector:

["Press Association Reference" "Constituency Name" "Region" "Election Year" "Electorate" "Votes" "AC" "AD" "AGS" "APNI" "APP" "AWL" "AWP" "BB" "BCP" "Bean" "Best" "BGPV" "BIB" "BIC" "Blue" "BNP" "BP Elvis" "C28" "Cam Soc" "CG" "Ch M" "Ch P" "CIP" "CITY" "CNPG" "Comm" "Comm L" "Con" "Cor D" "CPA" "CSP" "CTDP" "CURE" "D Lab" "D Nat" "DDP" "DUP" "ED" "EIP" "EPA" "FAWG" "FDP" "FFR" "Grn" "GSOT" "Hum" "ICHC" "IEAC" "IFED" "ILEU" "Impact" "Ind1" "Ind2" "Ind3" "Ind4" "Ind5" "IPT" "ISGB" "ISQM" "IUK" "IVH" "IZB" "JAC" "Joy" "JP" "Lab" "Land" "LD" "Lib" "Libert" "LIND" "LLPB" "LTT" "MACI" "MCP" "MEDI" "MEP" "MIF" "MK" "MPEA" "MRLP" "MRP" "Nat Lib" "NCDV" "ND" "New" "NF" "NFP" "NICF" "Nobody" "NSPS" "PBP" "PC" "Pirate" "PNDP" "Poet" "PPBF" "PPE" "PPNV" "Reform" "Respect" "Rest" "RRG" "RTBP" "SACL" "Sci" "SDLP" "SEP" "SF" "SIG" "SJP" "SKGP" "SMA" "SMRA" "SNP" "Soc" "Soc Alt" "Soc Dem" "Soc Lab" "South" "Speaker" "SSP" "TF" "TOC" "Trust" "TUSC" "TUV" "UCUNF" "UKIP" "UPS" "UV" "VCCA" "Vote" "Wessex Reg" "WRP" "You" "Youth" "YRDPL"]

This is a very wide dataset. The first six columns in the data file are described as follows; subsequent columns break the number of votes down by party:

  • Press Association Reference: This is a number identifying the constituency (voting district, represented by one MP)
  • Constituency Name: This is the common name given to the voting district
  • Region: This is the geographic region of the UK where the constituency is based
  • Election Year: This is the year in which the election was held
  • Electorate: This is the total number of people eligible to vote in the constituency
  • Votes: This is the total number of votes cast

Whenever we're confronted with new data, it's important to take time to understand it. In the absence of detailed data definitions, one way we could do this is to begin by validating our assumptions about the data. For example, we expect that this dataset contains information about the 2010 election so let's review the contents of the Election Year column.

Incanter provides the i/$ function (i, as before, signifying the incanter.core namespace) for selecting columns from a dataset. We'll encounter the function regularly throughout this chapter—it's Incanter's primary way of selecting columns from a variety of data representations and it provides several different arities. For now, we'll be providing just the name of the column we'd like to extract and the dataset from which to extract it:

(defn ex-1-2 []
  (i/$ "Election Year" (load-data :uk)))

;; (2010.0 2010.0 2010.0 2010.0 2010.0 ... 2010.0 2010.0 nil)

The years are returned as a single sequence of values. The output may be hard to interpret since the dataset contains so many rows. As we'd like to know which unique values the column contains, we can use the Clojure core function distinct. One of the advantages of using Incanter is that its data manipulation functions augment those that Clojure already provides, as shown in the following example:

(defn ex-1-3 []
  (->> (load-data :uk)
       (i/$ "Election Year")
       (distinct)))

;; (2010.0 nil)

The 2010 year goes a long way to confirming our expectations that this data is from 2010. The nil value is unexpected, though, and may indicate a problem with our data.
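The same check can be tried at the REPL without the spreadsheet. In the sketch below, sample-years is made-up data standing in for the values of the Election Year column; distinct is the real clojure.core function.

```clojure
;; Toy stand-in for a column of values; not taken from UK2010.xls.
(def sample-years [2010.0 2010.0 2010.0 nil])

;; distinct keeps the first occurrence of each value, in order,
;; and treats nil as a value like any other.
(distinct sample-years)
;; => (2010.0 nil)
```

Because nil is kept rather than silently dropped, distinct is a cheap first test for missing values in a column.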

We don't yet know how many nils exist in the dataset, and determining this could help us decide what to do next. A simple way of counting values such as this is to use the core library function frequencies, which returns a map of values to counts:

(defn ex-1-4 []
  (->> (load-data :uk)
       (i/$ "Election Year")
       (frequencies)))

;; {2010.0 650 nil 1}
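Since frequencies returns an ordinary Clojure map, the number of nils can be read straight out of it. The following is a sketch on made-up data: sample-column and column-counts are hypothetical names, and the values are not taken from the spreadsheet.

```clojure
;; Toy stand-in for a column of values; not taken from UK2010.xls.
(def sample-column [2010.0 2010.0 2010.0 nil])

(def column-counts (frequencies sample-column))
;; => {2010.0 3, nil 1}

;; nil is a valid map key; 0 is the default when no nils are present.
(get column-counts nil 0)
;; => 1

;; The same count computed directly, without the intermediate map.
(->> sample-column
     (filter nil?)
     (count))
;; => 1
```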

In the preceding examples, we used Clojure's thread-last macro ->> to chain several functions together for legibility.

Tip

Along with Clojure's large core library of data manipulation functions, macros such as the thread-last macro ->> are another great reason for using Clojure to analyze data. Throughout this book, we'll see how Clojure can make even sophisticated analysis concise and comprehensible.
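To make the effect of ->> concrete, the following sketch compares a threaded pipeline with the equivalent nested calls. The vector xs and the names threaded and nested are illustrative only; every function used is from clojure.core.

```clojure
;; Illustrative input, unrelated to the election data.
(def xs [3 1 nil 3 2 nil])

;; Threaded form: each result becomes the LAST argument of the next call.
(def threaded
  (->> xs
       (remove nil?)
       (distinct)
       (sort)))

;; The same pipeline written inside-out.
(def nested
  (sort (distinct (remove nil? xs))))

(= threaded nested) ; => true
threaded            ; => (1 2 3)

;; macroexpand-1 shows the rewriting that the macro performs.
(macroexpand-1 '(->> xs (remove nil?) (distinct) (sort)))
;; => (sort (distinct (remove nil? xs)))
```

The threaded version reads top to bottom in the order the steps are applied, which is why it tends to be easier to follow as pipelines grow.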

It wouldn't take us long to confirm that in 2010 the UK had 650 electoral districts, known as constituencies. Domain knowledge such as this is invaluable when sanity-checking new data. Thus, it's highly probable that the nil value is extraneous and can be removed. We'll see how to do this in the next section.