Clojure Data Analysis Cookbook

By: Eric Rochester

Overview of this book

Data is everywhere and it's increasingly important to be able to gain insights that we can act on. Using Clojure for data analysis and collection, this book will show you how to gain fresh insights and perspectives from your data with an essential collection of practical, structured recipes. "The Clojure Data Analysis Cookbook" presents recipes for every stage of the data analysis process. Whether scraping data off a web page, performing data mining, or creating graphs for the web, this book has something for the task at hand. You'll learn how to acquire data, clean it up, and transform it into useful graphs which can then be analyzed and published to the Internet. Coverage includes advanced topics like processing data concurrently, applying powerful statistical techniques like Bayesian modelling, and even data mining algorithms such as K-means clustering, neural networks, and association rules.

Reading RDF data with SPARQL

In the previous recipe, the embedded domain-specific language (EDSL) used for the query was converted to SPARQL, the query language for many linked data systems. If you squint just right at that query, it looks something like a SPARQL WHERE clause. It's a simple query, but a query nevertheless.

This worked well when we had access to the raw data in our own triple store. However, if we need to query a remote SPARQL endpoint directly, it's more complicated.

For this recipe, we'll query DBPedia (http://dbpedia.org) for information about the United Arab Emirates' currency, the dirham. DBPedia extracts structured information from Wikipedia (the summary boxes) and republishes it as RDF. Just as Wikipedia is a useful first stop for humans to get information about something, DBPedia is a good starting point for computer programs gathering data about a domain.

Getting ready

First, we need to make sure the dependencies are listed in our project.clj file:

  :dependencies [[org.clojure/clojure "1.4.0"]
                 [incanter/incanter-core "1.4.1"]
                 [edu.ucdenver.ccp/kr-sesame-core "1.4.5"]
                 [org.clojure/tools.logging "0.2.4"]
                 [org.slf4j/slf4j-simple "1.7.2"]]

Then, load the Clojure and Java libraries that we'll use.

(require '(clojure.java [io :as io]))
(require '(clojure [xml :as xml] 
                   [pprint :as pp]
                   [zip :as zip]))
(use 'incanter.core
     '[clojure.set :only (rename-keys)]
     'edu.ucdenver.ccp.kr.kb
     'edu.ucdenver.ccp.kr.rdf
     'edu.ucdenver.ccp.kr.sparql
     'edu.ucdenver.ccp.kr.sesame.kb)
(import [java.io File]
        [java.net URL URLEncoder])

How to do it…

As we work through this, we'll define a series of functions. Finally, we'll create one function, load-data, to orchestrate everything, and we'll finish by calling it.

First, we have to create a Sesame triple store and initialize it with the namespaces that we'll use. For both of these steps, we'll use the kb-memstore and init-kb functions that we discussed in the previous recipe.

Next, we define a function that takes a URI for a subject in the triple store and constructs a SPARQL query that returns at most 200 statements about that subject. The query filters out any statements whose objects are non-English strings, but it lets everything else through:
```
(defn make-query
  "This creates a query that returns all the 
  triples related to a subject URI. It does 
  filter out non-English strings."
  ([subject kb]
   (binding [*kb* kb
             *select-limit* 200]
     (sparql-select-query
       (list `(~subject ?/p ?/o)
             '(:or (:not (:isLiteral ?/o))
                   (!= (:datatype ?/o) rdf/langString)
                   (= (:lang ?/o) ["en"])))))))
```

Now that we have the query, we'll need to encode it into a URL to retrieve the results:

(defn make-query-uri
  "This constructs a URI for the query."
  ([base-uri query]
   (URL. (str base-uri
              "?format=" 
              (URLEncoder/encode "text/xml" "UTF-8")
              "&query=" (URLEncoder/encode query "UTF-8")))))

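To make the encoding step concrete, here is a small sketch of what URLEncoder does to the two query parameters. The query string is a made-up stand-in rather than real output from make-query, and encode-param is a hypothetical helper introduced only for this illustration. It uses the two-argument form of encode with an explicit charset.

```clojure
(import '[java.net URLEncoder])

;; Hypothetical stand-in for a generated query; the real string
;; comes from make-query.
(def sample-query "SELECT ?p ?o WHERE { ?s ?p ?o }")

(defn encode-param
  "Hypothetical helper: URL-encodes one query-string value as UTF-8."
  [s]
  (URLEncoder/encode s "UTF-8"))

(encode-param "text/xml")
;; => "text%2Fxml"

(encode-param sample-query)
;; => "SELECT+%3Fp+%3Fo+WHERE+%7B+%3Fs+%3Fp+%3Fo+%7D"

;; Assembled the same way make-query-uri does, but left as a string.
(str "http://dbpedia.org/sparql"
     "?format=" (encode-param "text/xml")
     "&query=" (encode-param sample-query))
```

Spaces become plus signs and the SPARQL punctuation is percent-escaped, which is why the query can't simply be concatenated onto the endpoint URL.
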
Once we get a result, we'll parse the XML file, wrap it in a zipper, and navigate to the first result. All this will be in a function that we'll write in a minute. Right now, the next function will take that first result node and return a list of all of the results:
```
(defn result-seq
  "This takes the first result and returns a sequence 
  of this node, plus all the nodes to the right of it."
  ([first-result]
   (cons (zip/node first-result)
         (zip/rights first-result))))
```
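
To try result-seq without touching the network, we can feed it a hand-written document. This sketch assumes the response follows the usual SPARQL XML results layout (a head element followed by a results element); the URIs and literals in sample-xml are invented for illustration.

```clojure
(require '[clojure.xml :as xml]
         '[clojure.zip :as zip])
(import '[java.io ByteArrayInputStream])

;; A hand-written response: <head>, then <results>.
;; The values are illustrative only.
(def sample-xml
  (str "<sparql><head><variable name=\"p\"/><variable name=\"o\"/></head>"
       "<results>"
       "<result><binding name=\"p\"><uri>http://example.org/p1</uri></binding>"
       "<binding name=\"o\"><literal>one</literal></binding></result>"
       "<result><binding name=\"p\"><uri>http://example.org/p2</uri></binding>"
       "<binding name=\"o\"><literal>two</literal></binding></result>"
       "</results></sparql>"))

(defn result-seq
  "Same definition as in the recipe, repeated so this runs alone."
  [first-result]
  (cons (zip/node first-result)
        (zip/rights first-result)))

(def sample-results
  (-> (ByteArrayInputStream. (.getBytes sample-xml "UTF-8"))
      xml/parse
      zip/xml-zip
      zip/down    ; <head>
      zip/right   ; <results>
      zip/down    ; first <result>
      result-seq))

(count sample-results)
;; => 2
(map :tag sample-results)
;; => (:result :result)
```

Note that result-seq returns plain XML nodes, not zipper locations, so the functions that follow can work on them with ordinary map access.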

The following set of functions takes each result node and returns a key-value pair (result-to-kv). It uses binding-str to pull the values out of the XML. Then the accum-hash function pushes those key-value pairs into a map. Each key maps to a vector of values, so keys that occur more than once accumulate all of their values in that vector.

(defn binding-str
  "This takes a binding, pulls out the first tag's 
  content, and concatenates it into a string."
  ([b]
   (apply str (:content (first (:content b))))))

(defn result-to-kv
  "This takes a result node and creates a key-value 
  vector pair from it."
  ([r]
   (let [[p o] (:content r)]
     [(binding-str p) (binding-str o)])))

(defn accum-hash
  "This takes a map and key-value vector pair and adds 
  the pair to the map. Values are always stored in a 
  vector; if the key is already in the map, the new 
  value is added to its existing vector."
  ([m [k v]]
   (if-let [current (m k)]
     (assoc m k (conj current v))
     (assoc m k [v]))))
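
As a quick check of how these three helpers fit together, the sketch below runs them over hand-built nodes shaped like clojure.xml's output. fake-result and its example.org URIs are hypothetical and exist only for this illustration.

```clojure
;; The three helpers from the recipe, repeated so this runs alone.
(defn binding-str [b]
  (apply str (:content (first (:content b)))))

(defn result-to-kv [r]
  (let [[p o] (:content r)]
    [(binding-str p) (binding-str o)]))

(defn accum-hash [m [k v]]
  (if-let [current (m k)]
    (assoc m k (conj current v))
    (assoc m k [v])))

(defn fake-result
  "Hypothetical helper: builds a node shaped like clojure.xml's
  output for one <result> element."
  [p o]
  {:tag :result
   :content [{:tag :binding :attrs {:name "p"}
              :content [{:tag :uri :content [p]}]}
             {:tag :binding :attrs {:name "o"}
              :content [{:tag :literal :content [o]}]}]})

(def fake-results
  [(fake-result "http://example.org/symbol" "Dh")
   (fake-result "http://example.org/label" "dirham")
   (fake-result "http://example.org/symbol" "Dhs")])

(result-to-kv (first fake-results))
;; => ["http://example.org/symbol" "Dh"]

(reduce accum-hash {} (map result-to-kv fake-results))
;; => {"http://example.org/symbol" ["Dh" "Dhs"],
;;     "http://example.org/label" ["dirham"]}
```

The repeated predicate ends up with both of its values in one vector, while the predicate that appears once still gets a one-element vector.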

For the last utility function, we'll define rekey. This will convert the keys of a map based on another map:

(defn rekey
  "This flips the arguments for 
  clojure.set/rename-keys to make it more
  convenient for threading, and it drops any 
  keys that aren't in k-map."
  ([k-map map]
   (rename-keys 
     (select-keys map (keys k-map)) k-map)))
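
A short sketch of rekey in isolation shows both effects: keys listed in the mapping are renamed, and keys that aren't listed are dropped. The keys and values here are invented for illustration.

```clojure
(require '[clojure.set :refer [rename-keys]])

(defn rekey
  "Same behavior as in the recipe, repeated so this runs alone."
  [k-map m]
  (rename-keys (select-keys m (keys k-map)) k-map))

;; Illustrative input only; these keys are made up.
(def renamed
  (rekey {"http://example.org/label" :name
          "http://example.org/symbol" :symbol}
         {"http://example.org/label" ["dirham"]
          "http://example.org/symbol" ["Dh"]
          "http://example.org/other" ["ignored"]}))

renamed
;; => {:name ["dirham"], :symbol ["Dh"]}
```

Taking the key mapping first is what lets rekey sit in a ->> pipeline, where the data arrives as the last argument.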

Now, let's add a function that takes a SPARQL endpoint and a subject, and returns a sequence of result nodes. This will use several of the functions we've just defined.

(defn query-sparql-results
  "This queries a SPARQL endpoint and returns a 
  sequence of result nodes."
  ([sparql-uri subject kb]
   (->>
     kb
     ;; Build the URI query string.
     (make-query subject)
     (make-query-uri sparql-uri)
     ;; Get the results, parse the XML,
     ;; and return the zipper.
     io/input-stream
     xml/parse
     zip/xml-zip
     ;; Find the first child.
     zip/down
     zip/right
     zip/down
     ;; Convert all children into a sequence.
     result-seq)))

Finally, we can pull everything together. Here's load-data:

(defn load-data
  "This loads the data about a currency for the 
  given URI."
  [sparql-uri subject col-map]
  (->>
    ;; Initialize the triple store.
    (kb-memstore)
    init-kb
    ;; Get the results.
    (query-sparql-results sparql-uri subject)
    ;; Generate a mapping.
    (map result-to-kv)
    (reduce accum-hash {})
    ;; Translate the keys in the map.
    (rekey col-map)
    ;; And create a dataset.
    to-dataset))

Now let's use it. We can define a set of variables to make it easier to reference the namespaces that we'll use. We'll use them to create a mapping to column names:

(def rdfs "http://www.w3.org/2000/01/rdf-schema#")
(def dbpedia "http://dbpedia.org/resource/")
(def dbpedia-ont "http://dbpedia.org/ontology/")
(def dbpedia-prop "http://dbpedia.org/property/")

(def col-map {(str rdfs 'label) :name,
              (str dbpedia-prop 'usingCountries) :country
              (str dbpedia-prop 'peggedWith) :pegged-with
              (str dbpedia-prop 'symbol) :symbol
              (str dbpedia-prop 'usedBanknotes) :used-banknotes
              (str dbpedia-prop 'usedCoins) :used-coins
              (str dbpedia-prop 'inflationRate) :inflation})

We call load-data with the DBPedia SPARQL endpoint, the resource we want information about (as a symbol), and the column map:

user=> (load-data "http://dbpedia.org/sparql" 
  #_=>   (symbol (str dbpedia "/United_Arab_Emirates_dirham")) 
  #_=>   col-map)
[:used-coins :symbol :pegged-with :country :inflation :name :used-banknotes]
["2550" "إ.د" "U.S. dollar = 3.6725 dirhams" "United Arab Emirates" "14" "United Arab Emirates dirham" "9223372036854775807"]

How it works…

The only part of this recipe that really has to do with SPARQL is the make-query function. It uses sparql-select-query to generate a SPARQL query string from the query pattern. This pattern has to be interpreted in the context of a triple store that has the namespaces defined, and that context is set with the binding form. We can see how this function works by calling it from the REPL by itself:

user=> (println 
  #_=>   (make-query 
  #_=>     (symbol (str dbpedia "/United_Arab_Emirates_dirham"))
  #_=>     (init-kb (kb-memstore))))
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?p ?o
WHERE {  <http://dbpedia.org/resource/United_Arab_Emirates_dirham> ?p   ?o .
 FILTER (  ( ! isLiteral(?o)
 ||  (  datatype(?o)  !=        <http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> )
 ||  (  lang(?o)  = "en" )  )
 )
} LIMIT 200
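
The role of binding here can be sketched with a plain dynamic var. In this illustration, *limit* and describe-limit are hypothetical stand-ins for the library's *select-limit* and for a function that reads it; they are not part of the kr API.

```clojure
;; Hypothetical stand-in for a library-defined dynamic var.
(def ^:dynamic *limit* 10)

(defn describe-limit
  "Reads the dynamic var at call time, so the caller controls
  its value without passing it as an argument."
  []
  (str "LIMIT " *limit*))

(describe-limit)
;; => "LIMIT 10"

(binding [*limit* 200]
  (describe-limit))
;; => "LIMIT 200"

;; Outside the binding form, the root value is back in effect.
(describe-limit)
;; => "LIMIT 10"
```

This is why make-query wraps the call in binding: the new values apply only for the duration of that call, on the current thread.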

The rest of the recipe is concerned with parsing the XML format of the results, and in many ways it's similar to the last recipe.
