Aggregating data from different formats


Being able to aggregate data from many linked data sources is nice, but most data isn't already formatted for the semantic web. Fortunately, linked data's flexible and dynamic data model facilitates integrating data from multiple sources.

For this recipe, we'll combine several previous ones. We'll load currency data from RDF, as we did in the Reading RDF data recipe, and we'll scrape exchange rate data from X-Rates (http://www.x-rates.com) to get information out of a table, just as we did in the Scraping data from tables in web pages recipe. Finally, we'll dump everything into a triple store and pull it back out, as we did in the last recipe.

Getting ready

First, make sure your project.clj file has the right dependencies:

  :dependencies [[org.clojure/clojure "1.4.0"]
                 [incanter/incanter-core "1.4.1"]
                 [enlive "1.0.1"]
                 [edu.ucdenver.ccp/kr-sesame-core "1.4.5"]
                 [org.clojure/tools.logging "0.2.4"]
                 [org.slf4j/slf4j-simple "1.7.2"]
                 [clj-time "0.4.4"]]

And we need to declare that we'll use these libraries in our script or REPL:

(require '(clojure.java [io :as io]))
(require '(clojure [xml :as xml]
                   [string :as string]
                   [zip :as zip]))
(require '(net.cgrand [enlive-html :as html]))
(use 'incanter.core
     'clj-time.coerce
     '[clj-time.format :only (formatter formatters parse unparse)]
     'edu.ucdenver.ccp.kr.kb
     'edu.ucdenver.ccp.kr.rdf
     'edu.ucdenver.ccp.kr.sparql
     'edu.ucdenver.ccp.kr.sesame.kb)

(import [java.io File]
        [java.net URL URLEncoder])

Finally, make sure that you have the data/currencies.ttl file, which we've been using since the Reading RDF data recipe.

How to do it…

Since this is a longer recipe, we'll build it up in segments. At the end, we'll tie everything together.

Creating the triple store

To begin with, we'll create the triple store. This has become pretty standard. In fact, we'll use the same version of kb-memstore and init-kb that we've been using from the Reading RDF data recipe.
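
For reference, here's a minimal sketch of those two helpers. The namespace prefixes registered below are assumptions inferred from the symbols this recipe uses (rdf, xsd, money, currency, and err); if your definitions from the Reading RDF data recipe differ, keep those instead:

(defn kb-memstore
  "This creates a Sesame triple store in memory."
  ([] (kb :sesame-mem)))

(defn init-kb
  "This registers the namespace prefixes that this recipe uses.
  NOTE: these prefix URIs are assumptions; use the ones from the
  Reading RDF data recipe if they differ."
  ([kb-store]
   (register-namespaces
     kb-store
     '(("rdf" "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
       ("xsd" "http://www.w3.org/2001/XMLSchema#")
       ("money" "http://telegraphis.net/ontology/money/money#")
       ("currency" "http://telegraphis.net/data/currency/")
       ("err" "http://ericrochester.com/")))))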

Scraping exchange rates

  1. This is where things get interesting. We'll start by pulling out the timestamp: the first function finds it, and the second normalizes it into a standard format:

    (defn find-time-stamp
      ([module-content]
       (second
         (map html/text
              (html/select module-content
                           [:span.ratesTimestamp])))))
    
    (def time-stamp-format
         (formatter "MMM dd, yyyy HH:mm 'UTC'"))
    
    (defn normalize-date
      ([date-time]
       (unparse (formatters :date-time)
                (parse time-stamp-format date-time))))
  2. We'll drill down to get the countries and their exchange rates:

    (defn find-data
      ([module-content]
       (html/select module-content
                    [:table.tablesorter.ratesTable
                     :tbody :tr])))
    
    (defn td->code
      ([td]
       (let [code (-> td
                      (html/select [:a])
                      first
                      :attrs
                      :href
                      (string/split #"=")
                      last)]
         (symbol "currency" (str code "#" code)))))
    
    (defn get-td-a
      ([td]
       (->> td
         :content
         (mapcat :content)
         string/join
         read-string)))
    
    (defn get-data
      ([row]
       (let [[td-header td-to td-from]
             (filter map? (:content row))]
         {:currency (td->code td-to)
          :exchange-to (get-td-a td-to)
          :exchange-from (get-td-a td-from)})))
  3. This function takes the data extracted from the HTML page and generates a list of RDF triples:

    (defn data->statements
      ([time-stamp data]
       (let [{:keys [currency exchange-to]} data]
         (list [currency 'err/exchangeRate exchange-to]
               [currency 'err/exchangeWith 
                'currency/USD#USD]
               [currency 'err/exchangeRateDate
                [time-stamp 'xsd/dateTime]]))))
  4. And this function ties those two groups of functions together by pulling the data out of the web page, converting it to triples, and adding them to the database (the REPL sketch after this list traces the intermediate values):

    (defn load-exchange-data
      "This downloads the HTML page and pulls the data out 
      of it."
      [kb html-url]
      (let [html (html/html-resource html-url)
            div (html/select html [:div.moduleContent])
            time-stamp (normalize-date
                         (find-time-stamp div))]
        (add-statements
          kb
          (mapcat (partial data->statements time-stamp)
                  (map get-data (find-data div))))))
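
To make the data flow concrete, here's a hypothetical REPL session that traces the intermediate values. The numbers are illustrative only; they'll depend on the rates at the moment you scrape the page:

user=> (def div (html/select
  #_=>           (html/html-resource
  #_=>             (URL. "http://www.x-rates.com/table/?from=USD&amount=1.00"))
  #_=>           [:div.moduleContent]))
#'user/div
user=> (normalize-date (find-time-stamp div))
"2012-10-03T10:35:00.000Z"
user=> (def row-data (get-data (first (find-data div))))
#'user/row-data
user=> row-data
{:currency currency/AED#AED, :exchange-to 3.672981, :exchange-from 0.272264}
user=> (data->statements "2012-10-03T10:35:00.000Z" row-data)
([currency/AED#AED err/exchangeRate 3.672981]
 [currency/AED#AED err/exchangeWith currency/USD#USD]
 [currency/AED#AED err/exchangeRateDate
  ["2012-10-03T10:35:00.000Z" xsd/dateTime]])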

That's a mouthful, but now that we can get all the data into a triple store, we just need to pull everything back out and into Incanter.

Loading currency data and tying it all together

Bringing the two data sources together and exporting the result to Incanter is fairly easy at this point:

(defn aggregate-data
  "This controls the process and returns the aggregated data."
  [kb data-file data-url q col-map]
  (load-rdf-file kb (File. data-file))
  (load-exchange-data kb (URL. data-url))
  (to-dataset (map (partial rekey col-map) (query kb q))))
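
Note that aggregate-data relies on the rekey helper from the Reading RDF data with SPARQL recipe, which renames each query result's binding symbols to the dataset's column keywords. If you don't have it loaded, a minimal sketch looks like this (an assumption, not necessarily the original definition):

(require '(clojure [set :as set]))

(defn rekey
  "This renames the keys of a query-result map using col-map,
  keeping only the bindings that col-map mentions."
  ([col-map row]
   (set/rename-keys (select-keys row (keys col-map)) col-map)))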

We'll need to do a lot of the setup we've done before. Here we'll bind the triple store, the query, and the column map to names, so that we can refer to them easily:

(def t-store (init-kb (kb-memstore)))

(def q
  '((?/c rdf/type money/Currency)
    (?/c money/name ?/name)
    (?/c money/shortName ?/shortName)
    (?/c money/isoAlpha ?/iso)
    (?/c money/minorName ?/minorName)
    (?/c money/minorExponent ?/minorExponent)
    (:optional
      ((?/c err/exchangeRate ?/exchangeRate)
       (?/c err/exchangeWith ?/exchangeWith)
       (?/c err/exchangeRateDate ?/exchangeRateDate)))))
(def col-map {'?/name :fullname
              '?/iso :iso
              '?/shortName :name
              '?/minorName :minor-name
              '?/minorExponent :minor-exp
              '?/exchangeRate :exchange-rate
              '?/exchangeWith :exchange-with
              '?/exchangeRateDate :exchange-date})

The specific URL that we're going to scrape is http://www.x-rates.com/table/?from=USD&amount=1.00. Let's go ahead and put everything together:

user=> (aggregate-data t-store "data/currencies.ttl" 
  #_=>   "http://www.x-rates.com/table/?from=USD&amount=1.00"
  #_=>   q col-map)
[:exchange-date :name :exchange-with :minor-exp :iso :exchange-rate :minor-name :fullname]
[#<XMLGregorianCalendarImpl 2012-10-03T10:35:00.000Z> "dirham" currency/USD#USD "2" "AED" 3.672981 "fils" "United Arab Emirates dirham"]
[nil "afghani" nil "2" "AFN" nil "pul" "Afghan afghani"]
[nil "lek" nil "2" "ALL" nil "qindarkë" "Albanian lek"]
[nil "dram" nil "0" "AMD" nil "luma" "Armenian dram"]
…

As you can see, some of the currencies from currencies.ttl don't have exchange data (the rows that start with nil). We can look in other sources for that data, or decide that those currencies don't matter for our project.
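
If we take the second route, Incanter's $where can drop the incomplete rows. Here's a quick sketch, with the dataset returned by aggregate-data bound to the hypothetical name ds:

(def ds
  (aggregate-data t-store "data/currencies.ttl"
                  "http://www.x-rates.com/table/?from=USD&amount=1.00"
                  q col-map))

;; Keep only the rows that actually have an exchange rate.
($where {:exchange-rate {:$ne nil}} ds)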

How it works…

A lot of this is just a slightly more complicated version of what we've seen before, pulled together into one recipe. The complicated part is scraping the web page, and that's driven by the structure of the page itself.

After looking at the page's source and playing with it in the REPL, the structure became clear. First, we needed to pull the timestamp off the top of the table that lists the exchange rates. Then we walked over the table and pulled the data from each row. Both data tables (the short one and the long one) are in a div tag with the class moduleContent, so everything began there.

Next, we drilled down from the module content into the rows of the rates table. Inside each row, we pulled out the currency code and returned it as a symbol in the currency namespace. We also drilled down to the exchange rates and returned them as floats. Then we put everything into a map and converted that to triple vectors, which we added to the triple store.
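
If you'd like to test that cell-level drilling without hitting the live site, you can run td->code and get-td-a against a small inline fragment. The markup below is an assumption shaped like the x-rates table cells, not a copy of the real page:

(def sample-td
  (first
    (html/select
      (html/html-snippet
        (str "<table><tr><td>"
             "<a href=\"/graph/?from=USD&amp;to=AED\">3.672981</a>"
             "</td></tr></table>"))
      [:td])))

(td->code sample-td)  ; => currency/AED#AED
(get-td-a sample-td)  ; => 3.672981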

If you have questions about how we pulled in the main currency data and worked with the triple store, refer to the Reading RDF data recipe.

If you have questions about how we scraped the data from the web page, refer to the Scraping data from tables in web pages recipe.

If you have questions about the SPARQL query, refer to the Reading RDF data with SPARQL recipe.