Aggregating data from different formats


Being able to aggregate data from many linked data sources is nice, but most data isn't already formatted for the semantic web. Fortunately, linked data's flexible and dynamic data model facilitates integrating data from multiple sources.

For this recipe, we'll combine several previous ones. We'll load currency data from RDF, as we did in the Reading RDF data recipe, and we'll scrape exchange rate data from X-Rates (http://www.x-rates.com) to get information out of a table, just as we did in the Scraping data from tables in web pages recipe. Finally, we'll dump everything into a triple store and pull it back out, as we did in the last recipe.

Getting ready

First, make sure your project.clj file has the right dependencies:

  :dependencies [[org.clojure/clojure "1.4.0"]
                 [incanter/incanter-core "1.4.1"]
                 [enlive "1.0.1"]
                 [edu.ucdenver.ccp/kr-sesame-core "1.4.5"]
                 [org.clojure/tools.logging "0.2.4"]
                 [org.slf4j/slf4j-simple "1.7.2"]
                 [clj-time "0.4.4"]]

And we need to declare that we'll use these libraries in our script or REPL:

(require '(clojure.java [io :as io]))
(require '(clojure [xml :as xml]
                   [string :as string]
                   [zip :as zip]))
(require '(net.cgrand [enlive-html :as html]))
(use 'incanter.core
     'clj-time.coerce
     '[clj-time.format :only (formatter formatters parse unparse)]
     'edu.ucdenver.ccp.kr.kb
     'edu.ucdenver.ccp.kr.rdf
     'edu.ucdenver.ccp.kr.sparql
     'edu.ucdenver.ccp.kr.sesame.kb)

(import [java.io File]
        [java.net URL URLEncoder])

Finally, make sure that you have the data/currencies.ttl file, which we've been using since the Reading RDF data recipe.

How to do it…

Since this is a longer recipe, we'll build it up in segments. At the end, we'll tie everything together.

Creating the triple store

To begin with, we'll create the triple store. This has become pretty standard. In fact, we'll use the same version of kb-memstore and init-kb that we've been using from the Reading RDF data recipe.
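
For reference, here's a minimal sketch of those two helpers. The namespace prefixes registered below are assumptions inferred from the symbols this recipe uses (rdf, xsd, money, currency, and err); if your definitions from the Reading RDF data recipe differ, keep those instead:

(defn kb-memstore
  "This creates a Sesame triple store in memory."
  ([] (kb :sesame-mem)))

(defn init-kb
  "This registers the namespace prefixes that this recipe uses.
  NOTE: these prefix URIs are assumptions; use the ones from the
  Reading RDF data recipe if they differ."
  ([kb-store]
   (register-namespaces
     kb-store
     '(("rdf" "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
       ("xsd" "http://www.w3.org/2001/XMLSchema#")
       ("money" "http://telegraphis.net/ontology/money/money#")
       ("currency" "http://telegraphis.net/data/currency/")
       ("err" "http://ericrochester.com/")))))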

Scraping exchange rates

  1. This is where things get interesting. We'll start by pulling out the timestamp: the first function finds it, and the second normalizes it into a standard format:

    (defn find-time-stamp
      ([module-content]
       (second
         (map html/text
              (html/select module-content
                           [:span.ratesTimestamp])))))
    
    (def time-stamp-format
         (formatter "MMM dd, yyyy HH:mm 'UTC'"))
    
    (defn normalize-date
      ([date-time]
       (unparse (formatters :date-time)
                (parse time-stamp-format date-time))))
  2. We'll drill down to get the countries and their exchange rates:

    (defn find-data
      ([module-content]
       (html/select module-content
                    [:table.tablesorter.ratesTable
                     :tbody :tr])))
    
    (defn td->code
      ([td]
       (let [code (-> td
                      (html/select [:a])
                      first
                      :attrs
                      :href
                      (string/split #"=")
                      last)]
         (symbol "currency" (str code "#" code)))))
    
    (defn get-td-a
      ([td]
       (->> td
         :content
         (mapcat :content)
         string/join
         read-string)))
    
    (defn get-data
      ([row]
       (let [[td-header td-to td-from]
             (filter map? (:content row))]
         {:currency (td->code td-to)
          :exchange-to (get-td-a td-to)
          :exchange-from (get-td-a td-from)})))
  3. This function takes the data extracted from the HTML page and generates a list of RDF triples:

    (defn data->statements
      ([time-stamp data]
       (let [{:keys [currency exchange-to]} data]
         (list [currency 'err/exchangeRate exchange-to]
               [currency 'err/exchangeWith 
                'currency/USD#USD]
               [currency 'err/exchangeRateDate
                [time-stamp 'xsd/dateTime]]))))
  4. And this function ties those two groups of functions together by pulling the data out of the web page, converting it to triples, and adding them to the database (the REPL sketch after this list traces the intermediate values):

    (defn load-exchange-data
      "This downloads the HTML page and pulls the data out 
      of it."
      [kb html-url]
      (let [html (html/html-resource html-url)
            div (html/select html [:div.moduleContent])
            time-stamp (normalize-date
                         (find-time-stamp div))]
        (add-statements
          kb
          (mapcat (partial data->statements time-stamp)
                  (map get-data (find-data div))))))
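
To make the data flow concrete, here's a hypothetical REPL session that traces the intermediate values. The numbers are illustrative only; they'll depend on the rates at the moment you scrape the page:

user=> (def div (html/select
  #_=>           (html/html-resource
  #_=>             (URL. "http://www.x-rates.com/table/?from=USD&amount=1.00"))
  #_=>           [:div.moduleContent]))
#'user/div
user=> (normalize-date (find-time-stamp div))
"2012-10-03T10:35:00.000Z"
user=> (def row-data (get-data (first (find-data div))))
#'user/row-data
user=> row-data
{:currency currency/AED#AED, :exchange-to 3.672981, :exchange-from 0.272264}
user=> (data->statements "2012-10-03T10:35:00.000Z" row-data)
([currency/AED#AED err/exchangeRate 3.672981]
 [currency/AED#AED err/exchangeWith currency/USD#USD]
 [currency/AED#AED err/exchangeRateDate
  ["2012-10-03T10:35:00.000Z" xsd/dateTime]])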

That's a mouthful, but now that we can get all the data into a triple store, we just need to pull everything back out and into Incanter.

Loading currency data and tying it all together

Bringing the two data sources together and exporting the result to Incanter is fairly easy at this point:

(defn aggregate-data
  "This controls the process and returns the aggregated data."
  [kb data-file data-url q col-map]
  (load-rdf-file kb (File. data-file))
  (load-exchange-data kb (URL. data-url))
  (to-dataset (map (partial rekey col-map) (query kb q))))
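
Note that aggregate-data relies on the rekey helper from the Reading RDF data with SPARQL recipe, which renames each query result's binding symbols to the dataset's column keywords. If you don't have it loaded, a minimal sketch looks like this (an assumption, not necessarily the original definition):

(require '(clojure [set :as set]))

(defn rekey
  "This renames the keys of a query-result map using col-map,
  keeping only the bindings that col-map mentions."
  ([col-map row]
   (set/rename-keys (select-keys row (keys col-map)) col-map)))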

We'll need to do a lot of the setup we've done before. Here we'll bind the triple store, the query, and the column map to names, so that we can refer to them easily:

(def t-store (init-kb (kb-memstore)))

(def q
  '((?/c rdf/type money/Currency)
    (?/c money/name ?/name)
    (?/c money/shortName ?/shortName)
    (?/c money/isoAlpha ?/iso)
    (?/c money/minorName ?/minorName)
    (?/c money/minorExponent ?/minorExponent)
    (:optional
      ((?/c err/exchangeRate ?/exchangeRate)
       (?/c err/exchangeWith ?/exchangeWith)
       (?/c err/exchangeRateDate ?/exchangeRateDate)))))
(def col-map {'?/name :fullname
              '?/iso :iso
              '?/shortName :name
              '?/minorName :minor-name
              '?/minorExponent :minor-exp
              '?/exchangeRate :exchange-rate
              '?/exchangeWith :exchange-with
              '?/exchangeRateDate :exchange-date})

The specific URL that we're going to scrape is http://www.x-rates.com/table/?from=USD&amount=1.00. Let's go ahead and put everything together:

user=> (aggregate-data t-store "data/currencies.ttl" 
  #_=>   "http://www.x-rates.com/table/?from=USD&amount=1.00"
  #_=>   q col-map)
[:exchange-date :name :exchange-with :minor-exp :iso :exchange-rate :minor-name :fullname]
[#<XMLGregorianCalendarImpl 2012-10-03T10:35:00.000Z> "dirham" currency/USD#USD "2" "AED" 3.672981 "fils" "United Arab Emirates dirham"]
[nil "afghani" nil "2" "AFN" nil "pul" "Afghan afghani"]
[nil "lek" nil "2" "ALL" nil "qindarkë" "Albanian lek"]
[nil "dram" nil "0" "AMD" nil "luma" "Armenian dram"]
…

As you can see, some of the currencies from currencies.ttl don't have exchange data (the rows that start with nil). We can look in other sources for that data, or decide that those currencies don't matter for our project.
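
If we take the second route, Incanter's $where can drop the incomplete rows. Here's a quick sketch, with the dataset returned by aggregate-data bound to the hypothetical name ds:

(def ds
  (aggregate-data t-store "data/currencies.ttl"
                  "http://www.x-rates.com/table/?from=USD&amount=1.00"
                  q col-map))

;; Keep only the rows that actually have an exchange rate.
($where {:exchange-rate {:$ne nil}} ds)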

How it works…

A lot of this is just a slightly more complicated version of what we've seen before, pulled together into one recipe. The complicated part is scraping the web page, and that's driven by the structure of the page itself.

After looking at the page's source and playing with it in the REPL, the structure became clear. First, we needed to pull the timestamp off the top of the table that lists the exchange rates. Then we walked over the table and pulled the data from each row. Both data tables (the short one and the long one) are in a div tag with the class moduleContent, so everything began there.

Next, we drilled down from the module content into the rows of the rates table. Inside each row, we pulled out the currency code and returned it as a symbol in the currency namespace. We also drilled down to the exchange rates and returned them as floats. Then we put everything into a map and converted that to triple vectors, which we added to the triple store.
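
If you'd like to test that cell-level drilling without hitting the live site, you can run td->code and get-td-a against a small inline fragment. The markup below is an assumption shaped like the x-rates table cells, not a copy of the real page:

(def sample-td
  (first
    (html/select
      (html/html-snippet
        (str "<table><tr><td>"
             "<a href=\"/graph/?from=USD&amp;to=AED\">3.672981</a>"
             "</td></tr></table>"))
      [:td])))

(td->code sample-td)  ; => currency/AED#AED
(get-td-a sample-td)  ; => 3.672981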

If you have questions about how we pulled in the main currency data and worked with the triple store, refer to the Reading RDF data recipe.

If you have questions about how we scraped the data from the web page, refer to the Scraping data from tables in web pages recipe.

If you have questions about the SPARQL query, refer to the Reading RDF data with SPARQL recipe.