Aggregating data from different formats


Being able to aggregate data from many linked data sources is useful, but most data isn't already formatted for the Semantic Web. Fortunately, linked data's flexible and dynamic data model facilitates the integration of data from multiple sources.

For this recipe, we'll combine several previous recipes. We'll load currency data from RDF, as we did in the Reading RDF data recipe. We'll also scrape the exchange rate data from X-Rates (http://www.x-rates.com) to get information out of a table, just as we did in the Scraping data from tables in web pages recipe. Finally, we'll dump everything into a triple store and pull it back out, as we did in the last recipe.

Getting ready

First, make sure your Leiningen project.clj file has the right dependencies:

(defproject getting-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [enlive "1.1.5"]
                 [edu.ucdenver.ccp/kr-sesame-core "1.4.17"]
                 [org.clojure/tools.logging "0.3.0"]
                 [org.slf4j/slf4j-simple "1.7.7"]
                 [clj-time "0.7.0"]])

We need to declare that we'll use these libraries in our script or REPL:

(require '(clojure.java [io :as io]))
(require '(clojure [xml :as xml]
                   [string :as string]
                   [zip :as zip]))
(require '(net.cgrand [enlive-html :as html]))
(use 'incanter.core
     'clj-time.coerce
     '[clj-time.format :only (formatter formatters parse unparse)]
     'edu.ucdenver.ccp.kr.kb
     'edu.ucdenver.ccp.kr.rdf
     'edu.ucdenver.ccp.kr.sparql
     'edu.ucdenver.ccp.kr.sesame.kb)

(import [java.io File]
        [java.net URL URLEncoder])

Finally, make sure that you have the data/currencies.ttl file, which we've been using since the Reading RDF data recipe.

How to do it…

Since this is a longer recipe, we'll build it up in segments. At the end, we'll tie everything together.

Creating the triple store

To begin with, we'll create the triple store. This has become pretty standard. In fact, we'll use the same version of kb-memstore and init-kb that we've been using from the Reading RDF data recipe.
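
If you don't have those definitions handy, here is a minimal sketch of them, assuming the kr library's kb and register-namespaces functions. The prefix list is trimmed to the namespaces this recipe actually uses; the URIs follow the Telegraphis ontology behind currencies.ttl, so adjust them if your data differs:

(defn kb-memstore
  "This creates a Sesame triple store in memory."
  []
  (kb :sesame-mem))

(defn init-kb
  "This registers the namespace prefixes that the recipe's
  symbols (rdf/type, money/name, currency/..., err/...)
  resolve against."
  ([kb-store]
   (register-namespaces
     kb-store
     '(("rdf" "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
       ("xsd" "http://www.w3.org/2001/XMLSchema#")
       ("money" "http://telegraphis.net/ontology/money/money#")
       ("currency" "http://telegraphis.net/data/currency/")
       ("err" "http://ericrochester.com/")))))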

Scraping exchange rates

The first data that we'll pull into the triple store is the current exchange rates:

  1. This is where things get interesting. We'll pull out the timestamp. The first function finds it, and the second function normalizes it into a standard format (we'll sanity-check that format just after this list):

    (defn find-time-stamp
      ([module-content]
       (second
         (map html/text
              (html/select module-content
                           [:span.ratesTimestamp])))))
    
    (def time-stamp-format
         (formatter "MMM dd, yyyy HH:mm 'UTC'"))
    
    (defn normalize-date
      ([date-time]
       (unparse (formatters :date-time)
                (parse time-stamp-format date-time))))
  2. We'll drill down to get the countries and their exchange rates:

    (defn find-data
      ([module-content]
       (html/select module-content
                    [:table.tablesorter.ratesTable
                     :tbody :tr])))
    
    (defn td->code
      ([td]
       (let [code (-> td
                      (html/select [:a])
                      first
                      :attrs
                      :href
                      (string/split #"=")
                      last)]
         (symbol "currency" (str code "#" code)))))
    
    (defn get-td-a
      ([td]
       (->> td
         :content
         (mapcat :content)
         string/join
         read-string)))
    
    (defn get-data
      ([row]
       (let [[td-header td-to td-from]
             (filter map? (:content row))]
         {:currency (td->code td-to)
          :exchange-to (get-td-a td-to)
          :exchange-from (get-td-a td-from)})))
  3. This function takes the data extracted from the HTML page and generates a list of RDF triples:

    (defn data->statements
      ([time-stamp data]
       (let [{:keys [currency exchange-to]} data]
         (list [currency 'err/exchangeRate exchange-to]
               [currency 'err/exchangeWith 
                'currency/USD#USD]
               [currency 'err/exchangeRateDate
                [time-stamp 'xsd/dateTime]]))))
  4. This function ties all of the processes that we just defined together by pulling the data out of the web page, converting it to triples, and adding them to the database:

    (defn load-exchange-data
      "This downloads the HTML page and pulls the data out 
      of it."
      [kb html-url]
      (let [html (html/html-resource html-url)
            div (html/select html [:div.moduleContent])
            time-stamp (normalize-date
                         (find-time-stamp div))]
        (add-statements
          kb
          (mapcat (partial data->statements time-stamp)
                  (map get-data (find-data div))))))
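
As a quick sanity check of the date handling, here is what normalize-date produces for a timestamp in the page's format (the input string is a made-up example, not live data):

(normalize-date "Apr 03, 2015 14:15 UTC")
;; => "2015-04-03T14:15:00.000Z"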

That's a mouthful, but now that we can get all of the data into a triple store, we just need to pull everything back out and into Incanter.

Loading currency data and tying it all together

Bringing the two data sources together and exporting them to Incanter is fairly easy at this point:

(defn aggregate-data
  "This controls the process and returns the aggregated data."
  [kb data-file data-url q col-map]
  (load-rdf-file kb (File. data-file))
  (load-exchange-data kb (URL. data-url))
  (to-dataset (map (partial rekey col-map) (query kb q))))
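
The aggregate-data function leans on the rekey helper from the Reading RDF data recipe. If you don't have it loaded, here is a minimal sketch, assuming it just flips the arguments of clojure.set/rename-keys so that it's friendlier to partial application:

(require '(clojure [set :as set]))

(defn rekey
  "This renames the keys of m according to k-map, dropping
  any keys that k-map doesn't mention."
  ([k-map m]
   (set/rename-keys (select-keys m (keys k-map)) k-map)))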

We'll need to do a lot of the setup we've done before. Here, we'll bind the triple store, the query, and the column map to names so that we can refer to them easily:

(def t-store (init-kb (kb-memstore)))

(def q
  '((?/c rdf/type money/Currency)
    (?/c money/name ?/name)
    (?/c money/shortName ?/shortName)
    (?/c money/isoAlpha ?/iso)
    (?/c money/minorName ?/minorName)
    (?/c money/minorExponent ?/minorExponent)
    (:optional
      ((?/c err/exchangeRate ?/exchangeRate)
       (?/c err/exchangeWith ?/exchangeWith)
       (?/c err/exchangeRateDate ?/exchangeRateDate)))))

(def col-map {'?/name :fullname
              '?/iso :iso
              '?/shortName :name
              '?/minorName :minor-name
              '?/minorExponent :minor-exp
              '?/exchangeRate :exchange-rate
              '?/exchangeWith :exchange-with
              '?/exchangeRateDate :exchange-date})

The specific URL that we're going to scrape is http://www.x-rates.com/table/?from=USD&amount=1.00. Let's go ahead and put everything together:

user=> (def d
         (aggregate-data t-store "data/currencies.ttl"
            "http://www.x-rates.com/table/?from=USD&amount=1.00"
            q col-map))
user=> (sel d :rows (range 3)
         :cols [:fullname :name :exchange-rate])

|                   :fullname |  :name | :exchange-rate |
|-----------------------------+--------+----------------|
| United Arab Emirates dirham | dirham |       3.672845 |
| United Arab Emirates dirham | dirham |       3.672845 |
| United Arab Emirates dirham | dirham |       3.672849 |
…

As you can see if you scroll through the results, some of the currencies from currencies.ttl don't have exchange data (the rows with nil in the exchange columns). We can look in other sources for that data, or decide that those currencies don't matter for our project.

How it works…

A lot of this is just a slightly more complicated version of what we've seen before, pulled together into one recipe. The complicated part is scraping the web page, which is driven by the structure of the page itself.

Once we took a look at the source for the page and played with it in the REPL, the page's structure became clear. First, we needed to pull the timestamp off the top of the table that lists the exchange rates. Then, we walked over the table and pulled the data from each row. Both of the data tables (the short and long ones) are in a div element with a moduleContent class, so everything begins there.

Next, we drilled down from the module's content into the rows of the rates table. Inside each row, we pulled out the currency code and returned it as a symbol in the currency namespace. We also drilled down to the exchange rates and returned them as floats. Then, we put everything into a map and converted it to triple vectors, which we added to the triple store.
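
To make those two conversions concrete, here is what they produce for illustrative inputs (the EUR code and the rate value are made up):

;; td->code builds a namespaced symbol from the code at the
;; end of the cell's link, for example from=USD&to=EUR:
(symbol "currency" (str "EUR" "#" "EUR"))
;; => currency/EUR#EUR

;; get-td-a reads the rate's text into a number:
(read-string "0.920101")
;; => 0.920101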

See also

  • For more information on how we pulled in the main currency data and worked with the triple store, see the Reading RDF data recipe.

  • For more information on how we scraped the data from the web page, see the Scraping data from tables in web pages recipe.

  • For more information on the SPARQL query, see the Reading RDF data with SPARQL recipe.