More and more data is being published on the Internet as linked data in a variety of formats: microformats, RDFa, and RDF/XML are a few common ones. Linked data adds a lot of flexibility and power, but it also introduces more complexity. Often, to work effectively with linked data, we'll need to set up a triple store of some kind. In this recipe and the next three, we'll use Sesame (http://www.openrdf.org/) and the kr Clojure library (https://github.com/drlivingston/kr).
First, we need to make sure the dependencies are listed in our project.clj file:
```clojure
:dependencies [[org.clojure/clojure "1.4.0"]
               [incanter/incanter-core "1.4.1"]
               [edu.ucdenver.ccp/kr-sesame-core "1.4.5"]
               [org.clojure/tools.logging "0.2.4"]
               [org.slf4j/slf4j-simple "1.7.2"]]
```
And we'll execute this to have these loaded into our script or REPL:
```clojure
(use 'incanter.core
     'edu.ucdenver.ccp.kr.kb
     'edu.ucdenver.ccp.kr.rdf
     'edu.ucdenver.ccp.kr.sparql
     'edu.ucdenver.ccp.kr.sesame.kb
     'clojure.set)
(import [java.io File])
```
For this example, we'll get data from the Telegraphis Linked Data assets. We'll pull down the database of currencies at http://telegraphis.net/data/currencies/currencies.ttl. Just to be safe, I've downloaded that file and saved it as data/currencies.ttl, and we'll access it from there.
The longest part of this process will be defining the data. The libraries we're using do all the heavy lifting.
First, we will create the triple store and register the namespaces that the data uses. We'll bind that triple store to the name `tstore`:

```clojure
(defn kb-memstore
  "This creates a Sesame triple store in memory."
  []
  (kb :sesame-mem))

(def tele-ont "http://telegraphis.net/ontology/")

(defn init-kb
  "This creates an in-memory knowledge base and
  initializes it with a default set of namespaces."
  [kb-store]
  (register-namespaces
    kb-store
    ;; Syntax-quote with unquote (~) so the str calls are evaluated;
    ;; a plain quote would pass the unevaluated lists instead of strings.
    `(("geographis" ~(str tele-ont "geography/geography#"))
      ("code" ~(str tele-ont "measurement/code#"))
      ("money" ~(str tele-ont "money/money#"))
      ("owl" "http://www.w3.org/2002/07/owl#")
      ("rdf" "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
      ("xsd" "http://www.w3.org/2001/XMLSchema#")
      ("currency" "http://telegraphis.net/data/currencies/")
      ("dbpedia" "http://dbpedia.org/resource/")
      ("dbpedia-ont" "http://dbpedia.org/ontology/")
      ("dbpedia-prop" "http://dbpedia.org/property/")
      ("err" "http://ericrochester.com/"))))

(def tstore (init-kb (kb-memstore)))
```
After looking at some more data, we can identify what data we want to pull out and start to formulate a query. We'll use kr's query DSL and bind the query to the name `q`:

```clojure
(def q '((?/c rdf/type money/Currency)
         (?/c money/name ?/full_name)
         (?/c money/shortName ?/name)
         (?/c money/symbol ?/symbol)
         (?/c money/minorName ?/minor_name)
         (?/c money/minorExponent ?/minor_exp)
         (?/c money/isoAlpha ?/iso)
         (?/c money/currencyOf ?/country)))
```
Now we need a function that takes a result map and converts the variable names in the query into column names in the output dataset. The `header-keyword` and `fix-headers` functions will do that:

```clojure
(defn header-keyword
  "This converts a query symbol to a keyword."
  [header-symbol]
  (keyword (.replace (name header-symbol) \_ \-)))

(defn fix-headers
  "This changes all the keys in the map to make them
  valid header keywords."
  [coll]
  (into {}
        (map (fn [[k v]] [(header-keyword k) v]) coll)))
```
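To see the renaming in isolation, here is what these two functions do to a single result map. The sample row is hypothetical (hand-written to match the shape of the maps that kr's `query` returns), but the functions are pure Clojure and run on their own:

```clojure
(defn header-keyword
  "Convert a query-variable symbol (e.g. full_name) to a column keyword."
  [header-symbol]
  (keyword (.replace (name header-symbol) \_ \-)))

(defn fix-headers
  "Rename every key in a result map to a header keyword."
  [coll]
  (into {} (map (fn [[k v]] [(header-keyword k) v]) coll)))

;; A sample row, shaped like the maps kr's query returns:
(fix-headers '{name "dirham", full_name "United Arab Emirates dirham"})
;; => {:name "dirham", :full-name "United Arab Emirates dirham"}
```

Note that the underscores in the query variables (`?/full_name`) become hyphens, which is the conventional style for Clojure keywords.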
As usual, once all the pieces are in place, the function that ties everything together is short:
```clojure
(defn load-data
  [k rdf-file q]
  (load-rdf-file k rdf-file)
  (to-dataset (map fix-headers (query k q))))
```
And using this function is just as simple:
```clojure
user=> (load-data tstore (File. "data/currencies.ttl") q)
[:symbol :country :name :minor-exp :iso :minor-name :full-name]
["إ.د" http://telegraphis.net/data/countries/AE#AE "dirham" "2" "AED"
 "fils" "United Arab Emirates dirham"]
["؋" http://telegraphis.net/data/countries/AF#AF "afghani" "2" "AFN"
 "pul" "Afghan afghani"]
…
```
First, some background: Resource Description Framework (RDF) isn't an XML format, although it's often written using XML (there are other formats as well, such as N3 and Turtle). RDF sees the world as a set of statements. Each statement has at least three parts (a triple): the subject, the predicate, and the object. The subject and the predicate have to be URIs. (URIs are like URLs, only more general; uri:7890 is a valid URI, for instance.) Objects can be a literal or a URI. The URIs form a graph. They link to each other and make statements about each other. This is where the "linked" in linked data comes from.
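A toy model can make the triple idea concrete. Here we represent statements as plain subject-predicate-object vectors; the representation is ours for illustration (kr and Sesame store triples very differently), and the URIs are written in the style of the Telegraphis data rather than copied from it:

```clojure
;; Each statement is a [subject predicate object] vector.
;; Subjects and predicates are URIs; objects are URIs or literals.
(def triples
  [["http://telegraphis.net/data/currencies/AED#AED"
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
    "http://telegraphis.net/ontology/money/money#Currency"]
   ["http://telegraphis.net/data/currencies/AED#AED"
    "http://telegraphis.net/ontology/money/money#shortName"
    "dirham"]])

(defn about
  "Return every statement whose subject is the given URI."
  [subject triples]
  (filter #(= subject (first %)) triples))

(count (about "http://telegraphis.net/data/currencies/AED#AED" triples))
;; => 2
```

Because shared URIs connect statements to each other, a collection of triples like this naturally forms a graph.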
If you want more information about linked data, http://linkeddata.org/guides-and-tutorials has some good recommendations.
Now about our recipe: from a high level, the process we used here is pretty simple:

1. Create the triple store (`kb-memstore` and `init-kb`).
2. Load the data (`load-data`).
3. Query it to pull out only what we want (`q` and `load-data`).
4. Transform it into a format Incanter can ingest easily (`header-keyword` and `fix-headers`).
5. Create the Incanter dataset (`load-data`).
The newest thing here is the query format. kr uses a nice SPARQL-like DSL to express queries. In fact, it's so easy to use that we'll deal with it instead of working with raw RDF. The items starting with `?/` are variables; these will be used as keys for the result maps. The other items look like `rdf-namespace/value`. The namespace is taken from the registered namespaces defined in `init-kb`. These are different from Clojure's namespaces, although they serve a similar function for your data: to partition and provide context.
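The prefix resolution itself is simple to sketch. kr does this internally, so the following is only an illustration of the idea, not the library's actual implementation; the prefix table reuses two of the namespaces registered in `init-kb`:

```clojure
;; A prefix table mapping short names to namespace URIs,
;; like the one register-namespaces builds up.
(def prefixes
  {"money" "http://telegraphis.net/ontology/money/money#"
   "rdf"   "http://www.w3.org/1999/02/22-rdf-syntax-ns#"})

(defn expand
  "Expand a prefixed symbol such as money/Currency to a full URI string."
  [sym]
  (str (get prefixes (namespace sym)) (name sym)))

(expand 'money/Currency)
;; => "http://telegraphis.net/ontology/money/money#Currency"
```

Conveniently, Clojure's reader already splits a symbol like `money/Currency` into a namespace part and a name part, which is why the query DSL can lean on plain symbols.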