Clojure Data Analysis Cookbook

By: Eric Rochester

Overview of this book

Data is everywhere and it's increasingly important to be able to gain insights that we can act on. Using Clojure for data analysis and collection, this book will show you how to gain fresh insights and perspectives from your data with an essential collection of practical, structured recipes.

"The Clojure Data Analysis Cookbook" presents recipes for every stage of the data analysis process. Whether scraping data off a web page, performing data mining, or creating graphs for the web, this book has something for the task at hand.

You'll learn how to acquire data, clean it up, and transform it into useful graphs which can then be analyzed and published to the Internet. Coverage includes advanced topics like processing data concurrently, applying powerful statistical techniques like Bayesian modelling, and even data mining algorithms such as K-means clustering, neural networks, and association rules.

Reading XML data into Incanter datasets


One of the most popular formats for data is XML. Some people love it and some hate it, but almost everyone has to deal with it at some point. Clojure can use Java's XML libraries, but it also has its own package, clojure.xml, which provides a more natural way of working with XML in Clojure.

Getting ready

First, include these dependencies in our Leiningen project.clj file:

  :dependencies [[org.clojure/clojure "1.4.0"]
                 [incanter/incanter-core "1.4.1"]]

Use these libraries in our REPL interpreter or program:

(use 'incanter.core
     'clojure.xml
     '[clojure.zip :exclude [next replace remove]])

We also need a data file. I have a file named data/small-sample.xml that looks like the following:

<?xml version="1.0" encoding="utf-8"?>
<data>
  <person>
    <given-name>Gomez</given-name>
    <surname>Addams</surname>
    <relation>father</relation>
  </person>
  …

You can download this data file from http://www.ericrochester.com/clj-data-analysis/data/small-sample.xml.

How to do it…

  1. The solution for this recipe is a little more complicated, so we'll wrap it into a function:

    (defn load-xml-data [xml-file first-data next-data]
      (let [data-map (fn [node]
                       [(:tag node) (first (:content node))])]
        (->>
          ;; 1. Parse the XML data file;
          (parse xml-file)
          xml-zip
          ;; 2. Walk it to extract the data nodes;
          first-data
          (iterate next-data)
          (take-while #(not (nil? %)))
          (map children)
          ;; 3. Convert them into a sequence of maps; and
          (map #(mapcat data-map %))
          (map #(apply array-map %))
          ;; 4. Finally convert that into an Incanter dataset
          to-dataset)))

  2. We call it in the following manner:

    user=> (load-xml-data "data/small-sample.xml" down right)
    [:given-name :surname :relation]
    ["Gomez" "Addams" "father"]
    ["Morticia" "Addams" "mother"]
    ["Pugsley" "Addams" "brother"]
    …

How it works…

This recipe follows a typical pipeline for working with XML:

  1. It parses an XML data file.

  2. It walks it to extract the data nodes.

  3. It converts them into a sequence of maps representing the data.

  4. And finally, it converts that into an Incanter dataset.

load-xml-data implements this process. It takes three parameters: the input file name, a function that takes the root node of the parsed XML and returns the first data node, and a function that takes a data node and returns the next data node, or nil if there are no more nodes.

First, the function parses the XML file and wraps it in a zipper (we'll look at zippers in more detail later in this recipe). Then it uses the two functions passed in to extract all the data nodes as a sequence. For each data node, it gets its child nodes and converts them into a series of tag-name/content pairs. The pairs for each data node are converted into a map, and the sequence of maps is converted into an Incanter dataset.
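To make the intermediate steps concrete, here is a minimal sketch that runs the same pipeline but stops before to-dataset, so it needs only Clojure itself. The xml->rows name and the in-memory sample-xml string are assumptions made for illustration; they are not part of the recipe.

```clojure
(require '[clojure.xml :as xml]
         '[clojure.zip :as zip])
(import 'java.io.ByteArrayInputStream)

;; Hypothetical in-memory sample, shaped like data/small-sample.xml.
(def sample-xml
  (str "<data>"
       "<person><given-name>Gomez</given-name><surname>Addams</surname>"
       "<relation>father</relation></person>"
       "<person><given-name>Morticia</given-name><surname>Addams</surname>"
       "<relation>mother</relation></person>"
       "</data>"))

(defn xml->rows
  "Runs the same steps as load-xml-data, minus the final to-dataset call.
  Returns a sequence of maps, one per data node."
  [input first-data next-data]
  (let [data-map (fn [node] [(:tag node) (first (:content node))])]
    (->> (xml/parse input)                 ; 1. parse into nested element maps
         zip/xml-zip                       ;    wrap the root in a zipper
         first-data                        ; 2. move to the first data node
         (iterate next-data)               ;    keep moving to the next one
         (take-while (complement nil?))    ;    stop when navigation returns nil
         (map zip/children)                ;    child elements of each data node
         (map #(mapcat data-map %))        ; 3. flat tag/content pairs
         (map #(apply array-map %)))))     ;    one map per data node

(def rows
  (xml->rows (ByteArrayInputStream. (.getBytes sample-xml "UTF-8"))
             zip/down
             zip/right))

(doseq [row rows]
  (println row))
```

Passing zip/down and zip/right works here because the data nodes are the direct children of the root element. A document with a different layout would need different navigation functions.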

There's more…

We used a couple of interesting data structures or constructs in this recipe. Both are common in functional programming or Lisp, but neither has made its way into more mainstream programming. We should spend a minute with them.

Navigating structures with zippers

The first thing that happens to the parsed XML file is that it gets passed to clojure.zip/xml-zip. This takes Clojure's native XML data structure and turns it into something that can be navigated quickly using functions such as clojure.zip/down and clojure.zip/right. Being a functional programming language, Clojure prefers immutable data structures, and zippers provide an efficient, natural way to navigate and modify a tree-like structure, such as an XML document.

Zippers are very useful and interesting, and understanding them can help you understand how to work with immutable data structures. For more information on zippers, the Clojure-doc page on parsing XML with zippers is helpful (http://clojure-doc.org/articles/tutorials/parsing_xml_with_zippers.html). But if you would rather dive into the deep end, see Gérard Huet's paper, The Zipper (http://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced-fp/docs/huet-zipper.pdf).
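As a small illustration of navigating and "modifying" an immutable tree, the following sketch uses clojure.zip/vector-zip on a nested vector instead of XML. The tree value and the var names are made up for this example.

```clojure
(require '[clojure.zip :as zip])

;; A hypothetical tree: a vector containing numbers and a nested vector.
(def tree [1 [2 3] 4])
(def z (zip/vector-zip tree))

;; Navigate: down to 1, right to [2 3], down to 2.
(def at-two (-> z zip/down zip/right zip/down))
(zip/node at-two)
;; => 2

;; "Modify" the focused node, then rebuild the whole tree with zip/root.
(def edited (-> at-two (zip/edit inc) zip/root))
edited
;; => [1 [3 3] 4]

;; The original tree is untouched; the edit produced a new value.
tree
;; => [1 [2 3] 4]
```

Each navigation step returns a new location rather than mutating anything, which is why the original tree is still available after the edit.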

Processing in a pipeline

Also, we've used the ->> macro to express our process as a pipeline. Deeply nested function calls have to be read from the inside out. This macro lets us write and read them from left to right (or top to bottom), in the order in which the steps are applied, and this makes the process's data flow and series of transformations much clearer.

We can do this in Clojure because of its macro system. ->> simply rewrites the calls into Clojure's native, nested format when the form is macroexpanded, before it is evaluated. The first parameter to the macro is inserted into the next expression as its last parameter. That structure is then inserted into the third expression as its last parameter, and so on, until the end of the form. Let's trace this through a few steps. Say we start off with the (->> x first (map count) (apply +)) expression. The following is a list of each intermediate step that occurs as Clojure builds the final expression:

  1. (->> x first (map count) (apply +))

  2. (->> (first x) (map count) (apply +))

  3. (->> (map count (first x)) (apply +))

  4. (apply + (map count (first x)))
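We can check this kind of rewriting at the REPL. The sketch below uses clojure.walk/macroexpand-all to expand a threaded form completely, and then evaluates both the threaded and the nested versions on a made-up value. It uses clojure.core's count to measure each item, and the sample value x is an assumption for illustration.

```clojure
(require '[clojure.walk :as walk])

;; Fully expand the threaded form into its nested equivalent.
(def expanded
  (walk/macroexpand-all '(->> x first (map count) (apply +))))
expanded
;; => (apply + (map count (first x)))

;; Hypothetical input: a sequence whose first item is a list of strings.
(def x [["ab" "cde"] ["ignored"]])

;; Both spellings compute the same thing: 2 + 3.
(->> x first (map count) (apply +))
;; => 5
(apply + (map count (first x)))
;; => 5
```

The single-step output of macroexpand-1 can differ between Clojure versions, which is why the sketch expands the form all the way before comparing.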

Comparing XML and JSON

XML and JSON (from the Reading JSON data into Incanter datasets recipe) are very similar in purpose. Arguably, much of the popularity of JSON is driven by disillusionment with XML's verbosity.

When we're dealing with these formats in Clojure, the biggest difference is that JSON is converted directly to native Clojure data structures that mirror the data, such as maps and vectors. XML, meanwhile, is read into generic element maps that reflect the structure of XML, not the structure of the data.

In other words, the keys of the maps for JSON will come from the domain: first_name or age, for instance. However, the keys of the maps for XML will come from the data format (:tag, :attrs, and :content in the case of clojure.xml), and the tag and attribute names, which appear as values, will come from the domain. This extra level of abstraction makes XML more unwieldy.
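The difference is easy to see side by side. In the sketch below, xml-person is what clojure.xml/parse returns for a small in-memory document, while json-person is written by hand to show the shape a JSON parser would typically produce; it is an assumption, not the output of any particular library. The child-text helper is also hypothetical.

```clojure
(require '[clojure.xml :as xml])
(import 'java.io.ByteArrayInputStream)

;; Hypothetical one-record document.
(def person-xml
  "<person><given-name>Gomez</given-name><surname>Addams</surname></person>")

(def xml-person
  (xml/parse (ByteArrayInputStream. (.getBytes person-xml "UTF-8"))))

;; The element's keys come from the format: :tag, :attrs, and :content.
(keys xml-person)

;; Hand-written stand-in for the same record as parsed from JSON.
(def json-person {:given-name "Gomez", :surname "Addams"})

;; With JSON-shaped data, the domain name is the key.
(:given-name json-person)
;; => "Gomez"

(defn child-text
  "Returns the text of the first child of element with the given tag."
  [element tag]
  (->> (:content element)
       (filter #(= tag (:tag %)))
       first
       :content
       first))

;; With XML, we have to search the children for the matching tag.
(child-text xml-person :given-name)
;; => "Gomez"
```

This is the extra layer that data-map in load-xml-data strips away when it turns each child element into a tag-name/content pair.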