Book Image

Mastering Clojure Data Analysis

By : Eric Richard Rochester
Book Image

Mastering Clojure Data Analysis

By: Eric Richard Rochester

Overview of this book

Table of Contents (17 chapters)
Mastering Clojure Data Analysis
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Visualizing the graph


At this point, it would be really useful to visualize this graph. There are a number of different ways to visualize graphs. We'll use the JavaScript library D3 (data-driven documents, http://d3js.org/) to generate several graph visualizations on subgraphs of the Facebook network data, and we'll look at the pros and cons of each. Finally, we'll use a simple pie chart to visualize how much of the graph is affected as we move outward from a node through its degrees of separation.

Setting up ClojureScript

As I just mentioned, D3 is a JavaScript library. JavaScripts are not bad, but this is a book about Clojure. There's an implementation of the Clojure compiler that takes Clojure and generates JavaScript. So, we'll use that to keep our focus on Clojure while we call JavaScript libraries and deploy them on a browser.

Before we can do that, however, we need to set up our system to use ClojureScript. The first thing we'll need to do is to add the configuration to our project.clj file for this project. This is fairly simple. We just need to declare lein-cljsbuild as a plugin for this project and then configure the ClojureScript compiler. Our project.clj file from earlier is shown as follows, with the relevant lines highlighted as follows:

(defproject network-six "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :plugins [[lein-cljsbuild "0.3.2"]]
  :dependencies [[org.slf4j/slf4j-simple "1.7.5"]
                 [org.clojure/clojure "1.5.1"]
                 [org.clojure/data.json "0.2.2"]
                 [me.raynes/fs "1.4.4"]
                 [org.clojure/clojurescript "0.0-2202"]]
  :cljsbuild {:builds
              [{:source-paths ["src-cljs"],
                :compiler {:pretty-printer true,
                :output-to "www/js/main.js",
                :optimizations :whitespace}}]})

The first line adds the lein-cljsbuild plugin to the project. The second block of lines tell Leiningen to watch the src-cljs directory for ClojureScript files. All of these files are then compiled into the www/js/main.js file.

We'll need an HTML file to frame the compiled JavaScript. In the code download, I've included a basic page that's modified from an HTML5 Boilerplate template (http://html5boilerplate.com/). The biggest change is that I've taken out everything that's in the div content.

Also, I added some script tags to load D3 and a D3 plugin for one of the types of graphs that we'll use later. After the tag that loads bootstrap.min.js, I added these:

<script src="http://d3js.org/d3.v3.min.js"></script>
<script src="http://d3js.org/d3.hive.v0.min.js"></script>

Finally, to load the data files asynchronously with AJAX, the www directory will need to be accessible from a web server. There are a number of different options, but if you have Python installed, the easiest option is to probably navigate to the www directory and execute the following command:

$ cd www
$ python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...

Now we're ready to proceed. Let's make some charts!

A force-directed layout

One of the standard chart types to visualize graphs is a force-directed layout. These charts use a dynamic-layout algorithm to generate charts that are more clear and look nice. They're modeled on springs. Each vertex repels all the other vertices, but the edges draw the vertices closer.

To have this graph compiled to JavaScript, we start by creating a file named src-cljs/network-six/force.cljs. We'll have a standard namespace declaration at the top of the file:

(ns network-six.force)

Generally, when we use D3, we first set up part of the graph. Then, we get the data. When the data is returned, we continue setting up the graph. In D3, this generally means selecting one or more elements currently in the tree and then selecting some of their children using selectAll. The elements in this new selection may or may not exist at this point. We join the selectAll elements with the data. From this point, we use the enter method most of the time to enter the data items and the nonexistent elements that we selected earlier. If we're updating the data, assuming that the elements already exist, then the process is slightly different. However, the process that uses the enter method, which I described, is the normal workflow that uses D3.

So, we'll start with a little setup for the graph by creating the color palette. In the graph that we're creating, colors will represent the node's distance from a central node. We'll take some time to understand this, because it illustrates some of the differences between Clojure and ClojureScript, and it shows us how to call JavaScript:

(defn make-color []
  (.. js/d3
    -scale
    category10
    (domain (array 0 1 2 3 4 5 6))))

Let's take this bit by bit so that we can understand it all. I'll list a line and then point out what's interesting about it:

  (.. js/d3

There are a couple of things that we need to notice about this line. First,.. is the standard member access macro that we use for Java's interoperability with the main Clojure implementation. In this case, we're using it to construct a series of access calls against a JavaScript object. In this case, the ClojureScript that the macro expands to would be (.domain (.category10 (.-scale js/d3)) (array 0 1 2 3 4 5 6)).

In this case, that object is the main D3 object. The js/ namespace is available by default. It's just an escape hatch to the main JavaScript scope. In this case, it would be the same as accessing a property on the JavaScript window object. You can use this to access anything from JavaScript without having to declare it. I regularly use it with js/console for debugging, for example:

    -scale

This resolves into the JavaScript d3.scale call. The minus sign before scale just means that the call is a property and not a function that takes no arguments. As Clojure doesn't have properties and everything here would look like a function call, ClojureScript needs some way to know that this should not generate a function call. The dash does that as follows:

    category10

This line, combined with the preceding lines, generates JavaScript that looks like d3.scale.category10(). In this case, the call doesn't have a minus sign before it, so the ClojureScript compiler knows that it should generate a function call in this case:

    (domain (array 0 1 2 3 4 5 6))))

Finally, this makes a call to the scale's domain method with an array that sets the domain to the integers between 0 and 6, inclusive of both. These are the values for the distances that we'll look at. The JavaScript for this would be d3.scale.category10().domain([0, 1, 2, 3, 4, 5, 6]).

This function creates and returns a color object. This object is callable, and when it acts as a function that takes a value and returns a color, this will consistently return the same color whenever it's called with a given value from the domain. For example, this way, the distance 1 will also be associated with the same color in the visualization.

This gives us an introduction to the rules for interoperability in ClojureScript. Before we make the call to get the data file, we'll also create the object that takes care of managing the force-directed layout and the D3 object for the svg element. However, you can check the code download provided on the Packt Publishing website for the functions that create these objects.

Next, we need to access the data. We'll see that in a minute, though. First, we need to define some more functions to work with the data once we have it.For the first function, we need to take the force-layout object and associate the data with it.

The data for all of the visualizations has the same format. Each visualization is a JSON object with three keys. The first one, nodes, is an array of JSON objects, each representing one vertex in the graph. The main property of these objects that we're interested in is the data property. This contains the distance of the current vertex from the origin vertex. Next, the links property is a list of JSON objects that represent the edges of the graph. Each link contains the index of a source vertex and a target vertex. Third, the graph property contains the entire graph using the same data structures as we did in Clojure.

The force-directed layout object expects to work with the data from the nodes and the links properties. We set this up and start the animation with the setup-force-layout function:

(defn setup-force-layout [force-layout graph]
  (.. force-layout
    (nodes (.-nodes graph))
    (links (.-links graph))
    start))

As the animation continues, the force-layout object will assign each node and link the object with one or more coordinates. We'll need to update the circles and paths with those values.

We'll do this with a handler for a tick event that the layout object will emit:

(defn on-tick [link node]
  (fn []
    (.. link
      (attr "x1" #(.. % -source -x))
      (attr "y1" #(.. % -source -y))
      (attr "x2" #(.. % -target -x))
      (attr "y2" #(.. % -target -y)))
    (.. node
      (attr "cx" #(.-x %))
      (attr "cy" #(.-y %)))))

Also, at this stage, we create the circle and path elements that represent the vertices and edges. We won't list these functions here.

Finally, we tie everything together. First, we set up the initial objects, then we ask the server for the data, and finally, we create the HTML/SVG elements that represent the data. This is all tied together with the main function:

(defn ^:export main [json-file]
  (let [width 960, height 600
        color (make-color)
        force-layout (make-force-layout width height)
        svg (make-svg width height)]
    (.json js/d3 json-file
           (fn [err graph]
             (.. graph
               -links
               (forEach #(do (aset %1 "weight" 1.0)
                           (aset %1 "index" %2))))
             (setup-force-layout force-layout graph)
             (let [link (make-links svg graph color)
                   node (make-nodes svg graph color force-layout)]
               (.on force-layout "tick"
                    (on-tick link node)))))))

There are a couple of things that we need to notice about this function, and they're both highlighted in the preceding snippet. The first is that the function name has an :export metadata flag attached to it. This just signals that the ClojureScript compiler should make this function accessible from JavaScript outside this namespace. The second is the call to d3.json. This function takes a URL for a JSON data file and a function to handle the results. We'll see more of this function later.

Before we can use this, we need to call it from the HTML page. After the script tag that loads js/main.js, I added this script tag:

<script>
network_six.force.main('facebook-49.json');
</script>

This loads the data file for vertex number 49. This vertex had a betweenness factor of 0.0015, and it could reach four percent of the larger network within six hops. This is small enough to create a meaningful, comprehensible graphic, as seen in the following figure:

The origin vertex (49) is the blue vertex on the lower-right section, almost the farthest-right node of the graph. All the nodes at each hop away from that node will be of a different color. The origin vertex branches to three orange vertices, which link to some green ones. One of the green vertices is in the middle of the larger cluster on the right.

Some aspects of this graph are very helpful. It makes it relatively easy to trace the nodes as they get farther from the origin. This is even easier when interacting with the node in the browser, because it's easy to grab a node and pull it away from its neighbors.

However, it distorts some other information. The graph that we're working with today is not weighted. Theoretically, the links in the graph should be the same length because all the edges have the same weight. In practice, however, it's impossible to display a graph in two dimensions. Force-directed layouts help you display the graph, but the cost is that it's hard to tell exactly what the line lengths and the several clear clusters of various sizes mean on this graph.

Also, the graphs themselves cannot be compared. If we then pulled out a subgraph around a different vertex and charted it, we wouldn't be able to tell much by comparing the two.

So what other options do we have?

A hive plot

The first option is a hive plot. This is a chart type developed by Martin Krzywinski (http://egweb.bcgsc.ca/). These charts are a little different, and reading them can take some time to get used to, but they pack in more meaningful information than force-directed layout or other similar chart types do.

In hive plots, the nodes are positioned along a number of radial axes, often three. Their positions on the axis and which axis they fall on are often meaningful, although the meanings may change between different charts in different domains.

For this, we'll have vertices with a higher degree (with more edges attached to them) be positioned farther out from the center. Vertices closer in will have fewer edges and fewer neighbors. Again, the color of the lines represent the distance of that node from the central node. In this case, we won't make the selection of the axis meaningful.

To create this plot, we'll open a new file, src-cljs/network-six/hive.cljs. At the top, we'll use this namespace declaration:

(ns network-six.hive)

The axis on which a node falls on is an example of a D3 scale; its color from the force layout plot is another scale. Scales are functions that also have properties attached and are accessible via getter or setter functions. However, primarily, when they are passed a data object and a key function, they know how to assign that data object a position on the scale.

In this case, the make-angle function will be used to assign nodes to an axis:

(defn make-angle []
  (.. js/d3
    -scale
    ordinal
    (domain (.range js/d3 4))
    (rangePoints (array 0 (* 2.0 pi)))))

We'll position the nodes along each axis with the get-radius function. This is another scale that takes a vertex and positions it in a range between 40 and 400 according to the number of edges that are connected to it:

(defn get-radius [nodes]
  (.. js/d3
    -scale
    linear
    (range (array 40 400))
    (domain (array (.min js/d3 nodes #(.-count %))
                   (.max js/d3 nodes #(.-count %))))))

We use these scales, along with a scale for color, to position and style the nodes:

(defn make-circles [svg nodes color angle radius]
  (.. svg
    (selectAll ".node")
    (data nodes)
    (enter)
    (append "circle")
    (attr "stroke" #(color (.-data %)))
    (attr "transform"
          #(str "rotate(" (degrees (angle (mod (.-n %) 3))) \)))
    (attr "cx" #(radius (.-count %)))
    (attr "r" 5)
    (attr "class" #(get-classes %))))

I've highlighted the scales that we use in the preceding code snippet. The circle's stroke property comes from the color, which represents the distance of the vertex from the origin for this graph.

The angle is used to assign the circle to an axis using the circle's transform attribute. This is done more or less at random, based on the vertex's index in the data collection.

Finally, the radius scale positions the circle along the axis. This sets the circle's position on the x axis, which is then rotated using the transform attribute and the angle scale.

Again, everything is brought together in the main function. This sets up the scales, requests the data, and then creates and positions the nodes and edges:

(defn ^:export main [json-file]
  (let [width 750, height 750
        angle (make-angle), color (make-color)
        svg (make-svg width height)]
    (.json js/d3 json-file
           (fn [err data]
             (let [nodes (.-nodes data)
                   radius (get-radius nodes)]
               (make-axes svg angle radius)
               (let [df (get-degreed nodes data)]
                 (make-arcs svg nodes df color angle radius)
                 (make-circles svg nodes color angle radius)))))))

Let's see what this graph looks like:

Again, the color represents the distance of the node from the central node. The distance from the center on each axis is the degree of the node.

It's clear from the predominance of the purple-pink color and the bands that the majority of the vertices are six hops from the origin vertex. From the vertices' position on the axes, we can also see that most nodes have a moderate number of edges attached to them. One has quite a few, but most are much closer to the center.

This graph is denser. Although the force-layout graph may have been problematic, it seemed more intuitive and easier to understand, whether it was meaningful or not. Hive plots are more meaningful, but they also take a bit more work to learn to read and to decipher.

A pie chart

Our needs today are simpler than the complex graph we just created; however, we're primarily interested in how much of the network is covered within six hops from a vertex. Neither of the two graphs that we've looked at so far conveyed that well, although they have presented other information and they're commonly used with graphs. We want to know proportions, and the go-to chart for proportions is the pie chart. Maybe it's a little boring, and it's does not strictly speak of a graph visualization per se, but it's clear, and we know what we're dealing with in it.

Generating a pie chart will look very similar to creating a force-directed layout graph or a hive plot. We'll go through the same steps, overall, even though some of the details will be different.

One of the first differences is the function to create an arc. This is similar to a scale, but its output is used to create the d (path description) attribute of the pie chart's wedges:

(defn make-arc [radius]
  (.. js/d3 -svg arc
    (outerRadius (- radius 10))
    (innerRadius 0)))

The pie layout controls the overall process and design of the chart. In this case, we say that we want no sorting, and we need to use the amount property of the data objects:

(defn make-pie []
  (.. js/d3 -layout pie
    (sort nil)
    (value #(.-amount %))))

The other difference in this chart is that we'll need to preprocess the data before it's ready to be fed to the pie layout. Instead of a list of nodes and links, we'll need to give it categories and counts. To make this easier, we'll create a record type for these frequencies:

(defrecord Freq [degree amount])

Also, we'll need a function that takes the same data as the other charts, counts it by distance from the origin vertex, and creates Freq instances to contain that data:

(defn get-freqs [data]
  (->> data
    .-nodes
    (map #(.-data %))
    frequencies
    (map #(Freq. (first %) (second %)))
    into-array))

Again, we pull all these together in the main function, and we do things in the usual way. First, we set up the graph, then we retrieve the data, and finally, we put the two together to create the graph.

In this case, this should give us an idea of how much of the graph this vertex can easily touch. The graph for vertex 49 is shown as follows. We can see that it really doesn't touch much of the network at all. 3799 vertices, more than 95 percent of the network, aren't within six hops of vertex 49.

However, if we compare this with the pie chart for vertex 1085, which was the vertex with the highest betweenness factor, we see a very different picture. For that vertex, more than 95 percent of the network is reachable within 6 hops.

It's also interesting that most of the vertices are four edges away from the origin. For smaller networks, most vertices are further away. However, in this case, it's almost as if it had started running out of vertices in the network.