Book Image

Mastering Clojure Data Analysis

By : Eric Richard Rochester
Book Image

Mastering Clojure Data Analysis

By: Eric Richard Rochester

Overview of this book

Table of Contents (17 chapters)
Mastering Clojure Data Analysis
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Implementing the graphs


As the graph data structure is so central to this chapter, we'll take a look at it in more detail before we move on.

There are a number of ways to implement graphs. In this case, we'll use a variation of an adjacency list, which maps each node to a list of its neighbors. We'll store the nodes in a hash map and keep separate hash maps for each node's data. This representation is especially good for sparse graphs, because we only need to store existing links. If the graph is very dense, then representing the set of neighboring nodes as a matrix instead of a hash table will take less memory.

However, before we start looking at the code, let's check out the Leiningen 2 project.clj file. Apart from the Clojure library, this makes use of the Clojure JSON library, the me.raynes file utility library (https://github.com/Raynes/fs), and the Simple Logging Facade for Java library (http://www.slf4j.org/):

(defproject network-six "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
    :url "http://www.eclipse.org/legal/epl-v10.html"}
  :plugins [[lein-cljsbuild "0.3.2"]]
  :dependencies [[org.slf4j/slf4j-simple "1.7.5"]
    [org.clojure/clojure "1.5.1"]
    [org.clojure/data.json "0.2.2"]
    [me.raynes/fs "1.4.4"]
    [org.clojure/clojurescript "0.0-2202"]]
  :cljsbuild {:builds [{:source-paths ["src-cljs"],
    :compiler {:pretty-printer true,
      :output-to "www/js/main.js",
      :optimizations :whitespace}}]})

If you're keeping track, there are several sections related to ClojureScript (https://github.com/clojure/clojurescript) as well. We'll talk about them later in the chapter.

For the first file that we'll work in, open up src/network_six/graph.clj. Use this for the namespace declaration:

(ns network-six.graph
  (:require [clojure.set :as set]
            [clojure.core.reducers :as r]
            [clojure.data.json :as json]
            [clojure.java.io :as io]
            [clojure.set :as set]
            [network-six.util :as u]))

In this namespace, we'll create a Graph record that contains two slots. One is for the map between vertex numbers and sets of neighbors. The second is for the data maps. We'll define an empty graph that we can use anywhere, as follows:

(defrecord Graph
  [neighbors data])
(def empty-graph (Graph. {} {}))

The primary operations that we'll use for this chapter are functions that modify the graph by adding or removing edges or by merging two graphs. The add and delete operations both take an optional flag to treat the edge as bidirectional. In that case, both functions just call themselves with the ends of the edges swapped so that they operate on the edge that goes in the other direction:

(defn update-conj [s x]
  (conj (if (nil? s) #{} s) x))
(defn add
  ([g x y] (add g x y false))
  ([g x y bidirectional?]
   ((if bidirectional? #(add % y x false) identity)
      (update-in g [:neighbors x] #(update-conj % y)))))
(defn delete
  ([g x y] (delete g x y false))
  ([g x y bidirectional?]
   ((if bidirectional? #(delete % y x false) identity)
      (update-in g [:neighbors x] #(disj % y)))))
(defn merge-graphs [a b]
  (Graph. (merge-with set/union (:neighbors a) (:neighbors b))
          (merge (:data a) (:data b))))

The final low-level functions to work with graphs are two functions that are used to set or retrieve data associated with the vertices. Sometimes, it's also useful to be able to store data of the edges, but we won't use that for this implementation. However, we will associate some information with the vertices themselves later on, and when we do that, we'll use these functions.

All of these functions are overloaded. Passed in a graph, a vertex number, and a key, they set or retrieve a value on a hash map that is that vertex's value. Passed in just a graph and a vertex number, they set or retrieve the vertex's value—either the hash map or another value that is there in its place:

(defn get-value
  ([g x] ((:data g) x))
  ([g x k] ((get-value g x) k)))
(defn set-value
  ([g x v] (assoc-in g [:data x] v))
  ([g x k v] (set-value g x (assoc (get-value g x) k v))))
(defn update-value
  ([g x f] (set-value g x (f (get-value g x))))
  ([g x k f] (set-value g x k (f (get-value g x k)))))

We will also want to get the vertices and the edges for the graph. The vertices are the union of the set of all the nodes with outbound edges and the set of nodes with inbound edges. There should be some, or even a lot, of overlap between these two groups. If the graph is bidirectional, then get-edges will return each edge twice—one going from a to b and the other going from b to a:

(defn get-vertices [graph]
  (reduce set/union (set (keys (:neighbors graph)))
          (vals (:neighbors graph))))
(defn get-edges [graph]
  (let [pair-edges (fn [[v neighbors]]
                     (map #(vector v %) neighbors))]
    (mapcat pair-edges (:neighbors graph))))

We'll write some more basic utilities later, but right now, let's take a look at a function that is a slightly higher-level function, but still a fundamental operation on graphs: a breadth-first walk over the graph and a search based on that.

A breadth-first walk traverses the graph by first looking at all the neighbors of the current node. It then looks at the neighbors of those nodes. It continues broadening the search one layer at a time.

This is in opposition to a depth-first walk, which goes deep down one path until there are no outgoing edges to be tried. Then, it backs out to look down other paths.

Which walk is more efficient really depends on the nature of the individual graph and what is being searched for. However, in our case, we're using a breadth-first walk because it ensures that the shortest path between the two nodes will be found first. A depth-first search can't guarantee that.

The backbone of the breadth-first function is a First In, First Out (FIFO) queue. To keep track of the vertices in the paths that we're trying, we use a vector with the index of those vertices. The queue holds all of the active paths. We also keep a set of vertices that we've reached before. This prevents us from getting caught in loops.

We wrap everything in a lazy sequence so that the caller can control how much work is done and what happens to it.

At each step in the loop, the algorithm is pretty standard:

  1. If the queue is empty, then we've exhausted the part of the graph that's accessible from the start node. We're done, and we return null to indicate that we didn't find the node.

  2. Otherwise, we pop a path vector off the queue. The current vertex is the last one.

  3. We get the current vertex's neighbors.

  4. We remove any vertices that we've already considered.

  5. For each neighbor, we append it to the current path vector, creating that many new path vectors. For example, if the current path vector is [0, 171, 4] and the new neighbors are 7, 42 and 532, then we'll create three new vectors: [0, 171, 4, 7], [0, 171, 4, 42], and [0, 171, 4, 532].

  6. We push each of the new path vectors onto the queue.

  7. We add each of the neighbors onto the list of vertices that we've seen.

  8. We output the current path to the lazy sequence.

  9. Finally, we loop back to step one for the rest of the output sequence.

The following code is the implementation of this. Most of it takes place in bf-seq, which sets up the processing in the first clause (two parameters) and constructs the sequence in the second clause (three parameters). The other function, breadth-first, is the public interface to the function:

(defn bf-seq
  ([get-neighbors a]
   (bf-seq
     get-neighbors
     (conj clojure.lang.PersistentQueue/EMPTY [a])
     #{a}))
  ([get-neighbors q seen]
   (lazy-seq
     (when-not (empty? q)
       (let [current (first q)
             nbors (remove seen (get-neighbors (last current)))]
         (cons current
               (bf-seq get-neighbors
                       (into (pop q)
                             (map #(conj current %) nbors))
                       (into seen nbors))))))))
(defn breadth-first [graph a]
  (bf-seq (:neighbors graph) a))

Notice that what makes this a breadth-first search is that we use a FIFO queue. If we used a LIFO (Last In, First Out) queue (a Clojure list works well for this), then this would be a depth-first search. Instead of going broadly and simultaneously trying a number of paths, it would dive deep into the graph along one path and not backtrack to try a new one until it had exhausted the first path.

This is a flexible base on which one can build a number of functionalities. For example, a breadth-first search is now a two-line function:

(defn bfs [graph a b]
  (first (filter #(= (last %) b) (breadth-first graph a))))

These are just filters that find all paths that start from a and end at b and then return the first of those.

Loading the data

Now that we have the fundamental data structure that we're going to use, we can read the data files that we downloaded into a graph.

For the purposes of analyzing the network itself, we're only interested in the *.edges files. This lists the edges in the graph, one edge per line. Each edge is defined by the node numbers that it connects. As Facebook relationships are two-way, the edges represented here are bidirectional. For example, the first few lines of 0.edges are shown as follows:

236 186
122 285
24 346
271 304
176 9

We'll first define a function that reads one edge file into a Graph, and then we'll define another function that walks a directory, reads each edge file, and merges the graphs into one. I'm keeping these in a new namespace, network-six.ego. This is defined in the src/network_six/ego.clj file. It uses the following namespace declaration:

(ns network-six.ego
  (:require [clojure.java.io :as io]
            [clojure.set :as set]
            [clojure.string :as string]
            [clojure.data.json :as json]
            [clojure.core.reducers :as r]
            [network-six.graph :as g]
            [network-six.util :as u]
            [me.raynes.fs :as fs])
  (:import [java.io File]))

Now we'll define the function that reads the *.edges files from a data directory:

(defn read-edge-file [filename]
  (with-open [f (io/reader filename)]
    (->>
      f
      line-seq
      (r/map #(string/split % #"\s+"))
      (r/map #(mapv (fn [x] (Long/parseLong x)) %))
      (r/reduce #(g/add %1 (first %2) (second %2))
                g/empty-graph))))
(defn read-edge-files [ego-dir]
  (r/reduce g/merge-graphs {}
            (r/map read-edge-file
                   (fs/find-files ego-dir #".*\.edges$"))))

We can use these from read-eval-print loop (REPL) to load the data into a graph that we can work with. We can get some basic information about the data at this point, and the following how we'll go about doing that:

User=> (require '[network-six.graph :as g]
                '[network-six.ego :as ego])
user=> (def graph (ego/read-edge-files "facebook/"))
#'user/graph
user=> (count (g/get-vertices graph))
3959
user=> (count (g/get-edges graph))
168486

Now let's dive deeper into the graph and get some other metrics.