Book Image

Clojure Data Analysis Cookbook - Second Edition

By : Eric Richard Rochester
Book Image

Clojure Data Analysis Cookbook - Second Edition

By: Eric Richard Rochester

Overview of this book

Table of Contents (19 chapters)
Clojure Data Analysis Cookbook Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Reading JSON data into Incanter datasets


Another data format that's becoming increasingly popular is JavaScript Object Notation (JSON, http://json.org/). Like CSV, this is a plain text format, so it's easy for programs to work with. It provides more information about the data than CSV does, but at the cost of being more verbose. It also allows the data to be structured in more complicated ways, such as hierarchies or sequences of hierarchies.

Because JSON is a much richer data model than CSV, we might need to transform the data. In that case, we can just pull out the information we're interested in and flatten the nested maps before we pass it to Incanter. In this recipe, however, we'll just work with fairly simple data structures.

Getting ready

First, here are the contents of the Leiningen project.clj file:

(defproject getting-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [org.clojure/data.json "0.2.5"]])

Use these libraries in your REPL or program (inside an ns form):

(require '[incanter.core :as i]
         '[clojure.data.json :as json]
         '[clojure.java.io :as io])
(import '[java.io EOFException])

Moreover, you need some data. For this, I have a file named delicious-rss-214k.json and placed it in the folder named data. It contains a number of top-level JSON objects. For example, the first one starts like this:

{
    "guidislink": false,
    "link": "http://designreviver.com/tips/a-collection-of-wordpress-tutorials-tips-and-themes/",
    "title_detail": {
        "base": "http://feeds.delicious.com/v2/rss/recent?min=1&count=100",
        "value": "A Collection of Wordpress Tutorials, Tips and Themes | Design Reviver",
        "language": null,
        "type": "text/plain"
    },
    "author": "mccarrd4",
…

You can download this data file from Infochimps at http://www.ericrochester.com/clj-data-analysis/data/delicious-rss-214k.json.xz. You'll need to decompress it into the data directory.

How to do it…

Once everything's in place, we'll need a couple of functions to make it easier to handle the multiple JSON objects at the top level of the file:

  1. We'll need a function that attempts to call a function on an instance of java.io.Reader and returns nil if there's an EOFException, in case there's a problem reading the file:

    (defn test-eof [reader f]
      (try
        (f reader)
        (catch EOFException e
          nil)))
  2. Now, we'll build on this to repeatedly parse a JSON document from an instance of java.io.Reader. We do this by repeatedly calling test-eof until eof or until it returns nil, accumulating the returned values as we go:

    (defn read-all-json [reader]
      (loop [accum []]
        (if-let [record (test-eof reader json/read)]
          (recur (conj accum record))
          accum)))
  3. Finally, we'll perform the previously mentioned two steps to read the data from the file:

    (def d (i/to-dataset
             (with-open
               [r (io/reader
                     "data/delicious-rss-214k.json")]
               (read-all-json r))))

This binds d to a new dataset that contains the information read in from the JSON documents.

How it works…

Similar to all Lisp's (List Processing), Clojure is usually read from the inside out and from right to left. Let's break it down. clojure.java.io/reader opens the file for reading. read-all-json parses all of the JSON documents in the file into a sequence. In this case, it returns a vector of the maps. incanter.core/to-dataset takes a sequence of maps and returns an Incanter dataset. This dataset will use the keys in the maps as column names, and it will convert the data values into a matrix. Actually, to-dataset can accept many different data structures. Try doc to-dataset in the REPL (doc shows the documentation string attached to the function), or see the Incanter documentation at http://data-sorcery.org/contents/ for more information.