Clojure Data Analysis Cookbook

By: Eric Rochester

Overview of this book

Data is everywhere and it's increasingly important to be able to gain insights that we can act on. Using Clojure for data analysis and collection, this book will show you how to gain fresh insights and perspectives from your data with an essential collection of practical, structured recipes. "The Clojure Data Analysis Cookbook" presents recipes for every stage of the data analysis process. Whether scraping data off a web page, performing data mining, or creating graphs for the web, this book has something for the task at hand. You'll learn how to acquire data, clean it up, and transform it into useful graphs which can then be analyzed and published to the Internet. Coverage includes advanced topics like processing data concurrently, applying powerful statistical techniques like Bayesian modelling, and even data mining algorithms such as K-means clustering, neural networks, and association rules.

Reading XML data into Incanter datasets

One of the most popular formats for data is XML. Some people love it and some hate it, but almost everyone has to deal with it at some point. Clojure can use Java's XML libraries, but it also has its own package, clojure.xml, which provides a more natural way of working with XML in Clojure.

Getting ready

First, include these dependencies in our Leiningen project.clj file:

  :dependencies [[org.clojure/clojure "1.4.0"]
                 [incanter/incanter-core "1.4.1"]]

Use these libraries in our REPL interpreter or program:

(use 'incanter.core
     'clojure.xml
     '[clojure.zip :exclude [next replace remove]])

And find a data file. I have a file named data/small-sample.xml that looks like the following:

<?xml version="1.0" encoding="utf-8"?>
<data>
  <person>
    <given-name>Gomez</given-name>
    <surname>Addams</surname>
    <relation>father</relation>
  </person>
  …

You can download this data file from http://www.ericrochester.com/clj-data-analysis/data/small-sample.xml.

How to do it…

The solution for this recipe is a little more complicated, so we'll wrap it into a function:

(defn load-xml-data [xml-file first-data next-data]
  (let [data-map (fn [node]
                   [(:tag node) (first (:content node))])]
    (->>
      ;; 1. Parse the XML data file;
      (parse xml-file)
      xml-zip
      ;; 2. Walk it to extract the data nodes;
      first-data
      (iterate next-data)
      (take-while #(not (nil? %)))
      (map children)
      ;; 3. Convert them into a sequence of maps; and
      (map #(mapcat data-map %))
      (map #(apply array-map %))
      ;; 4. Finally convert that into an Incanter dataset
      to-dataset)))

Which we call in the following manner:

user=> (load-xml-data "data/small-sample.xml" down right)
[:given-name :surname :relation]
["Gomez" "Addams" "father"]
["Morticia" "Addams" "mother"]
["Pugsley" "Addams" "brother"]
…

How it works…

This recipe follows a typical pipeline for working with XML:

1. It parses an XML data file.
2. It walks it to extract the data nodes.
3. It converts them into a sequence of maps representing the data.
4. Finally, it converts that into an Incanter dataset.

load-xml-data implements this process. It takes three parameters: the input file name, a function that takes the root node of the parsed XML and returns the first data node, and a function that takes a data node and returns the next data node, or nil if there are no more nodes.

First, the function parses the XML file and wraps it in a zipper (we'll discuss more about zippers in a later section). Then it uses the two functions passed in to extract all the data nodes as a sequence. For each data node, it gets its child nodes and converts them into a series of tag-name/content pairs. The pairs for each data node are converted into a map, and the sequence of maps is converted into an Incanter dataset.
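The first three steps can be tried without Incanter or a data file. The following is a minimal sketch, assuming a small hand-written XML string shaped like data/small-sample.xml; the names sample-xml, person-locs, node->pairs, and rows are made up for this illustration and are not part of the recipe.

```clojure
(require '[clojure.xml :as xml]
         '[clojure.zip :as zip])
(import 'java.io.ByteArrayInputStream)

;; Hypothetical two-record document, shaped like data/small-sample.xml.
(def sample-xml
  (str "<data>"
       "<person><given-name>Gomez</given-name>"
       "<surname>Addams</surname><relation>father</relation></person>"
       "<person><given-name>Morticia</given-name>"
       "<surname>Addams</surname><relation>mother</relation></person>"
       "</data>"))

;; Steps 1 and 2: parse, wrap in a zipper, and collect the data-node
;; locations by moving down once and then right until we run out.
(def person-locs
  (->> (ByteArrayInputStream. (.getBytes sample-xml "UTF-8"))
       xml/parse
       zip/xml-zip
       zip/down
       (iterate zip/right)
       (take-while (complement nil?))))

;; Step 3: turn one child element into a [tag text] pair.
(defn node->pairs [node]
  [(:tag node) (first (:content node))])

;; One map per data node, keyed by the child tag names.
(def rows
  (map (fn [loc]
         (apply array-map (mapcat node->pairs (zip/children loc))))
       person-locs))
```

Here rows is a sequence of plain maps, one per person element, which is the shape that to-dataset receives in the last step of load-xml-data.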

There's more…

We used a couple of interesting data structures or constructs in this recipe. Both are common in functional programming or Lisp, but neither has made its way into more mainstream programming. We should spend a minute with them.

Navigating structures with zippers

The first thing that happens to the parsed XML file is that it gets passed to clojure.zip/xml-zip. This takes Clojure's native XML data structure and turns it into something that can be navigated quickly using commands such as clojure.zip/down and clojure.zip/right. As a functional programming language, Clojure prefers immutable data structures, and zippers provide an efficient, natural way to navigate and modify a tree-like structure, such as an XML document.

Zippers are very useful and interesting, and understanding them can help you understand how to work with immutable data structures. For more information on zippers, the Clojure-doc page on them is helpful (http://clojure-doc.org/articles/tutorials/parsing_xml_with_zippers.html). But if you would rather dive into the deep end, see Gérard Huet's paper, The Zipper (http://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced-fp/docs/huet-zipper.pdf).
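To make the navigation concrete, here is a small sketch of moving around a zipper and making an "edit". It assumes a hand-written tree in the :tag/:attrs/:content shape that clojure.xml/parse produces; the tree and the replacement surname are invented for illustration.

```clojure
(require '[clojure.zip :as zip])

;; A hand-written tree in the shape clojure.xml/parse returns.
(def tree
  {:tag :data :attrs nil
   :content [{:tag :person :attrs nil
              :content [{:tag :given-name :attrs nil :content ["Gomez"]}
                        {:tag :surname :attrs nil :content ["Addams"]}]}]})

(def root-loc (zip/xml-zip tree))

;; Move down to <person>, down to <given-name>, then right to <surname>.
(def surname-loc (-> root-loc zip/down zip/down zip/right))

;; zip/node returns the tree node at the current location.
(def surname-tag (:tag (zip/node surname-loc)))

;; Editing does not mutate anything: zip/edit returns a new location,
;; and zip/root rebuilds a new tree that contains the change.
(def new-tree
  (-> surname-loc
      (zip/edit assoc :content ["Frump"])
      zip/root))
```

After this runs, new-tree holds the changed surname while tree still holds the original one, which is the point of pairing zippers with immutable data.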

Processing in a pipeline

Also, we've used the ->> macro to express our process as a pipeline. Deeply nested function calls have to be read from the inside out, or right to left. This macro lets us write the steps from left to right (or top to bottom) in the order they run, which makes the process's data flow and series of transformations much clearer.

We can do this in Clojure because of its macro system. ->> simply rewrites the calls into Clojure's native, nested format when the form is macro-expanded, before it is evaluated. The first parameter to the macro is inserted into the next expression as its last argument. The resulting form is inserted into the third expression as its last argument, and so on, until the end of the form. Let's trace this through a few steps. Say we start off with the (->> x first (map count) (apply +)) expression. The following lists each intermediate step as Clojure builds the final expression:

(->> x first (map count) (apply +))
(->> (first x) (map count) (apply +))
(->> (map count (first x)) (apply +))
(apply + (map count (first x)))
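The rewriting can be checked at the REPL. The sketch below expands a threading form like the one traced above with clojure.walk/macroexpand-all, using count (Clojure's built-in for the length of a collection) as the mapped function. The sample value bound to x is made up for illustration.

```clojure
(require '[clojure.walk :as walk])

;; Fully expand the quoted threading form without evaluating it.
;; The result is the nested call (apply + (map count (first x))).
(def expanded
  (walk/macroexpand-all '(->> x first (map count) (apply +))))

;; A made-up input: a sequence whose first element is a sequence of strings.
(def x [["ab" "cde"] ["f"]])

;; The threaded and the nested spellings compute the same value.
(def threaded (->> x first (map count) (apply +)))
(def nested   (apply + (map count (first x))))
```

Because the expansion happens before evaluation, the threaded version costs nothing extra at run time; it is only a different way of writing the same nested call.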

Comparing XML and JSON

XML and JSON (from the Reading JSON data into Incanter datasets recipe) are very similar. Arguably, much of the popularity of JSON is driven by disillusionment with XML's verbosity.

When we're dealing with these formats in Clojure, the biggest difference is that JSON is converted directly to native Clojure data structures that mirror the data, such as maps and vectors. XML, meanwhile, is read into generic element maps that reflect the structure of XML, not the structure of the data.

In other words, the keys of the maps for JSON will come from the domain: first_name or age, for instance. However, the keys of the maps for XML will come from the data format (:tag, :attrs, or :content in clojure.xml), and the tag and attribute names, which do come from the domain, appear as values. This extra level of indirection makes XML more unwieldy.
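A short sketch makes the difference in shape visible. To avoid assuming any particular JSON library, the JSON side is written by hand as the kind of map a JSON parser would typically return; the XML side is parsed with clojure.xml from an inline string. The field values here are invented for illustration.

```clojure
(require '[clojure.xml :as xml])
(import 'java.io.ByteArrayInputStream)

;; Hand-written stand-in for parsed JSON such as
;; {"first_name": "Gomez", "age": 40}. No JSON library is used here.
(def from-json {:first_name "Gomez" :age 40})

;; The same record as XML, parsed with clojure.xml.
(def from-xml
  (xml/parse
    (ByteArrayInputStream.
      (.getBytes
        "<person><first_name>Gomez</first_name><age>40</age></person>"
        "UTF-8"))))

;; Keys of the JSON-style map come from the domain.
(def json-keys (set (keys from-json)))

;; Keys of the XML element come from the format.
(def xml-keys (set (keys from-xml)))

;; The domain names only show up one level down, as :tag values.
(def xml-field-names (map :tag (:content from-xml)))
```

Note also that the age arrives as the string "40" on the XML side, so any numeric conversion is left to us.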
