Clojure Data Analysis Cookbook

By: Eric Rochester
Overview of this book

Data is everywhere, and it's increasingly important to be able to gain insights that we can act on. Using Clojure for data analysis and collection, this book will show you how to gain fresh insights and perspectives from your data with an essential collection of practical, structured recipes.

"The Clojure Data Analysis Cookbook" presents recipes for every stage of the data analysis process. Whether scraping data off a web page, performing data mining, or creating graphs for the web, this book has something for the task at hand.

You'll learn how to acquire data, clean it up, and transform it into useful graphs which can then be analyzed and published to the Internet. Coverage includes advanced topics like processing data concurrently, applying powerful statistical techniques like Bayesian modelling, and even data mining algorithms such as K-means clustering, neural networks, and association rules.

Reading CSV data into Incanter datasets


One of the simplest data formats is comma-separated values (CSV). And it's everywhere. Excel reads and writes CSV directly, as do most databases. And because it's really just plain text, it's easy to generate or access it using any programming language.

Getting ready

First, let's make sure we have the correct libraries loaded. The Leiningen (https://github.com/technomancy/leiningen) project file, project.clj, should contain these dependencies (although you may be able to use more up-to-date versions):

:dependencies [[org.clojure/clojure "1.4.0"]
               [incanter/incanter-core "1.4.1"]
               [incanter/incanter-io "1.4.1"]]

Also, in your REPL or in your file, include these lines:

(use 'incanter.core
     'incanter.io)

Finally, I have a file named data/small-sample.csv that contains the following data:

Gomez,Addams,father
Morticia,Addams,mother
Pugsley,Addams,brother
Wednesday,Addams,sister
…

You can download this file from http://www.ericrochester.com/clj-data-analysis/data/small-sample.csv. There's a version with a header row at http://www.ericrochester.com/clj-data-analysis/data/small-sample-header.csv.

How to do it…

  1. Use the incanter.io/read-dataset function:

    user=> (read-dataset "data/small-sample.csv")
    [:col0 :col1 :col2]
    ["Gomez" "Addams" "father"]
    ["Morticia" "Addams" "mother"]
    ["Pugsley" "Addams" "brother"]
    ["Wednesday" "Addams" "sister"]
    …
  2. If we have a header row in the CSV file, then we include :header true in the call to read-dataset:

    user=> (read-dataset "data/small-sample-header.csv" :header true)
    [:given-name :surname :relation]
    ["Gomez" "Addams" "father"]
    ["Morticia" "Addams" "mother"]
    ["Pugsley" "Addams" "brother"]
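
Once a dataset is loaded, a few incanter.core functions are enough to check that it has the shape we expect. The following is a minimal sketch rather than part of the recipe: dataset-shape and family are hypothetical names introduced here for illustration, and the commented-out calls assume the header version of the sample file has been downloaded to data/small-sample-header.csv.

```clojure
(use 'incanter.core
     'incanter.io)

;; Hypothetical helper (not part of Incanter): summarize a dataset's shape.
(defn dataset-shape
  [ds]
  {:columns (col-names ds)   ; column names, for example from the header row
   :rows    (nrow ds)        ; number of observations
   :cols    (ncol ds)})      ; number of fields

;; Assumes data/small-sample-header.csv exists relative to the working directory.
(comment
  (def family (read-dataset "data/small-sample-header.csv" :header true))
  (dataset-shape family)
  ;; Pull out every value in the :given-name column.
  ($ :given-name family))
```

Evaluating the forms inside comment at the REPL, one at a time, is a quick way to confirm that the header row was picked up.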

How it works…

Using Clojure and Incanter makes a lot of common tasks easy. This is a good example of that.

We've taken some external data, in this case from a CSV file, and loaded it into an Incanter dataset. In Incanter, a dataset is a table, similar to a sheet in a spreadsheet or a database table. Each column has one field of data, and each row has an observation of data. Some columns will contain string data (all of the columns in this example did), some will contain dates, some numeric data. Incanter tries to detect automatically when a column contains numeric data and converts it to a Java int or double. Incanter takes away a lot of the pain of importing data.
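
To make the idea of automatic numeric detection concrete, here is a small sketch in plain Clojure. It is not Incanter's actual implementation: parse-field is a hypothetical helper, and it produces a long rather than an int because that is Clojure's default integer type. It simply tries an integer parse, then a floating-point parse, and otherwise leaves the string alone.

```clojure
;; Hypothetical helper illustrating the idea; this is not Incanter's code.
(defn parse-field
  "Return s as a long if it parses as one, else as a double, else unchanged."
  [^String s]
  (try
    (Long/parseLong s)
    (catch NumberFormatException _
      (try
        (Double/parseDouble s)
        (catch NumberFormatException _
          s)))))

;; A numeric-looking field becomes a number; anything else stays a string.
(map parse-field ["42" "3.14" "Addams"])
;; => (42 3.14 "Addams")
```

Applied to every field in a column, a function like this is all it takes to turn text from a CSV file into values we can do arithmetic on.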

There's more…

If we don't want to involve Incanter (for instance, when we don't want the added dependency), data.csv is also simple to use (https://github.com/clojure/data.csv). We'll use this library in later chapters, for example, in the recipe Lazily processing very large datasets of Chapter 2, Cleaning and Validating Data.
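
As a rough sketch of what that looks like, the following reads a CSV file with data.csv alone. It assumes org.clojure/data.csv has been added to the project's dependencies, and read-csv-file is a hypothetical helper name, not something the library provides.

```clojure
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

;; Hypothetical helper: read a whole CSV file into a vector of row vectors.
(defn read-csv-file
  [path]
  (with-open [reader (io/reader path)]
    ;; read-csv is lazy, so realize the rows before the reader is closed.
    (vec (csv/read-csv reader))))

;; Assumes data/small-sample.csv exists relative to the working directory.
(comment
  (read-csv-file "data/small-sample.csv"))
```

Unlike read-dataset, read-csv returns every field as a string and does no header handling or numeric conversion, so any of that is left to us.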

See also