One of the simplest data formats is comma-separated values (CSV), and it's everywhere. Excel reads and writes CSV directly, as do most databases. And because it's just plain text, it's easy to generate or read from any programming language.
First, let's make sure we have the correct libraries loaded. The Leiningen (https://github.com/technomancy/leiningen) project file, project.clj, should contain these dependencies (although you may be able to use more up-to-date versions):
:dependencies [[org.clojure/clojure "1.4.0"]
               [incanter/incanter-core "1.4.1"]
               [incanter/incanter-io "1.4.1"]]
Also, in your REPL or in your file, include these lines:
(use 'incanter.core 'incanter.io)
Finally, I have a file named data/small-sample.csv that contains the following data:
Gomez,Addams,father
Morticia,Addams,mother
Pugsley,Addams,brother
Wednesday,Addams,sister
…
You can download this file from http://www.ericrochester.com/clj-data-analysis/data/small-sample.csv. There's a version with a header row at http://www.ericrochester.com/clj-data-analysis/data/small-sample-header.csv.
Use the incanter.io/read-dataset function:
user=> (read-dataset "data/small-sample.csv")
[:col0 :col1 :col2]
["Gomez" "Addams" "father"]
["Morticia" "Addams" "mother"]
["Pugsley" "Addams" "brother"]
["Wednesday" "Addams" "sister"]
…
If we have a header row in the CSV file, then we include :header true in the call to read-dataset:
user=> (read-dataset "data/small-sample-header.csv" :header true)
[:given-name :surname :relation]
["Gomez" "Addams" "father"]
["Morticia" "Addams" "mother"]
["Pugsley" "Addams" "brother"]
Using Clojure and Incanter makes a lot of common tasks easy. This is a good example of that.
We've taken some external data, in this case from a CSV file, and loaded it into an Incanter dataset. In Incanter, a dataset is a table, similar to a sheet in a spreadsheet or a database table. Each column holds one field of data, and each row holds one observation. Some columns will contain string data (all of the columns in this example did), some will contain dates, and some will contain numeric data. Incanter tries to detect automatically when a column contains numeric data and converts it to a Java int or double. Incanter takes away a lot of the pain of importing data.
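To make the table model concrete, here is a minimal sketch that builds a small dataset in memory from the sample rows and inspects its columns and rows. It assumes incanter.core is on the classpath as set up earlier; the name family is illustrative and not part of the recipe. A dataset returned by read-dataset should be inspectable the same way.

```clojure
(use 'incanter.core)

;; Illustrative in-memory dataset mirroring the first rows of the sample file.
(def family
  (dataset [:given-name :surname :relation]
           [["Gomez" "Addams" "father"]
            ["Morticia" "Addams" "mother"]
            ["Pugsley" "Addams" "brother"]
            ["Wednesday" "Addams" "sister"]]))

(col-names family)    ; the column (field) names
(nrow family)         ; number of rows (observations)
(ncol family)         ; number of columns (fields)
($ :relation family)  ; the values stored in one column
```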
If we don't want to involve Incanter (when you don't want the added dependency, for instance), data.csv is also simple (https://github.com/clojure/data.csv). We'll use this library in later chapters, for example, in the recipe Lazily processing very large datasets of Chapter 2, Cleaning and Validating Data.
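As a rough sketch of what reading CSV with data.csv can look like, the following assumes org.clojure/data.csv has been added to the project's dependencies (the version is left to you). The names read-rows, sample-text, and rows are hypothetical helpers for illustration. Unlike Incanter, data.csv returns plain vectors of strings and does not convert numeric columns.

```clojure
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

;; Hypothetical helper: read every row of a CSV file into memory.
;; read-csv is lazy, so doall forces the rows before with-open closes the reader.
(defn read-rows [path]
  (with-open [reader (io/reader path)]
    (doall (csv/read-csv reader))))

;; read-csv also accepts a string, which keeps this demo independent of the file.
(def sample-text "Gomez,Addams,father\nMorticia,Addams,mother")

;; Each row is a vector of strings.
(def rows (csv/read-csv sample-text))

(first rows)
```

With the sample file in place, (read-rows "data/small-sample.csv") would return the file's rows in the same shape.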
See also Chapter 6, Working with Incanter Datasets.