Clojure Data Analysis Cookbook - Second Edition

By: Eric Richard Rochester

Initializing Cascalog and Hadoop for distributed processing


Hadoop was developed, largely at Yahoo!, as an open source implementation of Google's MapReduce programming model. Since then, it has become one of the most widely tested and used systems for distributed processing.
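To make the MapReduce model concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) as a word count in plain Clojure. Hadoop distributes these phases across many machines. The function names below (`map-phase`, `shuffle-phase`, `reduce-phase`, `word-count`) are hypothetical and chosen only for illustration; they are not part of Hadoop, Cascading, or Cascalog.

```clojure
(require '[clojure.string :as str])

;; Map phase: emit a [word 1] pair for every word in a line.
(defn map-phase [line]
  (for [word (str/split (str/lower-case line) #"\s+")
        :when (seq word)]          ; skip empty strings from leading whitespace
    [word 1]))

;; Shuffle phase: group the emitted values by key.
(defn shuffle-phase [pairs]
  (reduce (fn [acc [k v]] (update acc k (fnil conj []) v))
          {}
          pairs))

;; Reduce phase: collapse each key's values into a single result.
(defn reduce-phase [grouped]
  (into {} (for [[k vs] grouped] [k (reduce + vs)])))

;; Illustration only: everything runs in one process, in memory.
(defn word-count [lines]
  (-> (mapcat map-phase lines)
      shuffle-phase
      reduce-phase))

(word-count ["the quick fox" "the lazy dog"])
;; returns a map in which "the" maps to 2 and every other word maps to 1
```

In a real cluster, the map and reduce functions run on different nodes and the framework handles the shuffle, which is the part that tools such as Cascading and Cascalog hide behind a higher-level interface.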

Hadoop sits at the center of a larger ecosystem of complementary tools, including the Hadoop Distributed File System (HDFS) and Pig, a language for writing jobs that run on Hadoop.

One tool that makes working with Hadoop easier is Cascading. It provides a workflow-like layer on top of Hadoop that makes many data processing and analysis tasks much easier to express. Cascalog is a Clojure-idiomatic interface to Cascading and, ultimately, Hadoop.

This recipe will show you how to access and query data in Clojure sequences using Cascalog.

Getting ready

First, we have to list our dependencies in the Leiningen project.clj file:

(defproject distrib-data "0.1.0"
  :dependencies [[org...