
Clojure Data Analysis Cookbook - Second Edition

By: Eric Richard Rochester


Distributing data with Apache HDFS


One of the best features of Hadoop is the Hadoop Distributed File System (HDFS). It creates a network of computers that automatically synchronize their data, making our input data available to every machine in the cluster. Not having to worry about how the data gets distributed makes our lives much easier.

For this recipe, we'll put a file into HDFS and read it back out using Cascalog, line by line.
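To give a rough idea of where this is headed, here is a minimal sketch of reading a file from HDFS line by line with Cascalog. It is not the recipe's final code. The namespace `hdfs-sketch.core`, the helper names `hdfs-uri` and `print-lines`, the host `localhost`, the port `8020`, and the path `/user/me/data.txt` are all assumptions made for illustration, and the sketch assumes the file has already been copied into HDFS (for example, with `hadoop fs -put data.txt /user/me/data.txt`).

```clojure
(ns hdfs-sketch.core
  (:require [cascalog.api :refer [?<- stdout hfs-textline]]))

;; Hypothetical helper: builds an hdfs:// URI from a NameNode host, a port,
;; and an absolute path. The values you pass depend on your own cluster.
(defn hdfs-uri
  [host port path]
  (str "hdfs://" host ":" port path))

;; Hypothetical helper: runs a Cascalog query that reads the text file at
;; `uri` and sends each line to standard output.
(defn print-lines
  [uri]
  (?<- (stdout)
       [?line]
       ((hfs-textline uri) ?line)))

(comment
  ;; Assumes a NameNode listening on localhost:8020 and a file that was
  ;; uploaded to /user/me/data.txt beforehand.
  (print-lines (hdfs-uri "localhost" 8020 "/user/me/data.txt")))
```

The call is wrapped in `comment` so that loading the namespace does not start a job. Evaluate it from the REPL once Hadoop is running and the file is in place.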

Getting ready

The previous recipes in this chapter used the version of Hadoop that Leiningen downloaded as one of Cascalog's dependencies. For this recipe, however, we'll need to have Hadoop installed and running separately. Download it from http://hadoop.apache.org/ and install it, or use your operating system's package manager if it provides a Hadoop package. Alternatively, Cloudera has a VM with a single-node Hadoop cluster that you can download and use (https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads#CDHDownloads-CDH4PackagesandDownloads).

You'll still need to configure everything...