Clojure Data Analysis Cookbook

By: Eric Rochester

Overview of this book

Data is everywhere and it's increasingly important to be able to gain insights that we can act on. Using Clojure for data analysis and collection, this book will show you how to gain fresh insights and perspectives from your data with an essential collection of practical, structured recipes.

"The Clojure Data Analysis Cookbook" presents recipes for every stage of the data analysis process. Whether scraping data off a web page, performing data mining, or creating graphs for the web, this book has something for the task at hand.

You'll learn how to acquire data, clean it up, and transform it into useful graphs which can then be analyzed and published to the Internet. Coverage includes advanced topics like processing data concurrently, applying powerful statistical techniques like Bayesian modelling, and even data mining algorithms such as K-means clustering, neural networks, and association rules.

Defining new Cascalog operators


Cascalog comes with a number of operators; however, for most analyses, we'll need to define our own.

Cascalog defines several categories of operators, each with different properties. Some run in the Map phase of processing, and others run in the Reduce phase. Map-phase operators can take advantage of a number of extra optimizations, so if we can push some of our processing into that stage, we'll get better performance. In this recipe, we'll see which categories of operators are Map-side and which are Reduce-side. We'll also provide an example of each, and see how they fit into the larger processing model.
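To make the Map-side versus Reduce-side distinction concrete, here is a rough analogy in plain Clojure. It deliberately uses no Cascalog functions: the sample rows and every function name below are hypothetical, and the sketch only illustrates why per-row work can be parallelized freely while per-group work needs all of a group's rows collected first.

```clojure
(require '[clojure.string :as str])

;; Made-up sample rows for illustration; not the recipe's dataset.
(def rows
  [{:group "a" :text "red blue"}
   {:group "a" :text "green"}
   {:group "b" :text "red"}])

;; Map-side style: one row in, one row out.
;; The result depends only on the row it is given.
(defn add-upper [row]
  (assoc row :upper (str/upper-case (:text row))))

;; Map-side style: one row in, zero or more rows out.
(defn split-words [row]
  (for [w (str/split (:text row) #"\s+")]
    {:group (:group row) :word w}))

;; Map-side style: a predicate that keeps or drops a single row.
(defn short-word? [row]
  (<= (count (:word row)) 4))

;; Reduce-side style: every row of a group must be gathered
;; before the group's answer can be computed.
(defn count-per-group [rows]
  (into {}
        (for [[g members] (group-by :group rows)]
          [g (count members)])))

(def words (mapcat split-words rows))
(def short-words (filter short-word? words))
(def counts (count-per-group short-words))
```

In this sketch, `add-upper`, `split-words`, and `short-word?` never look beyond the single row they receive, which is the property that lets a framework run such steps in the Map phase. `count-per-group` cannot produce a result until grouping has happened, which is the kind of work that belongs to the Reduce phase.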

Getting ready

For this recipe, we'll use the same dependencies and includes that we used in the Distributed processing with Cascalog and Hadoop recipe. We'll also use the Doctor Who companion data from that recipe.

How to do it…

As I mentioned, Cascalog allows us to specify a number of different operator types. Each type is used...