Book Image

Clojure for Data Science

By : Henry Garner
Book Image

Clojure for Data Science

By: Henry Garner

Overview of this book

Table of Contents (18 chapters)
Clojure for Data Science
Credits
About the Author
Acknowledgments
About the Reviewer
www.PacktPub.com
Preface
Index

Summary


In this chapter, we learned some of the fundamental techniques of distributed data processing and saw how the functions used locally for data processing, map and reduce, are powerful ways of processing even very large quantities of data. We learned how Hadoop can scale unbounded by the capabilities of any single server by running functions on smaller subsets of the data whose outputs are themselves combined to finally produce a result. Once you understand the tradeoffs, this "divide and conquer" approach toward processing data is a simple and very general way of analyzing data on a large scale.

We saw both the power and limitations of simple folds to process data using both Clojure's reducers and Tesser. We've also begun exploring how Parkour exposes more of Hadoop's underlying capabilities.

In the next chapter, we'll see how to use Hadoop and Parkour to address a particular machine learning challenge—clustering a large volume of text documents.