Learning Cascading

Foreword

The Cascading project was started in 2007 to fulfill the promise that Apache Hadoop was indirectly making to people like me: that we can dramatically simplify data-oriented application development and deployment. This can be done not only from a tools perspective but, more importantly, from an organizational perspective. Take a thousand machines and make them look like one: one storage layer and one computing layer. This promise means I would never have to ask our IT group for another storage array, more disk space, or another overpriced box to manage. My team and I could just throw our data and applications at the cluster and move on. The problem with this story is that we would have to develop our code against the MapReduce model, forcing us to think in MapReduce. I've only ever written one MapReduce application, and it was a terrible experience. I did everything wrong.

Cascading was originally designed to help you, the developer, use your existing skills and do the right thing initially or, more importantly, understand whether your assumptions about your data were right or wrong so that you can quickly compensate.

In the past couple of years, we have seen the emergence of new models improving on what MapReduce started. These models are driven by new expectations around latency and scale. Thinking in MapReduce is very difficult, but at least you can reason in it. Some of the newer models do not provide enough mental scaffolding to even help you reason out your problem.

Cascading has been evolving to help insulate you from the need to intimately know these new models while allowing businesses to leverage them as they become stable and available. This work started with Cascading 2.0 a few years ago. Cascading 3.0, still under development at the time of this writing, goes much deeper, making a new promise to developers: they can write data-oriented applications once and, with little effort, adapt them to new models and infrastructure.

It should come as no surprise that Cascading has also grown into its own ecosystem. It supports nearly every type of data source and has been ported to multiple JVM-based programming languages, including Python (Jython), Ruby (JRuby), Scala (via Scalding), and Clojure (via Cascalog). Cascading has a vibrant community of developers and users alike. Adopted by companies such as Twitter, Etsy, and eBay, Cascading is the foundation for many advanced forms of analytics, from machine learning to genomics, as well as for new data-oriented languages, DSLs, and APIs. It is also the foundation of many commercial products that advertise Hadoop compatibility.

I am very enthusiastic about this book. Here, at Concurrent, we see the need for this book because it consolidates so much information in one place. This book provides details that are not always well documented or can be difficult to find. It contains many concrete coding examples, "how-to" tips, systems integration strategies, performance and tuning steps, debugging techniques, and advice on future directions to take. Also included is a real-life, end-to-end application that can serve as a reference architecture for development using Cascading. This book contains much that is valuable to developers, managers, and system administrators.

Chris K. Wensel

CTO, Concurrent, Inc.