Learning Cascading

The Cascading framework

Now let's look very briefly at Cascading. Cascading provides a much higher-level API on top of MapReduce and, as we shall soon see, on top of many other types of big data "fabrics," such as Tez, Spark, and Storm. Additionally, Cascading provides an abstraction that insulates our code from the underlying fabric and from the data source type and protocol. It also gives us an integrated orchestration layer that allows us to build sophisticated sequences of jobs, and it provides rich out-of-the-box functionality. MapReduce programmers realize very quickly that much of the code they write is dedicated to some very basic things, such as preparing the sort keys and handling the comparisons for the sort. As we saw previously, MapReduce is verbose! Even such a simple task requires five classes and well over 100 lines of code.

Fundamentally, in this code, little is actually occurring other than key preparation, partitioning, sorting, and counting. Cascading handles this by assigning the input and output sources and destinations, creating the sort keys, performing the sort, and then performing the counting. It accomplishes all this in a single class with an astonishing 20 lines of code! We need to know a little bit more about Cascading first, so after we gain this understanding in a later chapter, we will look at this code in detail and then return to compare the differences, outlining the efficiency gains.

To a large degree, Cascading hides much of the complexity of MapReduce and of big data programming in general. Now, to be perfectly clear, Cascading has its own set of complexities, but it provides a standardized approach that has a smaller surface area than all of Hadoop. Cascading is, in fact, a domain-specific language (DSL) for Hadoop that encapsulates map, reduce, partitioning, sorting, and analytical operations in a concise form. This DSL is written in a fluent style, which makes both writing the code and understanding the result much easier.


A fluent interface (a style often used together with the builder pattern) is one in which each method call returns an object (called its context) on which the next call can operate. This style allows concise code to be written, where the resulting lines resemble a "scripting language," as shown here:

    Company company = new Company("XYZ Corp");
    // setAddress returns the Company itself, so further calls could be chained onto it.
    company.setAddress("Company Address");
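To make the chaining idea concrete, here is a minimal, self-contained sketch. The `Company` class is hypothetical and exists only for illustration; `setPhone` and `describe` are our own assumptions and are not part of any Cascading API.

```java
class FluentDemo {
    public static void main(String[] args) {
        // Every setter hands back the same Company, so the calls read as one sentence.
        Company company = new Company("XYZ Corp")
                .setAddress("Company Address")
                .setPhone("555-0100");
        System.out.println(company.describe());
    }
}

// Hypothetical class, written only to illustrate a fluent interface.
final class Company {
    private final String name;
    private String address = "";
    private String phone = "";

    Company(String name) {
        this.name = name;
    }

    // Returning "this" (the context) is what makes chaining possible.
    Company setAddress(String address) {
        this.address = address;
        return this;
    }

    Company setPhone(String phone) {
        this.phone = phone;
        return this;
    }

    String describe() {
        return name + ", " + address + ", " + phone;
    }
}
```

The only design decision that matters here is the return type of the setters: a `void` setter would force one statement per call, whereas returning the object lets a whole configuration be expressed as a single chained expression.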

The execution graph and flow planner

When Cascading executes its job code, it is really preparing an execution graph. The vertices of this graph are the processes that must be performed, such as reading and transforming records, performing sorts, performing aggregations, and writing results. Its edges are the data exchanges that occur between these processing steps. After this graph is prepared, Cascading plans how it will be executed. The planner is specific to the underlying framework. In the preceding example, we use HadoopFlowConnector to do this. Herein lies the beauty of Cascading: there are other connectors.
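Before turning to those other connectors, it may help to picture the execution graph itself. The following is a toy sketch in plain Java and is not Cascading's planner: `StepGraph` and its methods are hypothetical names of our own. Vertices are processing steps, edges are data exchanges, and a topological sort (Kahn's algorithm) yields one order in which the steps could run.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ExecutionGraphDemo {
    public static void main(String[] args) {
        StepGraph graph = new StepGraph();
        graph.addEdge("read", "transform");
        graph.addEdge("transform", "sort");
        graph.addEdge("sort", "aggregate");
        graph.addEdge("aggregate", "write");
        System.out.println(graph.executionOrder());
    }
}

// Hypothetical toy graph: vertices are processing steps, edges are data exchanges.
class StepGraph {
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    private final Map<String, Integer> inputCount = new LinkedHashMap<>();

    // Records that the output of "from" is consumed by "to".
    void addEdge(String from, String to) {
        downstream.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        downstream.computeIfAbsent(to, k -> new ArrayList<>());
        inputCount.putIfAbsent(from, 0);
        inputCount.merge(to, 1, Integer::sum);
    }

    // Kahn's algorithm: a step becomes runnable once all of its inputs are produced.
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inputCount);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : remaining.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String step = ready.poll();
            order.add(step);
            for (String next : downstream.get(step)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != remaining.size()) {
            throw new IllegalStateException("The graph contains a cycle");
        }
        return order;
    }
}
```

A real planner does far more than order the steps (it also decides how steps are grouped into jobs for the target framework), but the separation is the same: first describe the work as a graph, then decide how to execute it.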

LocalFlowConnector can run the job without needing Hadoop at all. The job simply runs as a connected set of Java programs. Using this connector, developers can test their code in isolation, which is very valuable during development.

In the future, you can see how TezConnector, SparkConnector, and others can be created. So, what we've seen is that we can write one body of code and then execute it on differing frameworks. We have magically freed our code from being frozen in place! We have now gained the ability to move to newer, more performant, and more feature-rich big data frameworks without requiring expensive rewrites.
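The idea of writing a job once and handing it to different connectors can be sketched in plain Java. This is only an analogy under our own assumptions: `ToyJob`, `ToyConnector`, `ToyLocalConnector`, and `ToyParallelConnector` are hypothetical names and do not correspond to Cascading classes. The job definition never mentions how it will be executed; the connector decides that.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

class ConnectorDemo {
    public static void main(String[] args) {
        // The job is defined once, with no reference to how it will be executed.
        ToyJob job = new ToyJob()
                .then(String::trim)
                .then(String::toUpperCase);
        List<String> input = List.of(" alpha ", "beta ", " gamma");

        System.out.println(new ToyLocalConnector().run(job, input));
        System.out.println(new ToyParallelConnector().run(job, input));
    }
}

// Hypothetical job definition: an ordered list of per-record transformations.
class ToyJob {
    private final List<UnaryOperator<String>> steps = new ArrayList<>();

    ToyJob then(UnaryOperator<String> step) {
        steps.add(step);
        return this;
    }

    String applyAll(String record) {
        String result = record;
        for (UnaryOperator<String> step : steps) {
            result = step.apply(result);
        }
        return result;
    }
}

// Hypothetical connector abstraction: it decides how a job is executed.
interface ToyConnector {
    List<String> run(ToyJob job, List<String> input);
}

// Runs every record sequentially in the calling thread.
class ToyLocalConnector implements ToyConnector {
    public List<String> run(ToyJob job, List<String> input) {
        List<String> output = new ArrayList<>();
        for (String record : input) {
            output.add(job.applyAll(record));
        }
        return output;
    }
}

// Runs records on a parallel stream; collecting an ordered stream keeps input order.
class ToyParallelConnector implements ToyConnector {
    public List<String> run(ToyJob job, List<String> input) {
        return input.parallelStream().map(job::applyAll).collect(Collectors.toList());
    }
}
```

Both connectors produce the same result for the same job, which is the property that matters: swapping the execution engine requires changing one line, not rewriting the job.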

How Cascading produces MapReduce jobs

After the execution graph is produced, the creation of MapReduce jobs is relatively easy. Almost everything that we tell Cascading is translated into mappers, reducers, partitioners, and comparators. The lower-level semantics of performing record mapping, multifield sorts, and so on are handled for us.
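To get a feel for the four roles named above, here is a small in-memory word count in plain Java. It is a sketch under our own assumptions, not code that Cascading generates: `ToyWordCount` and its methods are hypothetical, and everything runs in a single process. The partitioning formula mirrors the one used by Hadoop's default hash partitioner.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapReduceSketch {
    public static void main(String[] args) {
        List<String> lines = List.of("to be or", "not to be");
        System.out.println(ToyWordCount.run(lines, 2));
    }
}

// Hypothetical in-memory stand-in for the pieces of a MapReduce job.
class ToyWordCount {
    // Mapper role: prepare one key per word.
    static List<String> map(String line) {
        List<String> keys = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                keys.add(word.toLowerCase());
            }
        }
        return keys;
    }

    // Partitioner role: route a key to exactly one reducer by its hash.
    static int partition(String key, int reducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % reducers;
    }

    // Returns one sorted key-to-count map per reducer.
    static List<Map<String, Integer>> run(List<String> lines, int reducers) {
        // Comparator role: each reducer sees its keys in sorted order.
        Comparator<String> sortOrder = Comparator.naturalOrder();
        List<Map<String, Integer>> perReducer = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            perReducer.add(new TreeMap<String, Integer>(sortOrder));
        }
        for (String line : lines) {
            for (String key : map(line)) {
                // Reducer role: count the occurrences of each key.
                perReducer.get(partition(key, reducers)).merge(key, 1, Integer::sum);
            }
        }
        return perReducer;
    }
}
```

Even in this toy form, most of the code is plumbing (key preparation, routing, ordering) rather than the counting itself, which is exactly the plumbing that Cascading takes off our hands.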

Additionally, as we shall soon see, Cascading provides a rich set of high-level functions to do basic ETL work, such as regular expression parsing, data transformation, data validation, error handling, and much more.