Performance optimization can be divided into a set of disciplines, each focused on a different sphere of control. The following diagram shows these disciplines:
At the highest level, we will concern ourselves with the Cascading code that we write. This level of performance is concerned with the pipes, operations, and data flows that we assemble.
The level below this is concerned with the fabric that our job will run on: in this case, a particular version of Cascading and Hadoop, or alternatively Tez, Spark, or whatever else may come along. Here, we will be concerned with how we partition our data, the number of reducers that we use, buffer sizes, and other parameters that we can supply to Hadoop.
Finally, at the lowest level, we will be concerned with the operating system and hardware configuration on which our cluster actually runs.
We will focus almost exclusively on our applications, Cascading, and the underlying fabric, which in our case is Hadoop.
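To make the fabric level more concrete, the following is a minimal sketch of how parameters such as the number of reducers and the sort buffer size might be collected before a job is launched. It uses only java.util.Properties so that it runs on its own. The FabricTuning class and its buildProperties method are hypothetical names introduced for illustration, and the property keys assume Hadoop 2.x naming (older releases used names such as mapred.reduce.tasks and io.sort.mb). In a real Cascading job, a property set like this would typically be handed to the flow connector when the flow is created.

```java
import java.util.Properties;

// Hypothetical helper for illustration only; not part of Cascading or Hadoop.
class FabricTuning {

    // Collects fabric-level settings into a Properties object.
    // The keys below assume Hadoop 2.x property names.
    static Properties buildProperties(int reducers, int sortBufferMb) {
        if (reducers < 0) {
            throw new IllegalArgumentException("reducers must be >= 0");
        }
        if (sortBufferMb <= 0) {
            throw new IllegalArgumentException("sortBufferMb must be > 0");
        }
        Properties props = new Properties();
        // Number of reduce tasks requested for the job.
        props.setProperty("mapreduce.job.reduces", Integer.toString(reducers));
        // Size, in megabytes, of the in-memory buffer used when sorting map output.
        props.setProperty("mapreduce.task.io.sort.mb", Integer.toString(sortBufferMb));
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProperties(8, 256);
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + " = " + props.getProperty(key));
        }
    }
}
```

Keeping these values in one place makes it easier to vary them between runs and to see which fabric-level setting affected performance, separately from any change to the Cascading code itself.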