Clojure High Performance Programming

By: Shantanu Kumar

Overview of this book

Clojure is a young, dynamic, functional programming language that runs on the Java Virtual Machine. It is built with performance, pragmatism, and simplicity in mind. Like most general purpose languages, Clojure’s features have different performance characteristics that one should know in order to write high performance code.

Clojure High Performance Programming is a practical, to-the-point guide that shows you how to evaluate the performance implications of different Clojure abstractions, learn about their underpinnings, and apply the right approach for optimum performance in real-world programs.

This book discusses the Clojure language in the light of performance factors that you can exploit in your own code.

You will also learn about hardware and JVM internals that impact Clojure’s performance. Key features include performance vocabulary, performance analysis, optimization techniques, and how to apply these to your programs. You will also find detailed information on Clojure's concurrency, state-management, and parallelization primitives.

This book is your key to writing high performance Clojure code using the right abstraction, in the right place, using the right technique.

Use case classification


Performance requirements and priorities vary across use cases, so we need to determine what constitutes acceptable performance for each kind. We therefore classify use cases to identify their performance model. When it comes to details, there is no surefire performance recipe for any kind of use case, but it certainly helps to study their general nature. Note that in real life, the use cases listed in this section may overlap.

User-facing software

The performance of user-facing applications is strongly linked to the user's anticipation. The difference of a good number of milliseconds may not be perceptible by the user, but at the same time, a wait of more than a few seconds may not be taken kindly. One important element to normalize the anticipation is to engage the user by providing duration-based feedback. A good idea to deal with such a scenario would be to start the task asynchronously in the background and poll it from the UI layer to generate duration-based feedback for the user. Another way could be to incrementally render the results to the user to even out the anticipation.
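The background-and-poll idea can be sketched as follows. This is a minimal illustration in Python rather than a recommended implementation; `slow_report`, `run_with_feedback`, the step count, and the polling interval are all hypothetical names and values chosen for the example. The task runs on a worker thread while the caller, standing in for the UI layer, polls it and collects progress messages.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_report(steps, progress):
    """Hypothetical long-running task; records how many steps are finished."""
    for i in range(steps):
        time.sleep(0.01)          # stand-in for real work
        progress["done"] = i + 1
    return "report ready"


def run_with_feedback(steps=20, poll_interval=0.05):
    """Start the task in the background and poll it, as a UI layer might."""
    progress = {"done": 0}
    messages = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_report, steps, progress)
        while not future.done():
            # A real UI would redraw a progress bar here instead.
            messages.append(f"{progress['done']}/{steps} steps complete")
            time.sleep(poll_interval)
        return future.result(), messages


result, messages = run_with_feedback()
print(result)
print(messages[-1])
```

The number of messages depends on timing, so only the final result is deterministic. Incremental rendering, the other approach mentioned above, would hand partial results to the user instead of progress counts.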

Anticipation is not the only factor in user-facing performance. Common techniques such as staging or pre-computation of data and other general optimization techniques can go a long way to improve the user experience with respect to performance. Bear in mind that all kinds of user-facing interfaces fall into this use case category: web, mobile web, GUI, command-line, touch, voice-operated, and gesture-based.

Computational and data-processing tasks

Non-trivial compute-intensive tasks demand a proportional amount of computational resources. The CPU, cache, and memory, along with the efficiency and parallelizability of the algorithms, all play a part in determining the performance. When the computation is combined with distribution over a network, or with reading from or staging to disk, I/O bound factors come into play. This class of workloads can be further subclassified into more specific use cases.

CPU bound

A CPU bound computation is limited by the CPU cycles spent on executing it. Processing arithmetic in a loop, small matrix multiplication, determining whether a number is a Mersenne prime, and so on would be considered CPU bound jobs. If the algorithm complexity is linked to N, such as O(N) and O(N²), then performance depends on how big N is and how many CPU cycles each step takes. For parallelizable algorithms, performance of such tasks may be enhanced by assigning multiple CPU cores to the task. On virtual hardware, performance may be impacted if CPU cycles are available in bursts.
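As an illustration of a CPU bound job, the Mersenne prime check can be sketched with the Lucas-Lehmer test, shown here in Python. The function names are made up for this example, and the function takes the exponent p rather than the number 2^p - 1 itself. The loop does nothing but arithmetic on a value that stays in memory, so its running time is governed by how many iterations there are and how many cycles each one costs.

```python
def is_prime(n):
    """Trial division; adequate for the small exponents used here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def is_mersenne_prime(p):
    """Lucas-Lehmer test: is 2**p - 1 prime?"""
    if not is_prime(p):
        return False          # 2**p - 1 is composite whenever p is composite
    if p == 2:
        return True           # 2**2 - 1 == 3; the loop below assumes odd p
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):    # p - 2 modular squarings: the CPU bound part
        s = (s * s - 2) % m
    return s == 0


exponents = [p for p in range(2, 32) if is_mersenne_prime(p)]
print(exponents)
```

Checking different exponents is independent work, so this is also the kind of task that could be spread over several CPU cores.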

Memory bound

A memory bound task is limited by the availability and bandwidth of computer memory; examples include large text processing, list processing, and so on. Note that more CPU resources cannot help when memory is the bottleneck, and vice versa. Lack of available memory may force you to process smaller chunks of data at a time, even if you have enough CPU resources at your disposal. If the maximum speed of your memory is X and your algorithm on a single CPU core accesses memory at a speed of X/3, the multicore performance of your algorithm cannot exceed 3 times the current performance, no matter how many CPU cores you assign to it. Memory architecture, for example SMP and NUMA, contributes to the memory bandwidth in multicore computers. Performance with respect to memory is also subject to page faults.
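The X and X/3 arithmetic above can be written down as a small helper. This is a simplified model, assuming that every core demands the same memory rate and that nothing else limits scaling; the function name and the figure of 30 GB/s are hypothetical.

```python
def bandwidth_bound_speedup(cores, per_core_rate, max_rate):
    """Best-case speedup over one core when memory bandwidth is the limit."""
    if cores < 1 or per_core_rate <= 0 or max_rate <= 0:
        raise ValueError("cores must be >= 1 and rates must be positive")
    # Speedup grows with the core count until the memory bus is saturated.
    return min(cores, max_rate / per_core_rate)


X = 30.0  # hypothetical maximum memory speed, in GB/s
for cores in (1, 2, 3, 8):
    print(cores, "cores ->", bandwidth_bound_speedup(cores, X / 3, X))
```

Under this model, adding cores helps up to three cores and then stops helping, which matches the ceiling described in the text.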

Cache bound

A task is cache bound when its speed is constrained by the amount of cache available. When a task retrieves values from a small number of repeated memory locations, for example small matrix multiplication, the values may be cached and fetched from there.

Note

Typically, CPUs have multiple layers of cache, and the performance will be at its best when the processed data fits in the cache. Processing will still happen, albeit slower, when the data does not fit into the cache. These will be covered in greater detail in Chapter 4, Host Performance.

It is possible to make the most of the cache using cache-oblivious algorithms. Running more concurrent cache bound or memory bound threads than there are CPU cores is likely to flush the instruction pipeline, as well as the cache, at each context switch.
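To give a feel for what a cache-oblivious algorithm looks like, here is a sketch of a recursive matrix transpose in Python. It keeps halving the longer side of the region until the block is small, so at some level of the recursion the block fits in each layer of cache without the code ever knowing the cache sizes. The cutoff of 16 elements is an arbitrary constant chosen for this example, not a tuned value, and Python lists are not laid out contiguously in memory, so the sketch shows the structure of the technique rather than a measurable speedup.

```python
def transpose(src, dst, r0, r1, c0, c1):
    """Write the transpose of src rows r0..r1, columns c0..c1 into dst."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= 0 or cols <= 0:
        return
    if rows * cols <= 16:                 # small block: copy directly
        for r in range(r0, r1):
            for c in range(c0, c1):
                dst[c][r] = src[r][c]
    elif rows >= cols:                    # split the longer side in half
        mid = r0 + rows // 2
        transpose(src, dst, r0, mid, c0, c1)
        transpose(src, dst, mid, r1, c0, c1)
    else:
        mid = c0 + cols // 2
        transpose(src, dst, r0, r1, c0, mid)
        transpose(src, dst, r0, r1, mid, c1)


n_rows, n_cols = 9, 13
src = [[r * n_cols + c for c in range(n_cols)] for r in range(n_rows)]
dst = [[None] * n_rows for _ in range(n_cols)]
transpose(src, dst, 0, n_rows, 0, n_cols)
print(dst[0][:3])
```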

Input/Output (I/O) bound

An I/O bound task would go faster if the I/O subsystem it depends on goes faster. Disk or storage as well as network are the most commonly used I/O subsystems in data processing. Other I/O devices are serial ports, USB-connected card readers, and so on. An I/O bound task may consume very few CPU cycles. Depending on the speed of the device, connection pooling, data compression, asynchronous handling, caching, and so on may help in performance. One notable aspect of I/O bound tasks is that the performance is usually dependent on the time spent waiting for connection (or disk seek) and the amount of serialization we do, but hardly on the other resources.

In practice, many data processing workloads are a combination of CPU bound, memory bound, cache bound, and I/O bound tasks. The performance of such mixed workloads effectively depends on the even distribution of CPU, cache, memory, and I/O resources over the duration of the operation. While all system resources are finite, some I/O resources may be particularly limited in bandwidth and latency. A bottleneck situation arises only when one resource gets too busy to make way for another.

Online transaction processing (OLTP)

OLTP systems process business transactions on demand. They could work as a backend system for a user-facing ATM, a point-of-sale terminal, a network-connected ticket counter, an ERP system, and so on. OLTP systems are characterized by low latency, availability, and data integrity. Because they run day-to-day business transactions, any interruption or outage is likely to have a direct and immediate impact on sales or service. Such systems are expected to be designed for resiliency rather than delayed recovery from failures. When the performance objective is unspecified, you may want to consider graceful degradation as a strategy.

It is a common mistake to ask OLTP systems to answer analytical queries, something that they are not optimized for. An informed programmer should know the capability of the system and suggest design changes as per the requirements.

Online analytical processing (OLAP)

OLAP systems are designed to answer analytical queries in a short time. They typically get data from OLTP operations and their data model is optimized for querying. OLAP systems basically provide for consolidation (roll-up), drill-down, and slicing and dicing of data for analytical purposes. They often use specialized data stores that can optimize ad-hoc analytical queries on the fly. It is important for such databases to provide pivot-table-like capability. Often, an OLAP cube is used to get faster access to analytical data.
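The three operations named above can be illustrated on a handful of in-memory rows. This is a toy sketch in Python, not how an OLAP store is implemented; the `SALES` rows, the dimension names, and the helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, amount)
SALES = [
    ("EU", "book", "Q1", 120),
    ("EU", "book", "Q2", 80),
    ("EU", "video", "Q1", 50),
    ("US", "book", "Q1", 200),
    ("US", "video", "Q2", 70),
]
DIMENSIONS = ("region", "product", "quarter")


def aggregate(rows, by):
    """Sum the amounts, grouped by the named dimensions."""
    idx = [DIMENSIONS.index(d) for d in by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)


def slice_rows(rows, dimension, value):
    """Keep only the rows where one dimension has a fixed value."""
    i = DIMENSIONS.index(dimension)
    return [row for row in rows if row[i] == value]


by_region = aggregate(SALES, ["region"])                      # roll-up
by_region_product = aggregate(SALES, ["region", "product"])   # drill-down
q1_by_region = aggregate(slice_rows(SALES, "quarter", "Q1"), ["region"])  # slice

print(by_region)
print(by_region_product)
print(q1_by_region)
```

An OLAP cube precomputes aggregates like these so that queries do not have to scan the fact rows each time.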

Feeding OLTP data into OLAP systems may entail workflows and multistage batch processing. The performance concern of such systems is to efficiently deal with large quantities of data while also dealing with inevitable failures and recovery.

Batch processing

Batch processing is the automated execution of predefined jobs. These are typically bulk jobs and are executed during off-peak hours. Batch processing may involve one or more stages of job processing. Often, batch processing is combined with workflow automation, where some workflow steps are executed offline. Many batch processing tasks stage and prepare data for the next stage of processing to pick up.

Batch jobs are generally optimized for the utmost utilization of computing resources. Since there is little to moderate demand for lower latency in particular subtasks, these systems tend to optimize for throughput. A lot of batch jobs involve large I/O processing, and they are often distributed over a cluster. Due to distribution, data locality is preferred when processing the jobs; that is, data and processing should be local in order to avoid network latency in reading/writing data.

Structured approach for performance

In practice, the performance of non-trivial applications is rarely a function of coincidence or prediction. For many projects, performance is a requirement rather than an option, which makes a structured approach all the more important. Capacity planning, determining performance objectives, performance modeling, measurement, and monitoring are crucial to achieving performance.

Tuning a poorly designed system to perform as well as one that was designed well from the ground up is significantly hard, if not practically impossible. In order to meet a performance goal, performance objectives should be known before the application is designed. Performance objectives are stated in terms of latency, throughput, resource utilization, and workload. These terms are discussed in the Performance vocabulary section in this chapter.

The resource cost can be identified in terms of application scenarios, such as browsing of products, adding products to the shopping cart, and checkout. Creating workload profiles that represent users performing various operations is usually helpful.

Performance modeling is a reality check of whether the application design would support the performance objectives. It includes performance objectives, application scenarios, constraints, measurements (benchmark results), workload objectives, and, if available, the performance baseline. It is not a replacement for measurement and load testing; rather, the model is validated using these. The performance model may include performance test cases to assert the performance characteristics of the application scenarios.
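A performance test case of this kind might look like the sketch below, written in Python. It assumes the objective is stated as a latency percentile for one scenario; the checkout latencies are invented sample data, and the nearest-rank percentile is one common convention among several. In a real project the samples would come from a benchmark or load test run.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of measurements."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_objective(latencies_ms, pct, limit_ms):
    """True if the given percentile latency is within the stated limit."""
    return percentile(latencies_ms, pct) <= limit_ms


# Hypothetical latencies for the checkout scenario, in milliseconds
checkout_ms = [110, 95, 130, 105, 480, 120, 101, 99, 140, 115]

print(percentile(checkout_ms, 90), meets_objective(checkout_ms, 90, 200))
print(percentile(checkout_ms, 95), meets_objective(checkout_ms, 95, 200))
```

With this sample, the same data passes a 90th percentile objective of 200 ms and fails a 95th percentile one, which shows why the objective has to name the percentile and not just the limit.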

Deploying an application to production almost always needs some form of capacity planning. It has to take into account the performance objectives for today and the foreseeable future. It requires an idea of application architecture and an understanding of how the external factors translate into internal workload. It also requires informed expectations about the responsiveness and the level of service to be provided by the system. Often, capacity planning is done early in a project to mitigate the risk of provisioning delays.
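As a rough illustration of translating external factors into internal workload, the following Python sketch turns a peak user count into a server count. It is a back-of-the-envelope model under stated assumptions: load spreads evenly across servers, each server sustains a known request rate, and a fixed fraction of that rate is held back as headroom. Every figure and the function name are hypothetical.

```python
import math


def required_servers(peak_users, requests_per_user_per_min,
                     per_server_rps, headroom=0.25):
    """Estimate how many servers are needed to carry the peak request rate."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    peak_rps = peak_users * requests_per_user_per_min / 60
    usable_rps = per_server_rps * (1 - headroom)   # capacity kept in reserve
    return math.ceil(peak_rps / usable_rps)


today = required_servers(12000, 3, 100)       # today's assumed peak
next_year = required_servers(18000, 3, 100)   # assumed growth in users
print(today, next_year)
```

An estimate like this is only a starting point; it should be checked against measurements and load tests of the actual application.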