Oracle Coherence 3.5

By: Aleksandar Seovic

Overview of this book

Scalability, performance, and reliability have to be designed into an application from the very beginning, as there may be substantial cost or implementation consequences if they need to be added down the line. This indispensable book will teach you how to achieve these things using Oracle Coherence, a leading data grid product on the market.

Authored by leading Oracle Coherence authorities, this essential book will teach you how to use Oracle Coherence to build high-performance applications that scale to hundreds of machines and have no single points of failure. You will learn when and how to use Coherence features such as distributed caching, parallel processing, and real-time events within your application, and understand how Coherence fits into the overall application architecture.

Oracle Coherence provides a solid architectural foundation for scalable, high-performance, and highly available enterprise applications, through features such as distributed caching, parallel processing, distributed queries and aggregations, real-time events, and the elimination of single points of failure. However, in order to take full advantage of these features, you need to design your application for Coherence from the beginning.

Based on the authors' extensive knowledge of Oracle Coherence and how to use it in the real world, this book will provide you with all the information you need in order to leverage various Coherence features properly. It contains a collection of best practice-based solutions and mini-frameworks that will allow you to be more productive from the very beginning.

The early chapters cover basics such as installation guidelines and caching topologies, before moving on to domain model implementation guidelines, distributed queries and aggregations, parallel processing, and real-time events.
Towards the end, you learn how to integrate Coherence with different persistence technologies, how to access Coherence from platforms other than Java, and how to test and debug classes and applications that depend on Coherence.
Table of Contents (22 chapters)
Oracle Coherence 3.5
Credits
Foreword
About the author
Acknowledgements
About the co-authors
About the reviewers
Preface
12. The Right Tool for the Job
Index

Achieving high availability


The last thing we need to talk about is availability. At the most basic level, in order to make the application highly available we need to remove all single points of failure from the architecture. In order to do that, we need to treat every single component as unreliable and assume that it will fail sooner or later.

It is important to realize that the availability of the system as a whole can be defined as the product of the availability of its tightly coupled components:

AS = A1 * A2 * ... * An

For example, if we have a web server, an application server, and a database server, each of which is available 99% of the time, the expected availability of the system as a whole is only 97%:

0.99 * 0.99 * 0.99 = 0.970299 = 97%

This reflects the fact that if any of the three components should fail, the system as a whole will fail as well. By the way, if you think that 97% availability is not too bad, consider this: 97% availability implies that the system will be out of commission 11 days every year, or almost one day a month!
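To make the arithmetic concrete, here is a quick sketch of the series-availability formula applied to the three-server example above (illustrative code only, not part of Coherence):

```python
def series_availability(*availabilities):
    """Availability of tightly coupled components: AS = A1 * A2 * ... * An."""
    product = 1.0
    for a in availabilities:
        product *= a
    return product

# Web server, application server, and database server,
# each available 99% of the time.
a_system = series_availability(0.99, 0.99, 0.99)
print(round(a_system, 6))              # 0.970299, or roughly 97%

# 3% unavailability adds up to almost 11 days of downtime per year.
print(round((1 - a_system) * 365, 1))  # 10.8
```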

We can do two things to improve the situation:

  • We can add redundancy to each component to improve its availability.

  • We can decouple components in order to better isolate the rest of the system from the failure of an individual component.

The latter is typically achieved by introducing asynchrony into the system. For example, you can use messaging to decouple a credit card processing component from the main application flow—this will allow you to accept new orders even if the credit card processor is temporarily unavailable.

As mentioned earlier, Coherence is able to queue updates for a failed database and write them asynchronously when the database becomes available. This is another good example of using asynchrony to provide high availability.

Although asynchronous operations are a great way to improve the availability and scalability of an application, as well as its perceived performance, there is a limit to the number of tasks that can be performed asynchronously in a typical application. If the customer wants to see product information, you will have to retrieve the product from the data store, render the page, and send the response to the client synchronously.

To make synchronous operations highly available, our only option is to make each component redundant.

Adding redundancy to the system

In order to explain how redundancy helps improve the availability of a single component, we need to introduce another obligatory formula or two (I promise this is the only chapter in which you will see any formulas):

F = F1 * F2 * ... * Fn

Where F is the likelihood of failure of a redundant set of components as a whole, and F1 through Fn are the likelihoods of failure of individual components, which can be expressed as:

Fc = 1 – Ac

Going back to our previous example, if the availability of a single server is 99%, the likelihood it will fail is 1%:

Fc = 1 – 0.99 = 0.01

If we make each layer in our architecture redundant by adding another server to it, we can calculate new availability for each component and the system as a whole:

Ac = 1 - (0.01 * 0.01) = 1 – 0.0001 = 0.9999 = 99.99%

As = 0.9999 * 0.9999 * 0.9999 = 0.9997 = 99.97%

Basically, by adding redundancy to each layer, we have reduced the application's expected downtime from 11 days to approximately two and a half hours per year, which is not nearly as bad.
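The redundancy formulas above can be sketched in a few lines (again, a hypothetical calculation for illustration):

```python
def redundant_availability(a, copies):
    """A redundant set fails only if every copy fails: F = F1 * F2 * ... * Fn,
    where each Fc = 1 - Ac, so the set's availability is 1 - (1 - A) ** copies."""
    return 1 - (1 - a) ** copies

# Two 99%-available servers in each of the three layers.
layer = redundant_availability(0.99, 2)
print(round(layer, 4))                    # 0.9999

# Three redundant layers in series.
system = layer ** 3
print(round(system, 4))                   # 0.9997

# Downtime shrinks from roughly 11 days to about 2.6 hours per year.
print(round((1 - system) * 365 * 24, 1))  # 2.6
```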

Redundancy is not enough

Making components redundant is only the first step on the road to high availability. To get to the finish line, we also need to ensure that the system has enough capacity to handle the failure under the peak load.

Developers often assume that if an application uses scale-out architecture for the application tier and a clustered database for persistence, it is automatically highly available. Unfortunately, this is not the case.

If you determine during load testing that you need N servers to handle the peak load, and you would like the system to remain operational even if X servers fail at the same time, you need to provision the system with N+X servers. Otherwise, if the failure occurs during the peak period, the remaining servers will not be able to handle the incoming requests and either or both of the following will happen:

  • The response time will increase significantly, making performance unacceptable

  • Some users will receive "500 - Service Busy" errors from the web server

In either case, the application is essentially not available to the users.

To illustrate this, let's assume that we need five servers to handle the peak load. If we provision the system with only five servers and one of them fails, the system as a whole will fail. Essentially, by not provisioning excess capacity to allow for failure, we are turning "application will fail if all 5 servers fail" into "application will fail if any of the 5 servers fail". The difference is huge—in the former scenario, assuming 99% availability of individual servers, system availability is almost 100%. However, in the latter it is only 95%, which translates to more than 18 days of downtime per year.
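The difference between the two scenarios is easy to verify numerically (an illustrative sketch, not Coherence-specific code):

```python
def all_must_work(a, n):
    """No spare capacity: the system works only if every one of n servers works."""
    return a ** n

def any_can_fail(a, n):
    """Excess capacity: the system works as long as at least one of n servers works."""
    return 1 - (1 - a) ** n

a = 0.99  # availability of an individual server

# "Fails only if all 5 servers fail": practically 100% available.
print(round(any_can_fail(a, 5), 6))  # 1.0

# "Fails if any of the 5 servers fails": only about 95% available.
tight = all_must_work(a, 5)
print(round(tight, 3))               # 0.951
print(round((1 - tight) * 365, 1))   # 17.9, i.e. roughly 18 days of downtime a year
```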

Coherence and availability

Oracle Coherence provides an excellent foundation for highly available architecture. It is designed for availability; it assumes that any node can fail at any point in time and guards against the negative impact of such failures.

This is achieved by data redundancy within the cluster. Every object stored in the Coherence cache is automatically backed up to another node in the cluster. If a node fails, the backup copy is simply promoted to the primary and a new backup copy is created on another node.

This implies that updating an object in the cluster has some overhead in order to guarantee data consistency. The cache update is not considered successful until the backup copy is safely stored. However, unlike clustered databases that essentially lock the whole cluster to perform write operations, the cost of write operations with Coherence is constant regardless of the cluster size. This allows for exceptional scalability of both read and write operations and very high throughput.

However, as we discussed in the previous section, sizing Coherence clusters properly is extremely important. If the system is running at full capacity, failure of a single node could have a ripple effect. It would cause other nodes to run out of memory as they tried to fail over the data managed by the failed node, which would eventually bring the whole cluster down.

It is also important to understand that, although your Coherence cluster is highly available, that doesn't automatically make your application as a whole highly available. You need to identify and remove the remaining single points of failure by making sure that hardware devices such as load balancers, routers, and network switches are redundant, and that your database server is redundant as well. The good news is that if you use Coherence to scale the data tier and reduce the load on the database, making the database redundant will likely be much easier and cheaper.

As a side note, there are many stories that serve as a testament to Coherence's availability, including one in which the database server went down over the weekend without the application users noticing anything. My favorite, however, is an anecdote about a recent exchange between the Coherence support team and a long-time Coherence user.

This particular customer had been using Coherence for almost five years. When a bug was discovered that affected the release they were using, the support team sent the patch and installation instructions to the customer. They received a polite reply:

You have got to be kidding me!? We haven't had a second of downtime in the last 5 years. We are not touching a thing!