Book Image

IBM WebSphere eXtreme Scale 6

By : Anthony Chaves
Book Image

IBM WebSphere eXtreme Scale 6

By: Anthony Chaves

Overview of this book

A data grid is a means of combining computing resources. Data grids provide a way to distribute object storage and add capacity on demand in the form of CPU, memory, and network resources from additional servers. All three resource types play an important role in how fast data can be processed, and how much data can be processed at once. WebSphere eXtreme Scale provides a solution to scalability issues through caching and grid technology. Working with a data grid requires new approaches to writing highly scalable software; this book covers both the practical eXtreme Scale libraries and design patterns that will help you build scalable software. Starting with a blank slate, this book assumes you don't have experience with IBM WebSphere eXtreme Scale. It is a tutorial-style guide detailing the installation of WebSphere eXtreme Scale right through to using the developer libraries. It covers installation and configuration, and discusses the reasons why a data grid is a viable middleware layer. It also covers many different ways of interacting with objects in eXtreme Scale. It will also show you how to use eXtreme Scale in new projects, and integrate it with relational databases and existing applications. This book covers the ObjectMap, Entity, and Query APIs for interacting with objects in the grid. It shows client/server configurations and interactions, as well as the powerful DataGrid API. DataGrid allows us to send code into the grid, which can be run where the data lives. Equally important are the design patterns that go alongside using a data grid. This book covers the major concepts you need to know that prevent your client application from becoming a performance bottleneck. By the end of the book, you'll be able to write software using the eXtreme Scale APIs, and take advantage of a linearly scalable middleware layer.
Table of Contents (15 chapters)
IBM WebSphere eXtreme Scale 6
Credits
About the Author
About the Reviewers
Preface

Data grid basics


One part of a data grid is the object cache. An object cache stores the serialized form of Java objects in memory. This approach is an alternative to the most common form of using a relational database for storage. A relational database stores data in column form, and needs object-relational mapping to turn objects into tuples and back again. An object cache only deals with Java objects and requires no mapping to use. A class must be serializeable though.

Caching objects is done using key/value tables that look like a hash table data structure. In eXtreme Scale terminology, this hash table data structure is a class that implements the com.ibm.websphere.objectgrid.BackingMap interface. A BackingMap can work like a simple java.util.Map, used within one application process. It can also be partitioned across many dedicated eXtreme Scale processes. The APIs for working with an unpartitioned BackingMap and a partitioned BackingMap are the same, which makes learning how to use eXtreme Scale easy. The programming interface is the same whether our application is made up of one process or many.

Using a data grid in our software requires some trade-offs. With the great performance of caching objects in memory, we still need to be aware of the consequences of our decisions. In some cases, we trade faster performance for predictable scalability. One of the most important factors driving data grid adoption is predictable scalability in working with growing data sets and more simultaneous client applications.

An important feature of data grids that separates them from simple caches is database integration. Even though the object cache part of a data grid can be used as primary storage, it's often useful to integrate with a relational database. One reason we want to do this is that reporting tools based on RDBMS's are far more capable than reporting tools for data grids today. This may change in the coming years, but right now, we use reporting tools tied in to a database.

WXS uses Loaders to integrate with databases. Though not limited to databases, Loaders are most commonly used to integrate with a database. A Loader can take an object in the object cache and call an existing ORM framework that transforms an object and saves it to a database. Using a Loader makes saving an object to a database transparent to the data grid client. When the client puts the object into the object cache, the Loader pushes the object through the ORM framework behind the scenes. If you are writing to the cache, then the database is a thing of the past.

Using a Loader can make the object cache the primary point of object read/write operations in an application. This greatly reduces the load on a database server by making the cache act as a shock absorber. Finding an object is as simple as looking it up in the cache. If it's not there, then the Loader looks for it in the database. Writing objects to the cache may not touch the database in the course of the transaction. Instead, a Loader can store updated objects and then batch update the database after a certain period of time or after certain number of objects are written to the cache. Adding a data grid between an application and database can help the database serve more clients when those clients are eXtreme Scale clients since the load is not directly on the database server:

This topology is in contrast to one where the database is used directly by client applications. In the following topology the limiting factor in the number of simultaneous clients is the database.

Applications can start up, load a grid full of data, and then shut down while the data in the grid remains there for use by another application. Applications can put objects in the grid for caching purposes and remove them upon application completion. Or, the application can leave them and those objects will far outlive the process that placed them in the grid.

Notice how we are dealing with Java objects. Our cache is a key/value store where keys and values are POJOs. In contrast, a simple cache may limit keys and values to strings. An object in a data grid cache is the serialized form of our Java object. Putting an object from our application into the cache only requires serialization. Mapping to a data grid specific type is not required, nor does the object require a transform layer. Getting an object out of the cache is just as easy. An object need only be deserialized once in the client application process. It is ready for use upon deserialization and does not require any transformation or mapping before use. This is in contrast to persisting an object by using an ORM framework where the framework generates a series of SQL queries in order to save or load the object state. By storing our objects in the grid, we also free ourselves from calling our ORM to save the objects to the database if we choose. We can use the data grid cache as our primary data store or we can take advantage of the database integration features of eXtreme Scale and have the grid write our objects to the database for us.

Data grids typically don't use hard disks or tapes for storing objects. Instead, they store objects in the memory, which may seem obvious based on the name in-memory data grid. Storing objects in the memory has the advantage of keeping objects in a location with much lower access time compared to physical storage. A network hop to connect to a database is going to take the same amount of time as a network hop to a data grid instance. The remote server storing or retrieving of the data from the grid is much faster than the equivalent operation on a database due to the nature of the storage medium. A network hop is required in a distributed deployment. This means that an object isn't initially available in the same address space where it will be used. This is one of those trade-offs mentioned earlier. We trade initial locality of reference for predictable performance over a large data set. What works for caching small data sets may not be a good idea when caching large data sets.

Though the access time of storing objects in memory is an advantage over a database hit, it's hardly a new concept. Developers have been creating in-memory caches for a long time. Looking at a single-threaded application, we may have the cache implemented as a simple hash-map (see below). Examples of things we might cache are objects that result from CPU-intensive calculations. By caching the result, we save ourselves the trouble of recalculating it again the next time it is needed. Another good candidate for caching is storing large amounts of read-only data.

In a single-threaded application, we have one cache available to put data. The amount of data that fits in our cache is limited by the amount of JVM heap size available to it. Depending on the JVM settings, garbage collection may become an issue if large numbers of objects are frequently removed from the cache and go out of the application scope. However, this typically isn't an issue.

This cache is located in the same address space and thread as the code that operates on objects in the cache. Cached objects are local to the application, and accessing them is about as fast as we can get. This works great for data sets that fit in the available heap space, and when no other processes need to access these cached objects.

Building multi-threaded applications changed the scope of the cache a little bit. In a single-threaded application, we have one cache per thread. As we introduce more threads, this method will not continue to work for long:

Each cache contains the same key/value pairs as the other caches.

As each of our N threads has its own cache, the JVM heap size must now be shared among N caches. The most prominent problem with this method of caching is that data will be duplicated in multiple caches. Loading data into one cache will not load it into the others. Depending on the eviction policy, we could end up with a cache hit rate that is close to 0 percent over time. Rather than maintaining a cache per thread, developers started to use a singleton cache:

The singleton cache is protected by the Singleton design pattern in Java. The Singleton design pattern (almost) assures us that only one instance of a particular object exists in the JVM, and it also provides us with a canonical reference to that object. In this way, we can create a hash-map to act as our cache if one doesn't already exist, and then get that same instance every time we look for it. With one cache, we won't duplicate data in multiple caches and each thread has access to all of the data in the cache.

With the introduction of the java.util.concurrent package, developers have safer options available for caching objects between multiple threads. Again, these strategies work best for data sets that fit comfortably in one heap. Running multiple processes of the same application will cache the same data in each process:

What if our application continues to scale out to 20 running instances to do the processing for us? We're once again in the position of maintaining multiple caches that contain the same data set (or subset of the same data set). When we have a large data set that does not fit in one heap, our cache hit rate may approach 0 percent over time. Each application instance cache can be thought of as a very small window into the entire data set. As each instance sees only a small portion of the data set, our cache hit rate per instance is lower than an application with a larger window into the data set. Locality of reference for an object most likely requires a database hit to get the object and then cache it. As our locality of reference is already poor, we may want to insert a shared cache to provide a larger window into the data set. Getting an object from an object cache is faster than retrieving the object from a database, provided that object is already cached.

What we really want is an object cache where any thread in any application process can access the data. We need something that looks like this:

A data grid is made up of many different processes running on different servers. These processes are data grid, or eXtreme Scale processes, not our application processes. For each eXtreme Scale process, we have one more JVM heap available to an object cache. eXtreme Scale handles the hard work of distributing objects across the different data grid processes, making our cache look like one large logical cache, instead of many small caches. This provides the largest window possible into our large data set. Caching more objects is as simple as starting more eXtreme Scale processes on additional servers.

We still have the same number of application instances, but now the cache is not stored inside the application process. It's no longer a hash-map living inside the same JVM alongside the business logic, nor is it stored in an RDBMS. Instead, we have conscripted several computers to donate their memory. This lets us create a distributed cache reachable by any of our application instances. Though the cache is distributed across several computers, there is no data duplication. The data is still stored as a map where the keys are stored across different partitions, such that the data is distributed as evenly as possible.

When using an object cache, the goal is to provide a window as large as possible into a large data set. We want to cache as much data as we can in memory so that any application can access it. We accept that this is slower than caching locally because using a small cache does not produce acceptable cache hit rates. A network hop is a hop, whether it is to connect to a database or data grid. A distributed object cache needs to be faster than a database for read and write operations, only after paying the overhead of making a network connection.

Each partition in a distributed object cache holds a subset of the keys that our applications use for objects. No cache partition stores all of the keys. Instead, eXtreme Scale determines which partition to store an object in, based on it's key. Again, the hard work is handled by eXtreme Scale. We don't need to have knowledge of which partition an object is stored in, or how to connect to that partition. We interact with the object cache as if it were a java.util.Map and eXtreme Scale handles the rest:

In-memory data grids can do a lot more than object caching, though that's the use we will explore first. Throughout this book, we will explore additional features that make up a data grid and put them to use in several sample applications.