Book Image

Cassandra Design Patterns - Second Edition

By : Rajanarayanan Thottuvaikkatumana
Book Image

Cassandra Design Patterns - Second Edition

By: Rajanarayanan Thottuvaikkatumana

Overview of this book

If you are new to Cassandra but well-versed in RDBMS modeling and design, then it is natural to model data in the same way in Cassandra, resulting in poorly performing applications and losing the real purpose of Cassandra. If you want to learn to make the most of Cassandra, this book is for you. This book starts with strategies to integrate Cassandra with other legacy data stores and progresses to the ways in which a migration from RDBMS to Cassandra can be accomplished. The journey continues with ideas to migrate data from cache solutions to Cassandra. With this, the stage is set and the book moves on to some of the most commonly seen problems in applications when dealing with consistency, availability, and partition tolerance guarantees. Cassandra is exceptionally good at dealing with temporal data and patterns such as the time-series pattern and log pattern, which are covered next. Many NoSQL data stores fail miserably when a huge amount of data is read for analytical purposes, but Cassandra is different in this regard. Keeping analytical needs in mind, you’ll walk through different and interesting design patterns. No theoretical discussions are complete without a good set of use cases to which the knowledge gained can be applied, so the book concludes with a set of use cases you can apply the patterns you’ve learned.
Table of Contents (15 chapters)

A brief overview of Cassandra


Where do you start with Cassandra? The best place is to look at the new application development requirements and take it from there. Look at cases where there is a need to denormalize the RDBMS tables and keep all the data items together, which would have been distributed if you were to design the same solution in an RDBMS. If an application is writing a set of data items together into a data store, why do you want to separate them out? No need to worry about redundancy. This is the new NoSQL philosophy. This is the new way to look at data modeling in NoSQL data stores. Cassandra supports fast writes and reads. Initial versions of Cassandra had some performance problems, but a huge number of optimizations have gone into making the latest version of Cassandra perform much better for reads as well as writes. There is no problem with consuming space because the secondary storage is getting cheaper and cheaper. A word of caution here is that, it is fine to write the data into Cassandra, whatever the level of redundancy, but the data access use cases have to be thought through carefully before getting involved in the Cassandra data model. The data is stored in the disk, to be read at a later date. These reads have to be efficient, and it gives the required data in the desired sorted order.

In a nutshell, you should decide how do you want to store the data and make sure that it is giving you the data in the desired sort order. There is no hard and fast rule for this. It is purely up to the application requirements. That is, the other shift in the thought process.

Instead of thinking from the pure data model perspective, start thinking in terms of the application's perspective. How the data is generated by the application, what are the read requirements, what are the write requirements, what is the response time expected out of some of the use cases, and so on. Depending on these aspects, design the data model. In the big data world, the application becomes the first class citizen and the data model leaves the driving seat in the application design. Design the data model to serve the needs of the applications.

In any organization, new reporting requirements come all the time. The major challenge to generate reports is the underlying data store. In the RDBMS world, reporting is always a challenge. You may have to join multiple tables to generate even simple reports. Even though the RDBMS objects such as views, stored procedures, and indexes may be used to get the desired data for the reports, when the report is being generated, the query plan is going to be very complex most of the time. Consumption of the processing power is another need to consider when generating such reports on the fly. Because of these complexities, many times, for reporting requirements, it is common to keep separate tables containing data exported from the transactional tables. Martin Fowler emphasizes the need for separating reporting data from the operations data in his article, Reporting Database. He states:

"Most Enterprise Applications store persistent data with a database. This database supports operational updates of the application's state, and also various reports used for decision support and analysis. The operational needs and the reporting needs are, however, often quite different - with different requirements from a schema and different data access patterns. When this happens it's often a wise idea to separate the reporting needs into a reporting database, which takes a copy of the essential operational data but represents it in a different schema".

This is a great opportunity to start with NoSQL stores such as Cassandra as a reporting data store.

Data aggregation and summarization are the common requirements in any organization. This helps to control data growth in by storing only the summary statistics and moving the transactional data into archives. Often, these aggregated and summarized data are used for statistical analysis. In many websites, you can see the summary of your data instantaneously when you log in to the site or when you perform transactions. Some of the examples include the available credit limit of credit cards, the available number of text messages, remaining international call minutes in a mobile phone account, and so on. Making the summary accurate and easily accessible is a big challenge. Most of the time, data aggregation and reporting go hand in hand. The aggregated data is used heavily in reports. The aggregation process speeds up the queries to a great extent. In RDBMS, it is always a challenge to aggregate data, and you can find new requirements coming all the time. This is another place you can start with NoSQL stores such as Cassandra.

Now, we are going to discuss some aspects of the denormalization, reporting, and aggregation of data using Cassandra as the preferred NoSQL data store.