Mastering Apache Cassandra 3.x - Third Edition

By: Aaron Ploetz, Tejaswi Malepati, Nishant Neeraj

Overview of this book

With ever-increasing rates of data creation, storing data quickly and reliably has become a necessity. Apache Cassandra is well suited to building fault-tolerant and scalable databases. Mastering Apache Cassandra 3.x teaches you how to build and architect your clusters, configure and work with your nodes, and program in a high-throughput environment, helping you understand the power of Cassandra in light of its new features. Once you've covered a brief recap of the basics, you'll move on to deploying and monitoring a production setup, then optimizing it and integrating it with other software. You'll work with the advanced features of CQL and the new storage engine in order to understand how they function on the server side. You'll explore the integration and interaction of Cassandra components, and then examine features such as the token allocation algorithm, CQL3, vnodes, lightweight transactions, and data modelling in detail. Finally, you'll get to grips with Apache Spark. By the end of this book, you'll be able to analyse big data, and build and manage high-performance databases for your application.

Introduction to Cassandra

Apache Cassandra is a highly available, distributed, partitioned row store. It is one of the more popular NoSQL databases used by both small and large companies all over the world to store and efficiently retrieve large amounts of data. While there are licensed, proprietary versions available (which include enterprise support), Cassandra is also a top-level project of the Apache Software Foundation, and has deep roots in the open source community. This makes Cassandra a proven and battle-tested approach to scaling high-throughput applications.

High availability

Cassandra's design is premised on the points outlined in the Dynamo: Amazon's Highly Available Key-value Store paper (https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf). Specifically, when you have large networks of interconnected hardware, something is always in a state of failure. In reality, every piece of hardware being in a healthy state is the exception, rather than the rule. Therefore, it is important that a data storage system is able to deal with (and account for) issues such as network or disk failure.

Depending on the Replication Factor (RF) and required consistency level, a Cassandra cluster is capable of sustaining operations with one or two nodes in a failure state. For example, let's assume that a cluster with a single data center has a keyspace configured for an RF of three. This means that the cluster contains three copies of each row of data. If an application queries with a consistency level of ONE, only one of those three replicas has to respond, so the application can still function properly with one or two nodes in a down state. At a consistency level of QUORUM, two of the three replicas must respond, so only one node can be down.
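The relationship between RF, consistency level, and tolerable failures can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not Cassandra code: the function names are hypothetical, only four consistency levels are modeled, and the quorum is assumed to be a simple majority of the replicas (RF divided by two, rounded down, plus one).

```python
def replicas_required(consistency_level: str, rf: int) -> int:
    """Replica responses needed for a request at the given consistency level."""
    level = consistency_level.upper()
    if level == "ONE":
        return 1
    if level == "TWO":
        return 2
    if level == "QUORUM":
        return rf // 2 + 1  # simple majority of the replicas
    if level == "ALL":
        return rf
    raise ValueError(f"unsupported consistency level: {consistency_level}")


def tolerable_replica_failures(consistency_level: str, rf: int) -> int:
    """Replicas of a row that can be down while requests for it still succeed."""
    return max(rf - replicas_required(consistency_level, rf), 0)


# Hypothetical helper names; RF of three, as in the example above.
for level in ("ONE", "QUORUM", "ALL"):
    print(level, tolerable_replica_failures(level, rf=3))
```

With an RF of three, the sketch reports two tolerable failures at ONE, one at QUORUM, and none at ALL.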

Distributed

Cassandra is known as a distributed database. A Cassandra cluster is a collection of nodes (individual instances running Cassandra) all working together to serve the same dataset. Nodes can also be grouped together into logical data centers. This is useful for providing data locality for an application or service layer, as well as for working with Cassandra instances that have been deployed in different regions of a public cloud.

Cassandra clusters can scale to suit both expanding disk footprint and higher operational throughput. Essentially, this means that as nodes are added, each node becomes responsible for a smaller percentage of the total data size. If the 500 GB disks of a six-node cluster (RF of three) start to reach their maximum capacity, then adding three more nodes (for a total of nine) accomplishes the following:

  • Brings the total disk available to the cluster up from 3 TB to 4.5 TB
  • Reduces the percentage of data that each node is responsible for from 50% down to 33%

Additionally, let's assume that before the expansion of the cluster (from the prior example), the cluster was capable of supporting 5,000 operations per second. Cassandra's operational throughput scales linearly with the number of nodes. After increasing the cluster from six nodes to nine (a factor of 1.5), the cluster should then be expected to support 7,500 operations per second.
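The figures in this example can be reproduced with some simple arithmetic. The following Python sketch uses hypothetical function names and makes two simplifying assumptions: data is spread evenly across nodes, and throughput scales in exact proportion to node count.

```python
def total_disk_tb(nodes: int, disk_gb_per_node: int) -> float:
    """Raw disk available to the whole cluster, in TB (1 TB = 1,000 GB)."""
    return nodes * disk_gb_per_node / 1000


def ownership_percent(nodes: int, rf: int) -> float:
    """Share of the total dataset held by each node, assuming an even spread."""
    return min(rf / nodes, 1.0) * 100


def scaled_throughput(base_ops: float, base_nodes: int, new_nodes: int) -> float:
    """Expected operations per second after resizing, assuming linear scaling."""
    return base_ops * new_nodes / base_nodes


# Six nodes versus nine nodes, 500 GB disks, RF of three.
for nodes in (6, 9):
    print(
        nodes,
        total_disk_tb(nodes, 500),
        round(ownership_percent(nodes, rf=3), 1),
        scaled_throughput(5000, base_nodes=6, new_nodes=nodes),
    )
```

In practice, uneven partition sizes and hot partitions mean that real clusters only approximate these numbers.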

Partitioned row store

In Cassandra, rows of data are distributed across the cluster based on the hashed value of their partition key. This hashed value is called a token. Each node in the cluster is assigned multiple token ranges, and rows are stored on the nodes that are responsible for their tokens.
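To make the idea of tokens and token ranges concrete, here is a minimal Python sketch of a token ring. It is an illustration under stated assumptions rather than Cassandra's implementation: the node names and helper functions are hypothetical, node tokens are derived from a hash of the node name instead of being allocated by the cluster, and MD5 truncated to 64 bits stands in for the Murmur3 hash used by Cassandra's default partitioner (which is not in the Python standard library).

```python
import bisect
import hashlib

NODES = ["node1", "node2", "node3"]  # hypothetical node names


def token_for(partition_key: str) -> int:
    """Hash a partition key to a signed 64-bit token (MD5 as a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def build_ring(nodes, tokens_per_node=4):
    """Give every node several tokens and return the ring sorted by token."""
    ring = [
        (token_for(f"{node}-{i}"), node)
        for node in nodes
        for i in range(tokens_per_node)
    ]
    ring.sort()
    return ring


def owner_of(ring, partition_key: str) -> str:
    """Find the node whose token range contains the key's token."""
    tokens = [token for token, _ in ring]
    # A node owns the range ending at its token, so take the first ring
    # token >= the key's token, wrapping around past the largest token.
    index = bisect.bisect_left(tokens, token_for(partition_key)) % len(ring)
    return ring[index][1]


ring = build_ring(NODES)
for key in ("user-1", "user-2", "user-3"):
    print(key, token_for(key), owner_of(ring, key))
```

Because the token depends only on the partition key, every lookup for the same key resolves to the same node, which is what lets any node route a request without a central directory.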

Each keyspace (collection of tables) can be assigned an RF. The RF designates how many copies of each row should be stored in each data center. If a keyspace has an RF of three, then each node is assigned primary, secondary, and tertiary token ranges. As data is written, it is written to all of the nodes that are responsible for its token.
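The way an RF of three maps a token onto three nodes can be sketched as a walk around the ring. The Python example below is a simplified model that assumes a single data center, one hard-coded token per node, and placement on the next distinct nodes clockwise (similar in spirit to Cassandra's SimpleStrategy). The ring contents and the `replicas_for` function are hypothetical.

```python
import bisect

# Hypothetical six-node ring: (token, node), sorted by token.
RING = [
    (-60, "node-a"),
    (-20, "node-b"),
    (0, "node-c"),
    (30, "node-d"),
    (70, "node-e"),
    (100, "node-f"),
]


def replicas_for(token: int, ring, rf: int):
    """Return the nodes that store a row with this token, primary first."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token)  # first ring token >= row token
    replicas = []
    for offset in range(len(ring)):
        node = ring[(start + offset) % len(ring)][1]  # wrap around the ring
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == rf:
            break
    return replicas


print(replicas_for(10, RING, rf=3))
print(replicas_for(90, RING, rf=3))
```

A token of 10 falls in the range owned by node-d, so node-d holds the primary copy while node-e and node-f hold the secondary and tertiary copies. A token of 90 belongs to node-f, and the walk wraps around to node-a and node-b.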