Book Image

Mastering Apache Cassandra - Second Edition

Book Image

Mastering Apache Cassandra - Second Edition

Overview of this book

Table of Contents (15 chapters)
Mastering Apache Cassandra Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Introduction to Cassandra


Quoting from Wikipedia:

"Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients."

Let's try to understand in detail what it means.

A distributed database

In computing, distributed means splitting data or tasks across multiple machines. In the context of Cassandra, it means that the data is distributed across multiple machines. It means that no single node (a machine in a cluster is usually called a node) holds all the data, but just a chunk of it. It means that you are not limited by the storage and processing capabilities of a single machine. If the data gets larger, add more machines. If you need more parallelism (ability to access data in parallel/concurrently), add more machines. This means that a node going down does not mean that all the data is lost (we will cover this issue soon).

If a distributed mechanism is well designed, it will scale with a number of nodes. Cassandra is one of the best examples of such a system. It scales almost linearly, with regard to performance, when we add new nodes. This means that Cassandra can handle the behemoth of data without wincing.

Note

Check out an excellent paper on the NoSQL database comparison titled, Solving Big Data Challenges for Enterprise Application Performance Management at http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf.

High availability

We will discuss availability in the next chapter. For now, assume availability is the probability that we query and the system just works. A high-availability system is one that is ready to serve any request at any time. High availability is usually achieved by adding redundancies. So, if one part fails, the other part of the system can serve the request. To a client, it seems as if everything works fine.

Cassandra is a robust software. Nodes joining and leaving are automatically taken care of. With proper settings, Cassandra can be made failure-resistant. This means that if some of the servers fail, the data loss will be zero. So, you can just deploy Cassandra over cheap commodity hardware or a cloud environment, where hardware or infrastructure failures may occur.

Replication

Continuing from the previous two points, Cassandra has a pretty powerful replication mechanism (we will see more details in the next chapter). Cassandra treats every node in the same manner. Data need not be written on a specific server (master), and you need not wait until the data is written to all the nodes that replicate this data (slaves). So, there is no master or slave in Cassandra, and replication happens asynchronously. This means that the client can be returned with success as a response as soon as the data is written on at least one server. We will see how we can tweak these settings to ensure the number of servers we want to have data written on before the client returns.

From this, we can derive that when there is no master or slave, we can write to any node for any operation. Since we have the ability to choose how many nodes to read from or write to, we can tweak it to achieve very low latency (read or write from one server).

Multiple data centers

Expanding from a single machine to a single data center cluster or multiple data centers is very simple compared to traditional databases where you need to make a plethora of configuration changes and watch replication. If you are planning to shard, it becomes a developer's nightmare. We will see later in this book that we can use this data center setting to make a real-time replicating system across data centers. We can use each data center to perform different tasks without overloading the other data centers. This is a powerful support when you do not have to worry whether users in Japan with a data center in Tokyo and users in the US with a data center in Virginia, are in sync or not.

These are just broad strokes of Cassandra's capabilities. We will explore more in the upcoming chapters. This chapter is about getting excited learning about Cassandra.