Learning Apache Cassandra - Second Edition

Overview of this book

Cassandra is a distributed database that stands out thanks to its robust feature set and intuitive interface, while providing the high availability and scalability of a distributed data store. This book will introduce you to the rich feature set offered by Cassandra, and empower you to create and manage a highly scalable, performant, and fault-tolerant database layer. The book starts by explaining the new features implemented in Cassandra 3.x and gets you set up with Cassandra. Then you’ll walk through data modeling in Cassandra and the rich feature set available to design a flexible schema. Next you’ll learn to create tables with composite partition keys, collections, and user-defined types, and get to know different methods to avoid denormalization of data. You will then proceed to create user-defined functions and aggregates in Cassandra. Then, you will set up a multi-node cluster and see how the dynamics of Cassandra change with it. Finally, you will implement some application-level optimizations using a Java client. By the end of this book, you'll be fully equipped to build powerful, scalable Cassandra database layers for your applications.
Table of Contents (14 chapters)

Data distribution in Cassandra


In a traditional relational database such as MySQL or PostgreSQL, the entire contents of the database reside on a single machine. At a certain scale, the hardware capacity of the server running the database becomes a constraint: simply migrating to more powerful hardware will lead to diminishing returns.

Let's imagine ourselves in this scenario, where we have an application running on a single-machine database that has reached the limits of its capacity to scale vertically. In that case, we'll want to split the data between multiple machines, a process known as sharding or federation. Assuming we want to stick with the same underlying tool, we'll end up with multiple database instances, each of which holds a subset of our total data. Crucially, in this scenario, the different database instances have no knowledge of each other; as far as each instance is concerned, it's simply a standalone database containing a standalone dataset.
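
The sharded setup described above can be illustrated with a small sketch. It is only an assumption-laden toy, not how any particular database or driver does it: plain Python dictionaries stand in for the standalone database instances, the names `ShardedStore` and `shard_for` are hypothetical, and hash-based routing is one possible way to assign rows to instances (the text does not prescribe one).

```python
import hashlib


class ShardedStore:
    """Toy model of an application that splits data across standalone databases."""

    def __init__(self, shard_count):
        # Each dict stands in for one independent database instance.
        # The instances share nothing and know nothing about each other.
        self.shards = [{} for _ in range(shard_count)]

    def shard_for(self, key):
        # Assumption: route by a stable hash of the key, so the same key
        # always maps to the same instance across runs and processes.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, value):
        # The routing decision is made here, outside any database instance.
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self.shard_for(key)].get(key)


store = ShardedStore(shard_count=3)
for username in ["alice", "bob", "carol", "dave"]:
    store.put(username, {"username": username})

print(store.get("alice"))
print([len(shard) for shard in store.shards])
```

Every row lands in exactly one instance, and each instance sees only its own subset. All of the routing logic sits in `ShardedStore`, none of it in the instances themselves.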

It's up to our application to...