Mastering Apache Cassandra

Cassandra has three containers, one within another. The outermost container is keyspace. You can think of keyspace as a database in the RDBMS land. Tables reside under keyspace. A table can be assumed as a relational database table, except it is more flexible. A table is basically a sorted map of sorted maps (refer to the following figure). Each table must have a primary key. This primary key is called row key or partition key. (We will later see that in a CQL table, the row key is the same as the primary key. If the primary key is made up of more than one column, the first component of this composite key is equivalent to the row key). Each partition is associated with a set of cells. Each cell has a name and a value. These cells may be thought of as columns in the traditional database system. The CQL engine interprets a group of cells with the same cell name prefix as a row. The following figure shows the Cassandra data model:

Note that if you come with Cassandra Thrift experience, it might be hard to view how Cassandra 1.2 and newer versions have changed terminology. Before CQL, the tables were called column families. A column family holds a group of rows, and rows are a sorted set of columns.

One obvious benefit of having such a flexible data storage mechanism is that you can have arbitrary number of cells with customized names and have a partition key store data as a list of tuples (a tuple is an ordered set; in this case, the tuple is a key-value pair). This comes handy when you have to store things such as time series, for example, if you want to use Cassandra to store your Facebook timeline or your Twitter feed or you want the partition key to be a sensor ID and each cell to represent a tuple with name as the timestamp when the data was created and value as the data sent by the sensor. Also, in a partition, cells are by default naturally ordered by the cell's name. So, in our sensor case, you will get data sorted for free. The other difference is, unlike RDBMS, Cassandra does not have relations. This means relational logic will be needed to be handled at the application level. This also means that we may want to denormalize the database because there is no join and to avoid looking up multiple tables by running multiple queries. Denormalization is a process of adding redundancy in data to achieve high read performance. For more information, visit http://en.wikipedia.org/wiki/Denormalization.

Partitions are distributed across the cluster, creating effective auto-sharding. Each server holds a range(s) of keys. So, if balanced, a cluster with more nodes will have less rows per node. All these concepts will be repeated in detail in the later chapters.

Note

Types of keys

In the context of Cassandra, you may find the concept of keys a bit confusing. There are five terms that you may encounter. Here is what they generally mean:

Primary key: This is the column or a group of columns that uniquely defines a row of the CQL table.
Composite key: This is a type of primary key that is made up of more than one column. Sometimes, the composite key is also referred to as the compound key.
Partition key: Cassandra's internal data representation is large rows with a unique key called row key. It uses these row key values to distribute data across cluster nodes. Since these row keys are used to partition data, they as called partition keys. When you define a table with a simple key, that key is the partition key. If you define a table with a composite key, the first term of that composite key works as the partition key. This means all the CQL rows with the same partition key lives on one machine.
Clustering key: This is the column that tells Cassandra how the data within a partition is ordered (or clustered). This essentially provides presorted retrieval if you know what order you want your data to be retrieve in.
Composite partition key: Optionally, CQL lets you define a composite partition key (the first part of a composite key). This key helps you distribute data across nodes if any part of the composite partition key differs. Let's take a look at the following example:

CREATE TABLE customers (
  id uuid,
  email text,
  PRIMARY KEY (id)
)

In the preceding example, id is the primary key and also the partition key. There is no clustering. It is a simple key. Let's add a twist to the primary key:

CREATE TABLE country_states (
  country text,
  state text,
  population int,
  PRIMARY KEY (country, state)
)

In the preceding example, we have a composite key that uses country and state to uniquely define a CQL row. The country column is the partition key, so all the rows with the same country node will belong to the same node/machine. The rows within a partition will be sorted by the state names. So, when you query for states in the US, you will encounter the row with California before the one with New York. What if I want to partition by composition? Let's take a look at the following example:

CREATE TABLE country_chiefs (
  country text,
  prez_name text,
  num_states int,
  capital text,
  ruling_year int,
  PRIMARY KEY ((country, prez_name), num_states, capital)
)

The preceding example has a composite key involving four columns: country, prez_name, num_states, and capital, with country and prez_name constituting composite partition key. This means the rows with the same country but different president will be in a different partition. Rows will be ordered by the number of states followed by the capital name.

Mastering Apache Cassandra - Second Edition

Mastering Apache Cassandra - Second Edition

Overview of this book

Related Content you might be interested in

Current Title:

Mastering Apache Cassandra - Second Edition

A brief introduction to a data model

Note