Book Image

Getting Started with CockroachDB

By : Kishen Das Kondabagilu Rajanna
Book Image

Getting Started with CockroachDB

By: Kishen Das Kondabagilu Rajanna

Overview of this book

Getting Started with CockroachDB will introduce you to the inner workings of CockroachDB and help you to understand how it provides faster access to distributed data through a SQL interface. The book will also uncover how you can use the database to provide solutions where the data is highly available. Starting with CockroachDB's installation, setup, and configuration, this SQL book will familiarize you with the database architecture and database design principles. You'll then discover several options that CockroachDB provides to store multiple copies of your data to ensure fast data access. The book covers the internals of CockroachDB, how to deploy and manage it on the cloud, performance tuning to get the best out of CockroachDB, and how to scale data across continents and serve it locally. In addition to this, you'll get to grips with fault tolerance and auto-rebalancing, how indexes work, and the CockroachDB Admin UI. The book will guide you in building scalable cloud services on top of CockroachDB, covering administrative and security aspects and tips for troubleshooting, performance enhancements, and a brief guideline on migrating from traditional databases. By the end of this book, you'll have gained sufficient knowledge to manage your data on CockroachDB and interact with it from your application layer.
Table of Contents (17 chapters)
1
Section 1: Getting to Know CockroachDB
4
Section 2: Exploring the Important Features of CockroachDB
9
Section 3: Working with CockroachDB
Appendix: Bibliography and Additional Resources

Database concepts

In this section, we will learn about some of the core database concepts, including cardinality, database models, and various processing models.

Cardinality

Before we discuss database models, it is important to know about cardinality. Cardinality refers to the relationship between two entities or tables. The most popular ones include one-to-many, many-to-one, and many-to-many.

One-to-one relationship

In the case of a one-to-one relationship, a row or entry in one entity or table can be related to only one row in another entity or table. For example, in a Department of Motor Vehicles database, let's say there are two tables called License Info and Driver Info, as shown in the following diagram:

Figure 1.1 – An example of a one-to-one relationship

Figure 1.1 – An example of a one-to-one relationship

Here, Driver ID can only be assigned to one driver as it has to uniquely identify a driver. Also, a driver can only be assigned one Driver ID. So, here, any row in the License Info table will be associated with a specific row in the Driver Info table.

One-to-many relationship

In a one-to-many relationship, a single row from one entity or table can be associated with multiple rows in another entity or table.

For example, let's consider the Driver Info and City Info tables shown in the following diagram::

Figure 1.2 – An example of a one-to-many relationship

Figure 1.2 – An example of a one-to-many relationship

Here, for every row in City Info, there will be multiple rows in Driver Info, as there can be many drivers that live in a particular city.

Many-to-many relationship

In a many-to-many relationship, a single row in one entity or table can be associated with multiple rows in another entity or table and vice versa.

For example, let's consider two tables: Vehicle Ownership History, where we are maintaining the history of ownership of a given vehicle, and Driver Ownership History, where we are maintaining the history of vehicles owned by a given driver:

Figure 1.3 – An example of a many-to-many relationship

Figure 1.3 – An example of a many-to-many relationship

Here, a driver can own multiple vehicles and a vehicle can have multiple owners over time. So, a given row in the Vehicle Ownership History table can be associated with multiple rows in the Driver Ownership History table. Similarly, a given row in the Driver Ownership History table can be associated with multiple rows in the Vehicle Ownership History table.

Now, let's take a look at some of the most important database models.

Overview of database models

A database model determines how the data is stored, organized, and modified. Databases are typically implemented based on a specific data model. It is also possible to borrow concepts from multiple database models when you are designing a new database. The relational database model happens to be the most widely known and has been popularized by databases such as Oracle, IBM DB2, and MySQL.

Hierarchical database model

In the hierarchical database model, the data is organized in the form of a tree. There is a root at the first level and multiple children at the subsequent levels. Since a single parent can have multiple children, one-to-many relationships can easily be represented here. A child cannot have multiple parents, so this results in the advantage of not being able to model many-to-many relationships.

IBM's Information Management System (IMS) was the first database that implemented this data model.

The following diagram shows an example of a hierarchical database model:

Figure 1.4 – An example of a hierarchical database model

Figure 1.4 – An example of a hierarchical database model

Typically, the tree starts with a single root and the data is organized into this tree. Any node except the leaves can have multiple children, but a child can have only one parent.

Network model

The network model was developed as an enhancement of the hierarchical database model to accommodate many-to-many relationships. The network model relies on a graph structure to organize its data. So, there is no concept of a single root, and a child can have multiple parents and a parent can have multiple children. Integrated Data Store (IDS), Integrated Database Management Systems (IDMS), and Raima Database Manager (RDM) are some of the popular databases that use the network model.

As shown in the following diagram, there is no single root and a given child (for example, Object 2 can have multiple parents; that is, Object 1 and Object 3):

Figure 1.5 – An example of a network model

Figure 1.5 – An example of a network model

Relational model

Although the network model was an improvement over the hierarchical model, it was still a little restrictive when it came to representing data. In the relational model, any record can have a relationship with any other with the help of a common field. This drastically reduced the design's complexity and made it easier to independently add, update, and access records, without having to walk down the tree or traverse the graph. SQL was combined with the relational database model to provide a simple query interface to add and retrieve data.

All the popular traditional databases such as Oracle database, IBM DB2, MySQL, MariaDB, and Microsoft SQL Server implement relational data models.

Let's look at two tables called Employee and Employee Info:

Figure 1.6 – Employee tables showing the column names

Figure 1.6 – Employee tables showing the column names

Here, Employee ID is the common field or column between the Employee and Employee Info tables. The Employee table is responsible for ensuring that a given Employee ID is unique, while Employee Info is responsible for more detailed information about a given employee.

Object-relational model

The object-relational model, as the name suggests, combines the best of the relational and object data models. The concept of objects, classes, and inheritance are directly supported as first-class citizens as part of the database and in queries. SQL:1999, the fourth revision of SQL, introduced several features for embedding object concepts into the relational database. One of the main features was to create structured user-defined types with CREATE TYPE to define an object's structure.

Over time, relational databases have added more support for objects. There is a varying degree of support for object concepts in Oracle database, IBM DB2, PostgreSQL, and Microsoft SQL Server.

Given the scope of this book, we will not discuss the entity-relational model, object model, document model, star schema, snowflake schema, and many other less well-known models.

Now, let's look at how databases can be classified based on what kinds of workload they can be used for.

Processing models

Based on how you want to consume and process data, databases can be categorized into four different processing systems. Let's take a look.

Online transaction processing (OLTP)

OLTP systems support the concept of transactions. A transaction refers to the ability to atomically apply changes (insert, update, delete, and read) to a given system. One popular example is a bank, where withdrawing or depositing money to a given bank account must be done atomically to ensure data is not lost or incorrect. So, the main purpose here is to maintain data integrity and consistency. Also, these systems are generally suited for fast-running queries.

Online analytical processing (OLAP)

OLAP focuses mostly on running queries to analyze multi-dimensional data and to extract some intelligence or patterns from it. Typically, such systems support generating some sort of report that can be used for marketing, sales, financing, budgeting, management, and many more. Data mining and data analytics applications would typically have to have an OLAP system in some form. OLAP doesn't deal with transactions, and the emphasis is more on analyzing large amounts of data from different sources to extract business intelligence. Some databases also provide built-in support for MapReduce to run queries across a large set of data.

A data warehouse is a piece of software that's used for reporting and data analysis. Warehouses are typically developed for OLAP. It is also very common to retrieve the data from OLTP in batches or bulk, run it through an Extract, Load, and Transform (ELT) or Extract, Transform, and Load (ETL) data transformation pipeline, and store it in an OLAP system.

Online event processing (OLEP)

OLEP guarantees strong consistency without the traditional atomic commit protocols or distributed locking. OLEP also focuses on high performance, larger scales, and fault tolerance.

Hybrid transaction/analytical processing (HTAP)

As the name suggests, this system tries to provide the best of both transactions and analytical processing. Most of the NoSQL and NewSQL databases provide support for managing both transactional and analytical workloads. Vitess is a database clustering system that can be used to scale and shard MySQL instances. Vitess provides HTAP features on top of MySQL by allowing a given MySQL instance to be configured as master or read-only, where read-only can be used for analytical queries and MapReduce. It is possible to use CockroachDB as HTAP by propagating changes with the help of change data capture (CDC) in the OLTP cluster or primary cluster to a separate cluster, which is solely used for analytical processing.

Now, let's learn a bit about embedded and mobile databases, including why they exist and some of the most popular ones in this space.

Embedded and mobile databases

Embedded databases usually refer to databases that can be tightly integrated into an application, without needing separate hardware to support them. Also, they don't have to be managed separately. Some of the most popular embedded databases include SQLite, Berkeley DB from Oracle Corporation, and SQL Server Compact from Microsoft Corporation. Embedded databases are also very useful for testing purposes as they can be started within test suites.

Mobile database refers to the class of databases that work with very limited memory footprint and compute and can be deployed within a mobile device. They are typically used for storing user data for apps running on mobile devices. SQLite, SQL Server Compact, Oracle database Lite, Couchbase Lite, SQL Anywhere, SQL Server Express, and DB2 Everyplace belong to this category,

Database storage engines

A database storage engine is a component within a database management system that is responsible for Create, Read, Update, Delete (CRUD) operations and transferring data between disk and memory, without compromising data integrity. Some of the most popular ones include Apache Derby, HSQLDB, InfinityDB, LevelDB, RocksDB, and SQLite. CockroachDB initially started with RocksDB as its database engine, but from release 20.2 onward, Pebble will be the database engine by default. Pebble, as per Cockroach Labs, is a RocksDB-inspired and RocksDB-compatible key-value store focused on the needs of CockroachDB. RocksDB was implemented in C++, whereas Pebble was implemented in Golang. This makes it easier to manage and maintain as CockroachDB itself was written in Golang. This means that we only have to deal with one language now.