Getting Started with RethinkDB

By: Gianluca Tiepolo

Overview of this book

RethinkDB is a high-performance document-oriented database with a unique set of features. This increasingly popular NoSQL database is used to develop real-time web applications and, together with Node.js, makes it easy to deploy them to the cloud. Getting Started with RethinkDB is designed to get you working with RethinkDB as quickly as possible. Starting with the installation and configuration process, you will learn how to import data into the database and run simple queries using the intuitive ReQL query language. After successfully running a few simple queries, you will be introduced to other topics such as clustering and sharding. You will get to know how to set up a cluster of RethinkDB nodes and spread database load across multiple machines. We will then move on to advanced queries and optimization techniques. You will discover how to work with RethinkDB from a Node.js environment and find out all about deployment techniques. Finally, we’ll finish by working on a fully fledged example that uses the Node.js framework and advanced features such as Changefeeds to develop a real-time web application.

Rethinking the database


Traditional database systems have existed for many years, and they all share a familiar structure and common methods for inserting and querying information; however, the relatively recent rise and spread of NoSQL databases has given developers an increasingly wide choice of data stores.

Although new scalability capabilities have certainly revolutionized the performance that these databases can deliver, most NoSQL systems still rely on a specific structure in which data is organized into records. Additionally, the access model of these systems has not changed to suit modern web applications: to get information in, you add a record of data, and to get information out, you query the database by polling for specific values or fields, as illustrated by the following diagram:

However, as technology evolves, it's often worth rethinking how we do tasks. RethinkDB takes a completely different approach to the database structure and methods of storing and retrieving information.

What follows is an overview of RethinkDB's main features along with accompanying considerations of how it differs from other NoSQL databases.

Changefeeds

RethinkDB is designed for building real-time applications. Using a feature called Changefeeds, developers can program the database to continuously push data updates to applications in real time. This fundamental architectural choice removes the need to poll the database continuously: the database itself serves data to applications as it changes, which reduces the time and complexity required to develop scalable web apps. The following diagram illustrates how this works:

The best part about how RethinkDB handles Changefeeds is that you don't need to restructure your queries to use them. A changefeed query looks identical to a normal query apart from the changes() command appended to it. Currently, the changes() command works on a large subset of queries and allows a client to receive updates on a table, a single document, or even the results of a specific query as they happen.
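To make the push model concrete, here is a minimal in-memory sketch in Python. It does not use the RethinkDB driver: ToyTable and its methods are hypothetical stand-ins, and only the old_val/new_val shape of each notification is borrowed from what RethinkDB changefeeds report. The point is simply that the client registers interest once and is then called on every write, instead of polling.

```python
from typing import Callable, Dict, List, Optional

Change = Dict[str, Optional[dict]]


class ToyTable:
    """Hypothetical in-memory table that pushes changes to subscribers."""

    def __init__(self) -> None:
        self._docs: Dict[str, dict] = {}
        self._listeners: List[Callable[[Change], None]] = []

    def changes(self, listener: Callable[[Change], None]) -> None:
        # Register a callback once instead of asking the client to poll.
        self._listeners.append(listener)

    def insert(self, key: str, doc: dict) -> None:
        old = self._docs.get(key)
        self._docs[key] = doc
        # Push an old/new pair to every subscriber as soon as the write lands.
        for listener in self._listeners:
            listener({"old_val": old, "new_val": doc})


received: List[Change] = []
scores = ToyTable()
scores.changes(received.append)        # subscribe
scores.insert("alex", {"score": 10})   # first write: no previous value
scores.insert("alex", {"score": 25})   # second write: previous value is reported
```

After the two writes, received holds two notifications in write order, and the application never had to ask the table whether anything changed.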

Horizontal scalability

RethinkDB is a very good solution when flexibility and rapid iteration are of primary importance. Its other big strength is its ability to scale horizontally with very little effort and few changes to how you interact with the database. Horizontal scalability consists of expanding the storage capacity and processing power of a database by adding more servers to a cluster. A single database node is limited by the capacity of the server that hosts it, so if the dataset exceeds the available capacity, data must be sharded among multiple database instances that are connected to each other.

Thankfully, the RethinkDB team set out to make scaling really easy for developers, who, wherever possible, should not have to worry about these issues at all. With RethinkDB, you can set up a cluster, create table-level shards, and run cross-shard joins and aggregations in less than five minutes using the web interface.
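As a rough illustration of what sharding a table means, the following Python sketch splits documents across shards by primary-key range. It is a toy model under stated assumptions, not RethinkDB's implementation: ToyShardedTable, the split points, and the in-process "shards" are all hypothetical, and a real cluster would place each shard on a different server.

```python
from bisect import bisect_right
from typing import Dict, List


class ToyShardedTable:
    """Hypothetical table that splits documents across shards by key range."""

    def __init__(self, split_points: List[str]) -> None:
        # N split points define N + 1 contiguous key ranges, one per shard.
        self._split_points = sorted(split_points)
        self.shards: List[Dict[str, dict]] = [
            {} for _ in range(len(self._split_points) + 1)
        ]

    def shard_for(self, key: str) -> int:
        # Index of the key range (and therefore the shard) that owns this key.
        return bisect_right(self._split_points, key)

    def insert(self, key: str, doc: dict) -> None:
        self.shards[self.shard_for(key)][key] = doc

    def get(self, key: str) -> dict:
        # A point read only has to touch the shard that owns the key.
        return self.shards[self.shard_for(key)][key]

    def count(self) -> int:
        # An aggregation fans out to every shard and combines partial results.
        return sum(len(shard) for shard in self.shards)


users = ToyShardedTable(split_points=["h", "q"])
for name in ["alex", "maria", "zoe", "bob"]:
    users.insert(name, {"name": name})
```

The caller only ever talks to the table; which shard holds a given document is an internal detail, which is the property that lets a database add servers without changing how applications query it.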

Powerful query language

The RethinkDB query language, ReQL, is a data-driven, abstract, advanced language that embeds itself in the programming language that you use to build your applications; in ReQL, queries are constructed simply by making function calls in that language. ReQL is designed to be pragmatic and works like a fluent API: a set of functions that you can chain together to compose queries. It supports advanced queries, including massively parallelized distributed computation. All queries are automatically parallelized on the database server and, whenever possible, query execution is split across multiple cores and datacenters. RethinkDB automatically breaks large queries into stages, executes each stage in parallel, and combines the intermediate data to return a complete query result.
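The fluent style described above can be sketched in a few lines of Python. ToyQuery is a hypothetical class written for this illustration; the method names filter, order_by, and pluck are borrowed from ReQL, but unlike a real driver, which builds a query and sends it to the server when run() is called, this toy version simply evaluates each step eagerly on a local list.

```python
from typing import Callable, Iterable, List


class ToyQuery:
    """Hypothetical chainable query; every call returns a new query object."""

    def __init__(self, docs: Iterable[dict]) -> None:
        self._docs = list(docs)

    def filter(self, predicate: Callable[[dict], bool]) -> "ToyQuery":
        return ToyQuery(d for d in self._docs if predicate(d))

    def order_by(self, field: str) -> "ToyQuery":
        return ToyQuery(sorted(self._docs, key=lambda d: d[field]))

    def pluck(self, *fields: str) -> "ToyQuery":
        # Keep only the named fields of each document.
        return ToyQuery({f: d[f] for f in fields} for d in self._docs)

    def run(self) -> List[dict]:
        return self._docs


people = [
    {"firstName": "Kim", "yearOfBirth": 1998},
    {"firstName": "Sam", "yearOfBirth": 1985},
    {"firstName": "Alex", "yearOfBirth": 1991},
]

# Each step is an ordinary method call, so the query reads top to bottom.
result = (
    ToyQuery(people)
    .filter(lambda d: d["yearOfBirth"] > 1990)
    .order_by("firstName")
    .pluck("firstName")
    .run()
)
```

Because every method returns a new query object, steps can be added, removed, or reused without any string concatenation, which is the practical benefit of composing queries from function calls.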

Tip

Official RethinkDB client drivers are available for JavaScript, Python and Ruby; however, support for other programming languages is available through community-supported drivers.

Developer-oriented

RethinkDB is different by design. It aims to be both developer-friendly and operations-oriented, combining an easy-to-use query language with simple controls for operating at scale, while remaining highly available and extremely scalable.

Since its first release, RethinkDB has gained a large, vibrant developer community more quickly than almost any other database; in fact, at the time of writing, RethinkDB is the second most popular database on GitHub and is becoming the database of choice for many companies, big and small, with hundreds of technology start-ups already using it in production.

Document-oriented

One of the reasons behind RethinkDB's popularity among developers is its data model. JSON has become the de facto standard for data interchange in modern web applications, and a persistence layer that naturally stores, queries, and manages JSON makes life easier for developers. RethinkDB is a document database built from the ground up to take advantage of JSON's feature set. Working with objects in a database can be troublesome at times because of data mapping and impedance mismatch issues. Document-oriented databases solve these issues by replacing the concept of a row with a more flexible model called the document, and documents are objects. Programmers who are used to working with objects will therefore find storing and querying such data in RethinkDB much more familiar. If you've never worked with a document before, consider the following example that represents a person using JSON:

{
  "firstName": "Alex",
  "lastName": "Jones",
  "yearOfBirth": 1991,
  "phoneNumbers": {
    "home": "02-345678",
    "mobile": "345-12345678"
  },
  "interests": [
    "programming",
    "football",
    "chess"
  ]
}

As you can see from the preceding example, a document always begins and ends with curly braces, keys and values are separated by colons, and key/value pairs are separated by commas. The key is always a string. A typical JSON document lets you represent values as numbers, strings, Booleans, arrays, objects, and the null value; RethinkDB adds other data types that you can use to model your data, namely binary data and dates and times. Since version 1.15, RethinkDB also supports geospatial queries, so you can include geometry within your JSON documents.

By allowing embedded objects and arrays in JSON, the document-oriented approach used by RethinkDB lets you represent complex relationships with a single document. This fits naturally into the way in which web developers think and model their data.
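As a small illustration of why this model feels natural to programmers, the following Python sketch parses the person document from the earlier example with the standard json module and reads the embedded object and array directly. Nothing here is specific to RethinkDB; it only shows how a JSON document maps onto ordinary language-level objects.

```python
import json

raw_document = """
{
  "firstName": "Alex",
  "lastName": "Jones",
  "yearOfBirth": 1991,
  "phoneNumbers": {
    "home": "02-345678",
    "mobile": "345-12345678"
  },
  "interests": ["programming", "football", "chess"]
}
"""

# The JSON object becomes a dict, the embedded object a nested dict,
# and the array a list: no mapping layer is needed.
person = json.loads(raw_document)

home_phone = person["phoneNumbers"]["home"]
first_interest = person["interests"][0]
full_name = f'{person["firstName"]} {person["lastName"]}'
```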

Lock-free architecture

Traditional relational and document databases, more often than not, use locks at various levels to ensure proper data consistency during concurrent access to the database. In a typical NoSQL database that uses locking, once a write request comes in, all readers are blocked until the write completes. This means that in use cases that require large volumes of writes, reads can end up queuing behind them, resulting in significant performance degradation.

RethinkDB solves this problem by implementing block-level Multi-Version Concurrency Control (MVCC)—a method commonly used by database management systems that provides concurrent access to the database without locking it. Whenever a write operation occurs while there is an ongoing read, the database takes a snapshot of the data block for each relevant shard and temporarily maintains different versions of the blocks in order to execute both read and write operations at the same time.

The main difference between MVCC and lock models is that in MVCC, locks acquired for reading data don't conflict with locks acquired for writing data, and so, reading never blocks writing and vice versa. The concurrency model used by RethinkDB ensures, for example, that you can run an hour-long MapReduce job without blocking the database.
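The idea of keeping several versions so that readers and writers do not wait for each other can be sketched as follows. This is a deliberately simplified Python model, not RethinkDB's storage engine: ToyVersionedStore is hypothetical, and it copies the whole dataset on each write, whereas the text describes versioning at the level of individual data blocks.

```python
from types import MappingProxyType
from typing import Mapping


class ToyVersionedStore:
    """Hypothetical store that keeps immutable versions of its data."""

    def __init__(self) -> None:
        self._current: Mapping[str, int] = MappingProxyType({})

    def snapshot(self) -> Mapping[str, int]:
        # A reader pins whichever version is current when it starts.
        return self._current

    def write(self, key: str, value: int) -> None:
        # A writer builds a new version instead of changing the old one
        # in place, so snapshots already handed out stay untouched.
        updated = dict(self._current)
        updated[key] = value
        self._current = MappingProxyType(updated)


store = ToyVersionedStore()
store.write("visits", 1)

long_read = store.snapshot()   # a long-running read starts here
store.write("visits", 2)       # a write arrives while the read is in progress

seen_by_long_read = long_read["visits"]
seen_by_new_read = store.snapshot()["visits"]
```

The long-running read keeps seeing the value from the moment it started, while any read that starts after the write sees the new value; neither side had to block the other.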

Immediate consistency

For distributed databases, consistency models are a topic of huge importance, and RethinkDB is no exception. A database is said to be consistent when a series of operations or transactions performed on it are applied in a consistent order. This means that if we insert some data into a table, it will immediately be available to any other client that wishes to read it. Likewise, if we read some data from the database, we want this data to be the most recently updated version. This is called immediate consistency and is a property of most traditional databases, such as MySQL.

Some databases, such as Cassandra, prioritize high availability and give up immediate consistency in favor of eventual consistency. In this case, if the network goes down, the database will still be able to accept reads and writes; however, applications built on top of these systems will have to deal with various complexities, such as conflict resolution and potentially out-of-date reads.

RethinkDB, on the other hand, always maintains strong data consistency as all reads and writes get routed to the primary database shard where queries are executed. This results in immediately consistent and conflict-free data, and all reads on the database are guaranteed to return the most recent data.
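The difference between reading from the primary and reading from a copy that has not caught up yet can be sketched with a toy model. Everything below is hypothetical: ToyReplicatedShard and its delayed replicate() step model an eventually consistent setup for contrast and say nothing about how RethinkDB actually replicates data.

```python
from typing import Dict, List, Optional, Tuple


class ToyReplicatedShard:
    """Hypothetical shard with one primary and one lagging replica."""

    def __init__(self) -> None:
        self.primary: Dict[str, int] = {}
        self.replica: Dict[str, int] = {}
        self._pending: List[Tuple[str, int]] = []

    def write(self, key: str, value: int) -> None:
        # Writes always land on the primary first.
        self.primary[key] = value
        self._pending.append((key, value))

    def replicate(self) -> None:
        # Apply queued writes to the replica at some later point.
        for key, value in self._pending:
            self.replica[key] = value
        self._pending.clear()

    def read(self, key: str, from_primary: bool = True) -> Optional[int]:
        source = self.primary if from_primary else self.replica
        return source.get(key)


shard = ToyReplicatedShard()
shard.write("balance", 100)

fresh = shard.read("balance")                       # routed to the primary
stale = shard.read("balance", from_primary=False)   # replica has not caught up
shard.replicate()
caught_up = shard.read("balance", from_primary=False)
```

Routing every read to the primary means a client sees the write as soon as it has been applied, which is the guarantee the text describes; the replica read shows the kind of out-of-date result an eventually consistent system may return.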

Tip

The CAP theorem by Eric Brewer states that a database can only have two of the following guarantees at the same time: consistency, availability, and tolerance of network partitions. In distributed systems such as RethinkDB, network partitioning is inevitable and must be tolerated, so essentially, what the theorem means is that a tradeoff has to be made between consistency and high availability.

Secondary indexes

Simply put, a secondary index is a data structure that speeds up the lookup of documents by an attribute other than their primary key, at the expense of write performance. This type of index is heavily used in web applications, as it is extremely common to need to retrieve documents efficiently by a field that is not the primary key. RethinkDB also supports compound indexes that are based on multiple fields and other indexes based on arbitrary expressions. Support for secondary indexes was added in version 1.5.
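The trade-off described above, faster lookups in exchange for extra work on every write, can be seen in a small Python sketch. ToyIndexedTable is a hypothetical in-memory structure, not the RethinkDB API; the method name get_all is borrowed from ReQL only to suggest the kind of lookup an index serves.

```python
from collections import defaultdict
from typing import Dict, List, Set


class ToyIndexedTable:
    """Hypothetical table with one secondary index on a chosen field."""

    def __init__(self, indexed_field: str) -> None:
        self._field = indexed_field
        self._docs: Dict[str, dict] = {}
        # Maps an indexed value to the primary keys of matching documents.
        self._index: Dict[str, Set[str]] = defaultdict(set)

    def insert(self, key: str, doc: dict) -> None:
        old = self._docs.get(key)
        if old is not None:
            # Overwriting a document means removing its old index entry.
            self._index[old[self._field]].discard(key)
        self._docs[key] = doc
        # Extra bookkeeping on every write is the cost of the index.
        self._index[doc[self._field]].add(key)

    def get_all(self, value: str) -> List[dict]:
        # Lookup goes through the index instead of scanning every document.
        return [self._docs[k] for k in sorted(self._index.get(value, ()))]


users = ToyIndexedTable("city")
users.insert("u1", {"name": "Alex", "city": "Rome"})
users.insert("u2", {"name": "Sam", "city": "Milan"})
users.insert("u3", {"name": "Kim", "city": "Rome"})

romans = [doc["name"] for doc in users.get_all("Rome")]
```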

Distributed joins

Most relational databases allow us to perform queries that define explicit relationships between different pieces of data, often contained in multiple tables. These queries are called joins and are not supported by most NoSQL databases. The reason for this is that the need for joins is a function not of the data model but of how the data is accessed. If data is structured so that it matches the queries being executed, joins can be avoided. The drawback of this approach is that it requires you to structure your data in advance, and knowing beforehand how you will access your data often proves to be very tricky.

RethinkDB not only supports joins but automatically compiles them to distributed programs and executes them across the cluster without further intervention from the client. When you use join queries in RethinkDB, what happens is that you connect two sequences of data based on some type of equality; the query then gets routed to the appropriate nodes and the data is combined into a final result that is returned to the client.
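An equality join over sharded data can be sketched in Python as follows. The function toy_eq_join and the sample data are hypothetical, and the left/right shape of each result pair is modeled on what ReQL's eq_join command returns. For simplicity, the sketch probes every shard in turn, whereas a real cluster would route each lookup to the node that owns the key.

```python
from typing import Dict, List


def toy_eq_join(
    left_docs: List[dict],
    left_field: str,
    right_shards: List[Dict[str, dict]],
) -> List[dict]:
    """Pair each left document with the right document whose primary key
    equals left_doc[left_field]; unmatched left documents are dropped."""
    joined = []
    for left in left_docs:
        key = left[left_field]
        # Probe each shard for the key; stop at the first shard that has it.
        for shard in right_shards:
            right = shard.get(key)
            if right is not None:
                joined.append({"left": left, "right": right})
                break
    return joined


orders = [
    {"id": 1, "customer_id": "c1", "total": 30},
    {"id": 2, "customer_id": "c2", "total": 12},
    {"id": 3, "customer_id": "c9", "total": 5},   # no such customer
]

# The customers table is spread over two shards, keyed by primary key.
customer_shards = [
    {"c1": {"id": "c1", "name": "Alex"}},
    {"c2": {"id": "c2", "name": "Sam"}},
]

pairs = toy_eq_join(orders, "customer_id", customer_shards)
names = [pair["right"]["name"] for pair in pairs]
```

The client receives one combined result and never needs to know which shard supplied which customer, which mirrors the "without further intervention from the client" behavior described above.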

Now that you know what RethinkDB is and you've got a comprehensive understanding of its powerful feature set, it's time to take a step forward and start using it. We'll start by downloading and installing the database.