Learning Couchbase

By: Henry Potsangbam

Overview of this book

This book takes an end-to-end approach to development, from understanding NoSQL document design to implementing a full-fledged e-commerce application with Couchbase as the backend. It starts with the architecture of Couchbase to get you up and running, then takes you through designing NoSQL documents and implementing highly scalable applications using the Java API. You will be introduced to document design and the various ways to administer Couchbase, and then learn to store documents in buckets. Next, you will store, retrieve, and delete documents using the Java-based smart client, and retrieve documents with an SQL-like query language called N1QL. You will also learn how to write MapReduce-based views. Finally, you will configure XDCR for disaster recovery and implement an e-commerce application using Couchbase.
Table of Contents (12 chapters)

What is NoSQL and why do we need it?


It's always a challenge to introduce a new technology, especially when it changes fundamentals that have been taught for so long. NoSQL is one such technology. However, it is easy to comprehend once we understand the rationale behind it. So, before defining NoSQL, let's understand why it is needed.

We are all aware of and use Relational Database Management Systems (RDBMS). An RDBMS is a database management system based on the relational model invented by E. F. Codd, with features such as normalization, joins, and foreign keys (examples include MySQL, Oracle, and DB2). An RDBMS also provides transactions, locking mechanisms, and ACID properties. However, RDBMS has some limitations, predominantly in terms of scalability and readiness for schema changes.

Note

ACID stands for Atomicity, Consistency, Isolation, and Durability. These properties are essential for supporting transactions in any database system. To guarantee a meaningful and successful transaction, the system has to support all of them:

  • Atomicity: All operations in a transaction are performed as a single unit; either all of them take effect or none of them do

  • Consistency: The operations leave the data in a valid, consistent state at the end of the transaction

  • Isolation: No two transactions will interfere with each other

  • Durability: Once committed, the results of a transaction will survive system failures
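To make atomicity and consistency concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in RDBMS. The accounts table, the names in it, and the transfer helper are hypothetical and exist only for illustration. A transfer is two updates; if the second one violates the balance constraint, the first one is rolled back as well.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " name TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"  # consistency rule
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one unit; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Fails the CHECK constraint if src cannot cover the amount.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {name: balance} for all accounts."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


if __name__ == "__main__":
    db = make_db()
    print(transfer(db, "alice", "bob", 30), balances(db))
    print(transfer(db, "alice", "bob", 500), balances(db))
```

In the failed transfer, the credit to the destination account has already been applied when the debit is rejected, yet neither change is visible afterwards. That all-or-nothing behavior is what the atomicity property promises.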

To make this clearer, let's look at a scenario. Your organization has recently launched an e-commerce application and you are the technical architect. Everything has been going smoothly and everyone, including your boss, is happy with the outcome. However, after a couple of months, you start getting complaints from the business team that the application is not performing well. After some investigation, you realize that the consumer base has grown, and user traffic has grown with it. The application server and the infrastructure are not able to handle such an increase in traffic. So what will you do? Think about it. If you are like most other architects, your initial measures would be to scale the application tier by introducing multiple servers behind a load balancer, or to increase system resources such as RAM and CPU. After you take these steps, the application seems to show some improvement.

But after a couple of weeks comes the realization that the same improvement needs to be made at the database server too. So, what can be done? You have two options:

  • Vertical scaling

  • Horizontal scaling

The first is vertical scaling, wherein you increase the hardware resources of a server in terms of CPU and RAM. The second is horizontal scaling, wherein you increase the number of server nodes.

However, there is a challenge here: we can't scale the database server horizontally as easily as we do application servers. To scale database servers horizontally, we need a mechanism to distribute data across the servers, balance the load, and so on. The only easy option left is to increase the hardware resources. However, beyond a certain point, a physical server can't expand further due to limits on sockets, chips, and so on; for example, a server with four CPU sockets cannot be scaled up beyond four CPUs. Therefore, we need a way to scale out horizontally when we anticipate an increase in requests or load at the database layer. Such a situation is encountered in most content-driven, social networking, and e-commerce sites, where a large number of transactions take place within milliseconds.

Besides this, because business requirements keep changing, the database schema needs to change frequently, which is very common in agile development. It is difficult to incorporate such changes in an RDBMS. Sometimes you need to bring the application down to modify the schema, for example to add a column to a table. To address such issues, companies such as Facebook and Google started exploring alternatives to RDBMS for data storage that can scale out and handle schema changes seamlessly without any impact on business operations. These needs are the foundation of NoSQL.

So what is NoSQL?

NoSQL is a nonrelational database management system that is different from traditional relational database management systems in significant ways. It is designed for distributed data stores in which there are very large-scale data storage requirements (terabytes and petabytes of data). These types of data storage mechanisms may not require fixed schemas, avoid join operations, and typically scale horizontally.

The main feature of NoSQL is that it is schemaless: there is no fixed schema for storing data. Traditionally, there is also no join between two or more data records or documents, although nowadays most NoSQL systems have started providing join features. A NoSQL system distributes storage across a cluster and uses the computing resources, such as CPU and RAM, of all the nodes that are part of that cluster.
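As a rough illustration of what schemaless means in practice, the following sketch uses a plain Python dictionary as a stand-in for a document store. The store, the upsert and get helpers, and the product documents are all hypothetical and are not part of any Couchbase API. Two documents of the same kind carry different fields, and no schema change is needed to add them.

```python
import json

# Hypothetical in-memory stand-in for a document store: ID -> JSON text.
store = {}


def upsert(doc_id, doc):
    """Serialize the document to JSON and save it under its ID."""
    store[doc_id] = json.dumps(doc)


def get(doc_id):
    """Load a document back into a Python dictionary."""
    return json.loads(store[doc_id])


# The first product has only basic fields.
upsert("product::1", {"type": "product", "name": "Kettle", "price": 25.0})

# The second product adds a list and a nested object. Nothing like
# ALTER TABLE is required, and the first document is unaffected.
upsert(
    "product::2",
    {
        "type": "product",
        "name": "T-shirt",
        "price": 12.5,
        "sizes": ["S", "M", "L"],
        "attributes": {"color": "blue"},
    },
)

if __name__ == "__main__":
    for doc_id in sorted(store):
        print(doc_id, get(doc_id))
```

The trade-off is that the application, not the database, becomes responsible for coping with documents that lack a field, for instance by using a default value when reading.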

There are different types of NoSQL data stores. Let's try to cover the four main categories of NoSQL systems in brief:

  • Key-value store: A simple data storage system that uses a key to access values. Some examples are Redis, Riak, and DynamoDB.

    Use Case: Multiplayer online gaming to manage each player session.

  • Column family store: A sparse matrix system that uses a row and a column as keys. Some examples are Apache HBase and Apache Cassandra.

    Use Case: Stream massive write loads such as log analysis.

  • Graph store: This is used for relationship-intensive problems. An example is Neo4j.

    Use Case: Complicated graph problems, such as finding a route from one point to another.

  • Document store: This is used to store hierarchical data structures directly in the database, for example, MongoDB (10Gen), CouchDB, and Couchbase.

    Use Case: Storing structured product information.
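The four categories differ mainly in the shape of the data they store. The sketch below mimics each shape with ordinary Python data structures; the keys, values, and the shortest_path helper are made up for illustration and do not reflect how any of the products named above are implemented.

```python
from collections import deque

# Key-value: the store sees only an opaque value behind each key.
key_value = {"session::player42": '{"level": 7, "hp": 90}'}

# Column family: row key -> sparse set of columns; rows need not match.
column_family = {
    "log#0001": {"info:host": "web1", "info:status": "200"},
    "log#0002": {"info:host": "web2"},
}

# Graph: nodes and the relationships (edges) between them.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

# Document: a hierarchical structure stored and addressed as one unit.
document = {
    "product::1": {
        "name": "Kettle",
        "price": 25.0,
        "specs": {"capacity_l": 1.7},
    }
}


def shortest_path(edges, start, goal):
    """Breadth-first search; returns a list of nodes, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


if __name__ == "__main__":
    print(shortest_path(graph, "A", "D"))
    print(document["product::1"]["specs"]["capacity_l"])
```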

Why do we need NoSQL?

Electronic data is generated at rapid speed from a variety of sources, such as social media (Facebook, Google+), web server logs, and e-commerce websites (Amazon, eBay, and others). Personal user information, social graphs, geolocation data, user-generated content, and machine logging data are just a few examples of areas in which data has been increasing exponentially. Such data is termed big data: it comes in a variety of formats, is generated at high speed, and is large in volume. To derive information from big data, large amounts of data have to be processed, a task for which RDBMS was never designed. NoSQL databases evolved to handle such huge volumes of data efficiently.

Most NoSQL databases provide the following benefits:

  • A flexible data model: You don't need to worry about the schema. You can design your schema according to the needs of your application domain rather than storage demands.

  • Easy scalability: Since it's a distributed system, it can scale out horizontally without too many changes in the application. In some NoSQL systems, such as Couchbase, you can scale out with a few mouse clicks and rebalance the cluster very easily.

  • High availability: There are multiple servers, and data is replicated across nodes.

Since NoSQL is a distributed database system, you need to know a theorem called CAP to understand it better and to make better decisions about how the system should behave when something fails in a distributed environment. Let me explain the CAP theorem to you. It involves three important properties:

  • Consistency: What comes to your mind when we say consistency in a distributed system? When data is replicated to multiple nodes, a read from any node should return the same value or state as a read from any other replica. Generally speaking, the data in all nodes must be consistent with each other.

  • Availability: Systems should be able to serve client requests all the time, irrespective of the situation. In any distributed system, there are multiple nodes and it is ideal that the failure of a node should not stop the availability of the system. In short, the client should be able to perform read, write, and update operations at all times.

  • Partition tolerance: In any distributed system, depending on an algorithm such as hashing, data or records are partitioned across the nodes or the servers in the database ecosystem. Failures in replicating or transferring data between cluster nodes should not stop the system from responding to client requests. This feature of providing tolerance when there is a disturbance between nodes is called partition tolerance.

[Figure: Venn diagram depicting the CAP theorem]

So, you have understood what the CAP properties signify. The CAP theorem states that a distributed system can provide only two of these three properties at the same time. Depending on the use cases that the system is intended to address, a database system chooses which two of the three to provide.
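The trade-off can be seen in a toy model. The sketch below is a deliberately simplified, hypothetical two-replica cluster (the TinyCluster class and its methods are invented for this illustration); real systems use quorums, versioning, and conflict resolution. When the link between the replicas is cut, a CP-style cluster refuses writes to stay consistent, while an AP-style cluster keeps accepting writes and lets one replica go stale until the partition heals.

```python
class TinyCluster:
    """Two replicas, A and B. Clients write via A and may read from either."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = {"A": {}, "B": {}}
        self.partitioned = False  # True while A and B cannot talk

    def write(self, key, value):
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse rather than let replicas diverge.
            raise RuntimeError("write rejected: replica B is unreachable")
        self.replicas["A"][key] = value
        if not self.partitioned:
            self.replicas["B"][key] = value  # normal replication
        # In AP mode during a partition, B simply misses this update.

    def read(self, key, replica="A"):
        return self.replicas[replica].get(key)

    def heal(self):
        """End the partition and copy A's data to B (naive catch-up)."""
        self.partitioned = False
        self.replicas["B"].update(self.replicas["A"])


if __name__ == "__main__":
    ap = TinyCluster("AP")
    ap.write("stock", 10)
    ap.partitioned = True
    ap.write("stock", 9)                       # still accepted
    print("stale read:", ap.read("stock", "B"))
    ap.heal()
    print("after heal:", ap.read("stock", "B"))
```

The stale read followed by convergence after the partition heals is the behavior usually described as eventual consistency.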

There are a number of database systems available in the IT software market: RDBMS such as MySQL, and NoSQL systems such as MongoDB, Couchbase, Cassandra, and so on. How do you choose a database system that suits your business requirements? This theorem will help you decide. In our context, Couchbase has opted for AP, that is, availability and partition tolerance. So, if your application demands availability and partition tolerance more than consistency, you could opt for Couchbase. However, Couchbase provides a feature called eventual consistency, which will be discussed later in Chapter 6, Retrieving Documents without Keys Using Views. This feature enables the developer to decide the consistency level per operation.

Having understood what NoSQL is all about and why it's a buzzword nowadays, let's try to understand Couchbase, which is the subject of this book.

Couchbase Server is a persistent, distributed, document-based database that is part of the NoSQL database movement. It combines the capabilities of Apache CouchDB (document orientation and indexing) with those of Membase (an integrated RAM caching layer), enabling it to support very fast operations, such as create, store, update, and retrieve.

Couchbase Server is a leading NoSQL database project that focuses on distributed database technology and the surrounding ecosystems. It supports both key-value and document-oriented use cases. All components are available under the Apache License 2.0. It can be obtained as packaged software in two editions: an enterprise edition, which is rigorously tested and comes with support, and a community edition, which is open source and does not come with support.

Let's cover some of the main features of Couchbase Server here:

  • Schemaless: You don't need to worry about the database schema when changing your application objects. Records can have different structures; there is no fixed schema. The data model can change easily during rapid application development, without the need to perform expensive alter table operations in the database. In short, it provides a flexible data model with JSON support.

  • JSON-based document structure: Documents in Couchbase are natively stored as JSON. In a document-based NoSQL database, metadata such as data types is stored along with the data, and normally all related information is stored together as a single document. When you build an application, you don't need an explicit mapping between application objects and a database schema. Couchbase also provides an interface for creating, viewing, and editing documents.

  • Built-in clustering with replication: Couchbase provides built-in clustering, wherein all nodes in a cluster are equal. Furthermore, it provides data replication with auto-failover.

  • 365-day availability: A Couchbase cluster allows maintenance with almost zero downtime. You can remove a node from the cluster for maintenance and rejoin it to the cluster afterwards without suffering any application downtime. High availability of data in the cluster is provided by the replication mechanism.

  • Cache: By default, all documents are stored in RAM, which gives Couchbase a built-in managed cache. Adding nodes increases the RAM in the cluster's resource pool, providing easy scalability and consistently high performance.

  • Web UI: There are simple and easy-to-use admin APIs and UIs provided for smooth administration of the Couchbase cluster. It can also be used to monitor the cluster with ease.

  • Variety of SDKs: Software development kits for a variety of languages, such as Java, PHP, and so on, are provided to connect to Couchbase Server.

  • In a Couchbase cluster, there are a number of nodes and all nodes are equal; the cluster works on a peer-to-peer model. You can easily scale the cluster by adding a node to it. Since all the nodes are the same, there is no single point of failure. The cluster ensures that every node manages some active data and some replica data. The data is divided into chunks and distributed across the nodes automatically using auto-sharding, and hence the load is also uniformly distributed.

    Note

    Auto-sharding is a feature of NoSQL databases that spreads documents across the nodes in a cluster automatically. It remains transparent to the application that consumes the data from the cluster.
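The idea behind auto-sharding can be sketched in a few lines. The example below is a minimal illustration and not Couchbase's actual implementation: each key is hashed into one of a fixed number of logical partitions, and a separate map assigns partitions to nodes. The partition count, the choice of CRC32, the round-robin assignment, and all function and node names are assumptions made for this sketch.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 64  # illustrative value, chosen only for this sketch


def partition_for(key):
    """Hash a document key to a fixed logical partition."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def build_partition_map(nodes):
    """Assign partitions to nodes round-robin (assumed strategy)."""
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}


def node_for(key, partition_map):
    """Find the node that currently owns the key's partition."""
    return partition_map[partition_for(key)]


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3"]
    pmap = build_partition_map(nodes)
    print(node_for("user::1001", pmap), Counter(pmap.values()))

    # Adding a node changes only the partition map; a key's partition
    # number stays the same, so clients just need the updated map.
    pmap = build_partition_map(nodes + ["node4"])
    print(node_for("user::1001", pmap), Counter(pmap.values()))
```

Because the application only ever asks which node owns a key, the redistribution stays transparent to it. Note that the round-robin rebuild used here moves more partitions than necessary when a node is added; a production rebalance would aim to move as few as possible.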