Book Image

Mastering MongoDB 3.x

By : Alex Giamas
Book Image

Mastering MongoDB 3.x

By: Alex Giamas

Overview of this book

MongoDB has grown to become the de facto NoSQL database with millions of users—from small startups to Fortune 500 companies. Addressing the limitations of SQL schema-based databases, MongoDB pioneered a shift of focus for DevOps and offered sharding and replication maintainable by DevOps teams. The book is based on MongoDB 3.x and covers topics ranging from database querying using the shell, built in drivers, and popular ODM mappers to more advanced topics such as sharding, high availability, and integration with big data sources. You will get an overview of MongoDB and how to play to its strengths, with relevant use cases. After that, you will learn how to query MongoDB effectively and make use of indexes as much as possible. The next part deals with the administration of MongoDB installations on-premise or in the cloud. We deal with database internals in the next section, explaining storage systems and how they can affect performance. The last section of this book deals with replication and MongoDB scaling, along with integration with heterogeneous data sources. By the end this book, you will be equipped with all the required industry skills and knowledge to become a certified MongoDB developer and administrator.
Table of Contents (13 chapters)

SQL and NoSQL evolution

Structured Query Language existed even before the WWW. Dr. EF Codd originally published the paper A Relational Model of Data for Large Shared Data Banks, in June 1970, in the Association of Computer Machinery (ACM) journal, Communications of the ACM. SQL was initially developed at IBM by Chamberlin and Boyce in 1974. Relational Software (now Oracle Corporation) was the first to develop a commercially available implementation of SQL, targeted at United States governmental agencies.

The first American National Standards Institute (ANSI) SQL standard came out in 1986 and since then there have been eight revisions with the most recent being published in 2016 (SQL:2016).

SQL was not particularly popular at the start of the WWW. Static content could just be hard coded into the HTML page without much fuss. However, as functionality of websites grew, webmasters wanted to generate web page content driven by offline data sources to generate content that could change over time without redeploying code.

Common Gateway Interface (CGI) scripts in Perl or Unix shell were driving early database driven websites in Web 1.0. With Web 2.0, the web evolved from directly injecting SQL results into the browser to using two- and three-tier architecture that separated views from business and model logic, allowing for SQL queries to be modular and isolated from the rest of a web application.

Not only SQL (NoSQL) on the other hand is much more modern and supervenes web evolution, rising at the same time as Web 2.0 technologies. The term was first coined by Carlo Strozzi in 1998 for his open source database that was not following the SQL standard but was still relational.

This is not what we currently expect from a NoSQL database. Johan Oskarsson, a developer at Last.fm at the time, reintroduced the term in early 2009 to group a set of distributed, non-relational data stores that were being developed. Many of them were based on Google's Bigtable and MapReduce papers or Amazon's Dynamo highly available key-value based storage system.

NoSQL foundations grew upon relaxed ACID (atomicity, consistency, isolation, durability) guarantees in favor of performance, scalability, flexibility and reduced complexity. Most NoSQL databases have gone one way or another in providing as many of the previously mentioned qualities as possible, even offering tunable guarantees to the developer.

Timeline of SQL and NoSQL evolution

MongoDB evolution

10gen started developing a cloud computing stack in 2007 and soon realized that the most important innovation was centered around the document oriented database that they built to power it, MongoDB. MongoDB was initially released on August 27th, 2009.

Version 1 of MongoDB was pretty basic in terms of features, authorization, and ACID guarantees and made up for these shortcomings with performance and flexibility.

In the following sections, we can see the major features along with the version number with which they were introduced.

Major feature set for versions 1.0 and 1.2

  • Document-based model
  • Global lock (process level)
  • Indexes on collections
  • CRUD operations on documents
  • No authentication (authentication was handled at the server level)
  • Master/slave replication
  • MapReduce (introduced in v1.2)
  • Stored JavaScript functions (introduced in v1.2)

Version 2

  • Background index creation (since v.1.4)
  • Sharding (since v.1.6)
  • More query operators (since v.1.6)
  • Journaling (since v.1.8)
  • Sparse and covered indexes (since v.1.8)
  • Compact command to reduce disk usage
  • Memory usage more efficient
  • Concurrency improvements
  • Index performance enhancements
  • Replica sets are now more configurable and data center aware
  • MapReduce improvements
  • Authentication (since 2.0 for sharding and most database commands)
  • Geospatial features introduced

Version 3

  • Aggregation framework (since v.2.2) and enhancements (since v.2.6)
  • TTL collections (since v.2.2)
  • Concurrency improvements among which DB level locking (since v.2.2)
  • Text search (since v.2.4) and integration (since v.2.6)
  • Hashed index (since v.2.4)
  • Security enhancements, role based access (since v.2.4)
  • V8 JavaScript engine instead of SpiderMonkey (since v.2.4)
  • Query engine improvements (since v.2.6)
  • Pluggable storage engine API
  • WiredTiger storage engine introduced, with document level locking while previous storage engine (now called MMAPv1) supports collection level locking

Version 3+

  • Replication and sharding enhancements (since v.3.2)
  • Document validation (since v.3.2)
  • Aggregation framework enhanced operations (since v.3.2)
  • Multiple storage engines (since v.3.2, only in Enterprise Edition)

MongoDB evolution diagram

As one can observe, version 1 was pretty basic, whereas version 2 introduced most of the features present in the current version such as sharding, usable and special indexes, geospatial features, and memory and concurrency improvements.

On the way from version 2 to version 3, the aggregation framework was introduced, mainly as a supplement to the ageing (and never up to par with dedicated frameworks like Hadoop) MapReduce framework. Then, adding text search and slowly but surely improving performance, stability, and security to adapt to the increasing enterprise load of customers using MongoDB.

With WiredTiger's introduction in version 3, locking became much less of an issue for MongoDB as it was brought down from process (global lock) to document level, almost the most granular level possible.

At its current state, MongoDB is a database that can handle loads ranging from startup MVPs and POCs to enterprise applications with hundreds of servers.

MongoDB for SQL developers

MongoDB was developed in the Web 2.0 era. By then, most developers had been using SQL or Object-relational mapping (ORM) tools from their language of choice to access RDBMS data. As such, these developers needed an easy way to get acquainted with MongoDB from their relational background.

Thankfully, there have been several attempts at SQL to MongoDB cheat sheets that explain MongoDB terminology in SQL terms.

On a higher level there are:

  • Databases, indexes just like in SQL databases
  • Collections (SQL tables)
  • Documents (SQL rows)
  • Fields (SQL columns)
  • Embedded and linked documents (SQL Joins)

Some more examples of common operations:

SQL

MongoDB

Database

Database

Table

Collection

Index

Index

Row

Document

Column

Field

Joins

Embed in document or link via DBRef

CREATE TABLE employee (name VARCHAR(100))

db.createCollection("employee")

INSERT INTO employees VALUES (Alex, 36)

db.employees.insert({name: "Alex", age: 36})

SELECT * FROM employees

db.employees.find()

SELECT * FROM employees LIMIT 1

db.employees.findOne()

SELECT DISTINCT name FROM employees

db.employees.distinct("name")

UPDATE employees SET age = 37 WHERE name = 'Alex'

db.employees.update({name: "Alex"}, {$set: {age: 37}}, {multi: true})

DELETE FROM employees WHERE name = 'Alex'

db.employees.remove({name: "Alex"})

CREATE INDEX ON employees (name ASC)

db.employees.ensureIndex({name: 1})

MongoDB for NoSQL developers

As MongoDB has grown from being a niche database solution to the Swiss Army knife of NoSQL technologies, more developers are coming to it from a NoSQL background as well.

Setting the SQL to NoSQL differences aside, users from columnar type databases face the most challenges. Cassandra and HBase being the most popular column oriented database management systems, we will examine the differences and how a developer can migrate a system to MongoDB.

  • Flexibility: MongoDB's notion of documents that can contain sub-documents nested in complex hierarchies is really expressive and flexible. This is similar to the comparison between MongoDB and SQL, with the added benefit that MongoDB can map easier to plain old objects from any programming language, allowing for easy deployment and maintenance.
  • Flexible query model: A user can selectively index some parts of each document, query based on attribute values, regular expressions or ranges, and have as many properties per object as needed by the application layer. Primary, secondary indexes as well as special types of indexes like sparse ones can help greatly with query efficiency. Using a JavaScript shell with MapReduce makes it really easy for most developers and many data analysts to quickly take a look into data and get valuable insights.
  • Native aggregation: The aggregation framework provides an ETL pipeline for users to extract and transform data from MongoDB and either load them in a new format or export it from MongoDB to other data sources. This can also help data analysts and scientists get the slice of data they need performing data wrangling along the way.
  • Schemaless model: This is a result of MongoDB's design philosophy to give applications the power and responsibility to interpret different properties found in a collection's documents. In contrast to Cassandra's or HBase's schema based approach, in MongoDB a developer can store and process dynamically generated attributes.