MongoDB Cookbook

By: Amol Nayak
Overview of this book

MongoDB is a high-performance and feature-rich NoSQL database that forms the backbone of numerous complex development systems. You will certainly find the MongoDB solution you are searching for in this book.

Starting with how to initialize the server in three different modes with various configurations, you will then learn a variety of skills including the basics of advanced query operations and features in MongoDB and monitoring and backup using MMS. From there, you can delve into recipes on cloud deployment, integration with Hadoop, and improving developer productivity. By the end of this book, you will have a clear idea about how to design, develop, and deploy MongoDB.

Read preference for querying


In the previous section, we saw what a write concern is and how it affects write operations (insert, update, and delete). In this section, we will see what a read preference is and how it affects query operations. How to use a read preference with specific programming language drivers will be discussed in separate recipes.

When connected to an individual node, query operations are allowed by default on a primary node; when connected to a secondary node, we need to explicitly state that it is ok to query secondary instances by executing rs.slaveOk() from the shell.

However, consider connecting to a Mongo replica set from an application. The application connects to the replica set as a whole and not to a single instance. Depending on its nature, the application might always want to connect to the primary; always to a secondary; prefer the primary but accept a secondary in some scenarios (or vice versa); or, finally, connect to the instance geographically closest to it (well, most of the time).

Thus, the read preference plays an important role when connected to a replica set and not to a single instance. In the following table, we will see the various read preferences that are available and what their behavior is in terms of querying a replica set. There are five of them and the names are self-explanatory:

Read preference

Description

primary

This is the default mode and it allows queries to be executed only on primary instances. It is the only mode that guarantees the most recent data, as all writes have to go through a primary instance. Read operations, however, will fail if no primary is available, which happens for a few moments when a primary goes down and continues until a new primary is chosen.

primaryPreferred

This is identical to the preceding primary read preference, except that during a failover, when no primary is available, it will read from a secondary; those are the times when it possibly does not read the most recent data.

secondary

This is exactly the opposite of the default primary read preference. This mode ensures that read operations never go to a primary; a secondary is always chosen. The chances of reading inconsistent data that does not reflect the latest write operation are highest in this mode. It is, however, acceptable (in fact, preferred) for applications that do not face end users, for instance hourly statistics and analytics jobs used for in-house monitoring, where absolute accuracy of the data is less important than not adding load to the primary instance. If no secondary instance is available or reachable, and only a primary instance is, the read operation will fail.

secondaryPreferred

This is similar to the preceding secondary read preference in all aspects, except that if no secondary is available, read operations will go to the primary instance.

nearest

This, unlike all the preceding read preferences, can connect either to a primary or a secondary. Its objective is minimum latency between the client and an instance of the replica set. In the majority of cases, owing to network latency, and assuming a similar network between the client and all instances, the instance chosen will be one that is geographically close.
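To make the failover semantics in the preceding table concrete, here is a minimal sketch in plain Java. The class, method, and enum names are illustrative only and are not the driver's API; this just models which kind of member each mode ends up reading from, given which members are currently reachable:

```java
// Illustrative model (not the driver's API) of the five read preference modes.
public class ReadPreferenceSketch {
    enum Mode { PRIMARY, PRIMARY_PREFERRED, SECONDARY, SECONDARY_PREFERRED, NEAREST }

    // Returns "primary", "secondary", or null when the read must fail.
    static String select(Mode mode, boolean primaryUp, boolean secondaryUp) {
        switch (mode) {
            case PRIMARY:
                return primaryUp ? "primary" : null;
            case PRIMARY_PREFERRED:
                // Falls back to a secondary only during a failover.
                return primaryUp ? "primary" : (secondaryUp ? "secondary" : null);
            case SECONDARY:
                return secondaryUp ? "secondary" : null;
            case SECONDARY_PREFERRED:
                // Falls back to the primary only when no secondary is reachable.
                return secondaryUp ? "secondary" : (primaryUp ? "primary" : null);
            case NEAREST:
                // Lowest-latency member, primary or secondary.
                return (primaryUp || secondaryUp) ? "any reachable member" : null;
        }
        return null;
    }

    public static void main(String[] args) {
        // During a failover (no primary), primary mode fails but primaryPreferred falls back.
        System.out.println(select(Mode.PRIMARY, false, true));           // prints: null
        System.out.println(select(Mode.PRIMARY_PREFERRED, false, true)); // prints: secondary
    }
}
```

Note how only primary and secondary can fail outright; the two "preferred" modes and nearest degrade gracefully as long as any member is reachable.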

Similar to how write concerns can be coupled with shard tags, read preferences can also be used along with shard tags. As the concept of tags has already been introduced in Chapter 4, Administration, you can refer to it for more details.

We just saw what the different types of read preferences are (except for those using tags) but the question is, how do we use them? We have covered Python and Java clients in this book and will see how to use them in their respective recipes. We can set read preferences at various levels: at the client level, collection level, and query level, with the one specified at the query level overriding any other read preference set previously.
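The precedence rule described above can be modeled in a few lines of plain Java (hypothetical names, not the driver's API): the effective read preference is simply the most specific level at which one has been set.

```java
// Illustrative model of read preference precedence: query level overrides
// collection level, which overrides the client-level default.
public class ReadPrefPrecedence {
    static String effective(String clientLevel, String collectionLevel, String queryLevel) {
        if (queryLevel != null) return queryLevel;
        if (collectionLevel != null) return collectionLevel;
        return clientLevel; // the client-level default always exists
    }

    public static void main(String[] args) {
        System.out.println(effective("primary", "secondaryPreferred", null));      // prints: secondaryPreferred
        System.out.println(effective("primary", "secondaryPreferred", "nearest")); // prints: nearest
    }
}
```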

Let us see what the nearest read preference means. Conceptually, it can be visualized as something like the following diagram:

A Mongo replica set is set up with one secondary, which can never become a primary, in a separate data center, and two instances (one primary and one secondary) in another data center. An identical application deployed in both data centers, with a primary read preference, will always connect to the primary instance in Data Center I. This means that for the application in Data Center II, the traffic goes over the public network, which has high latency. However, if the application is ok with slightly stale data, it can set the read preference to nearest, which automatically lets the application in Data Center I connect to an instance in Data Center I and the application in Data Center II connect to the secondary instance in Data Center II.

But then the next question is, how does the driver know which one is the nearest? The term "geographically close" is misleading; it is actually the one with the minimum network latency. The instance we query might be geographically further than another instance in the replica set, but it can be chosen just because it has an acceptable response time. Generally, better response time means geographically closer.

The following section is for those interested in internal details from the driver on how the nearest node is chosen. If you are happy with just the concepts and not the internal details, you can safely skip the rest of the contents.

Knowing the internals

Let us see some pieces of code from a Java client (driver 2.11.3 is used for this purpose) and make some sense out of it. If we look at the com.mongodb.TaggableReadPreference.NearestReadPreference.getNode method, we see the following implementation:

@Override
ReplicaSetStatus.ReplicaSetNode getNode(ReplicaSetStatus.ReplicaSet set) {
  if (_tags.isEmpty())
    return set.getAMember();

  for (DBObject curTagSet : _tags) {
    List<ReplicaSetStatus.Tag> tagList = getTagListFromDBObject(curTagSet);
    ReplicaSetStatus.ReplicaSetNode node = set.getAMember(tagList);
    if (node != null) {
      return node;
    }
  }
  return null;
}

For now, if we ignore the contents where tags are specified, all it does is execute set.getAMember().

The name of this method tells us that there is a set of replica set members and one of them is returned randomly. What, then, decides whether the set contains a member or not? If we dig a bit further into this method, we see the following lines of code in the com.mongodb.ReplicaSetStatus.ReplicaSet class:

public ReplicaSetNode getAMember() {
  checkStatus();
  if (acceptableMembers.isEmpty()) {
    return null;
  }
  return acceptableMembers.get(random.nextInt(acceptableMembers.size()));
}

Ok, so all it does is pick one at random from a list of replica set nodes maintained internally. The random pick can be a secondary even if a primary is present in the list. Thus, we can now say that when nearest is chosen as the read preference, even if a primary is among the contenders, it will not necessarily be the one chosen.

The question now is, how is the acceptableMembers list initialized? We see it is done in the constructor of the com.mongodb.ReplicaSetStatus.ReplicaSet class as follows:

this.acceptableMembers = Collections.unmodifiableList(
    calculateGoodMembers(all, calculateBestPingTime(all, true), acceptableLatencyMS, true));

The calculateBestPingTime line just finds the best ping time of all (we will see what this ping time is later).

Another parameter worth mentioning is acceptableLatencyMS. This gets initialized in com.mongodb.ReplicaSetStatus.Updater (this is actually a background thread that updates the status of the replica set continuously), and the value for acceptableLatencyMS is initialized as follows:

slaveAcceptableLatencyMS = Integer.parseInt(
    System.getProperty("com.mongodb.slaveAcceptableLatencyMS", "15"));

As we can see, this code reads the system property called com.mongodb.slaveAcceptableLatencyMS and, if it is not set, defaults to the value 15, which is 15 ms.
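A self-contained, runnable version of that lookup, using only the JDK, looks like this (the property name is the real one used by the 2.x driver; the surrounding class is just a sketch):

```java
// Sketch of the default-latency lookup: read a system property and fall back
// to "15" (milliseconds) when the property is not set.
public class AcceptableLatency {
    static int acceptableLatencyMS() {
        return Integer.parseInt(
            System.getProperty("com.mongodb.slaveAcceptableLatencyMS", "15"));
    }

    public static void main(String[] args) {
        System.out.println(acceptableLatencyMS()); // 15, unless the property was set
    }
}
```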

This com.mongodb.ReplicaSetStatus.Updater class also has a run method that periodically updates the replica set stats. Without getting too much into it, we can see that it calls updateAll, which eventually reaches the update method in com.mongodb.ConnectionStatus.UpdatableNode:

long start = System.nanoTime();
CommandResult res = _port.runCommand(_mongo.getDB("admin"), isMasterCmd);
long end = System.nanoTime();

All it does is execute the {isMaster:1} command and record the response time in nanoseconds. This response time is converted to milliseconds and stored as the ping time. So, coming back to the com.mongodb.ReplicaSetStatus.ReplicaSet class, all calculateGoodMembers does is find and add those members of the replica set whose ping time is no more than acceptableLatencyMS milliseconds above the best ping time found in the replica set.
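The measurement itself can be sketched with plain Java; here a Thread.sleep stands in for actually running the {isMaster:1} command over the network (the class and method names are illustrative, not the driver's):

```java
// Sketch of the ping-time measurement: time an operation with System.nanoTime()
// and convert the elapsed time to milliseconds.
public class PingTime {
    static long measurePingMillis(Runnable command) {
        long start = System.nanoTime();
        command.run();
        long end = System.nanoTime();
        return (end - start) / 1_000_000; // nanoseconds to milliseconds
    }

    public static void main(String[] args) {
        long ms = measurePingMillis(() -> {
            try { Thread.sleep(20); } catch (InterruptedException ignored) {}
        });
        System.out.println(ms >= 20); // the sleep guarantees at least 20 ms elapsed
    }
}
```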

For example, in a replica set with three nodes, suppose the ping times from the client to the three nodes (node 1, node 2, and node 3) are 2 ms, 5 ms, and 150 ms respectively. The best time is 2 ms and hence, node 1 goes into the set of good members. From the remaining nodes, all those with a latency no more than acceptableLatencyMS above the best time are also considered, that is, up to 2 + 15 ms = 17 ms with the default value of 15 ms. Thus, node 2 is also a contender, leaving out node 3. We now have two nodes in the list of good members (good in terms of latency).
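That selection of "good" members can be sketched in a few lines of plain Java (an illustrative model of what calculateGoodMembers computes, not the driver's code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative model: keep every member whose ping time is within
// acceptableLatencyMS of the best ping time in the replica set.
public class GoodMembers {
    static List<Long> goodMembers(List<Long> pingTimesMs, long acceptableLatencyMS) {
        long best = Collections.min(pingTimesMs);
        List<Long> good = new ArrayList<>();
        for (long ping : pingTimesMs) {
            if (ping <= best + acceptableLatencyMS) {
                good.add(ping);
            }
        }
        return good;
    }

    public static void main(String[] args) {
        // The example from the text: pings of 2 ms, 5 ms, and 150 ms with the 15 ms default.
        System.out.println(goodMembers(List.of(2L, 5L, 150L), 15)); // prints: [2, 5]
    }
}
```

With the 2/5/150 ms pings from the example, the cutoff is 2 + 15 = 17 ms, so nodes 1 and 2 survive and node 3 is dropped.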

Now, putting together all that we have seen, here is how it would work for the scenario in the preceding diagram. The least response time will be from one of the instances in the same data center as the client, since the instances in the other data center are unlikely to respond within 15 ms (the default acceptable value) of the best response time, owing to public network latency. Thus, the acceptable nodes for an application in Data Center I will be the two replica set nodes in that data center, and one of them will be chosen at random; in Data Center II, only one instance is present, so it is the only option and will be chosen by the application running there.