Connecting to a shard from the Mongo shell and performing operations


In this recipe, we will connect to a sharded cluster from the Mongo shell, shard a collection, and observe the data being split across the shards using some test data.

Getting ready

We need a sharded MongoDB setup that is up and running; see the previous recipe for details on how to set up a simple shard. As in the previous recipe, the mongos process should be listening on port 27017. We also need a JavaScript file called names.js, which can be downloaded from the book's site and kept on the local filesystem. The file defines a variable called names whose value is an array of JSON documents, each one representing a person. The contents look as follows:

names = [
  {name:'James Smith', age:30},
  {name:'Robert Johnson', age:22},
…
]

How to do it…

Let's take a look at the steps in detail:

  1. Start the Mongo shell and connect to the default port on localhost as follows (loading names.js ensures that the names array is available in the current shell):

    mongo --shell names.js
    MongoDB shell version: 2.4.6
    connecting to: test
    mongos>
    
  2. Switch to the database that will be used to test sharding as follows (we call it shardDB):

    mongos> use shardDB
    
  3. Enable sharding at the database level as follows:

    mongos> sh.enableSharding("shardDB")
    
  4. Shard a collection called person as follows:

    mongos> sh.shardCollection("shardDB.person", {name: "hashed"}, false)
    
  5. Add test data to the sharded collection as follows:

    mongos> for(i = 1; i <= 300000 ; i++) {
    ... person = names[Math.round(Math.random() * 100) % 20]
    ... doc = {_id:i, name:person.name, age:person.age}
    ... db.person.insert(doc)
    ... }
    
  6. Execute the following command to get a query plan and the number of documents on each shard:

    mongos> db.person.find().explain()
    
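In addition to the explain plan from step 6, the chunk distribution of the sharded collection can be inspected from the same mongos shell. The sh.status() helper prints a summary of the shards and the sharded collections, and the config database keeps one document per chunk; the exact output depends on your data and version, so the following is only an illustration of the commands:

  mongos> sh.status()
  mongos> use config
  mongos> db.chunks.count({ns: "shardDB.person"})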

How it works…

This recipe demands some explanation. We downloaded a JavaScript file that defines an array of 20 people, where each element of the array is a JSON object with a name and an age attribute. We then started the shell connected to the mongos process with this JavaScript file loaded, and switched to shardDB, the database we will use for sharding.

For a collection to be sharded, the database in which it will be created needs to be enabled for sharding first. We do this using sh.enableSharding().
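
The sh.enableSharding() helper is a thin wrapper around the enableSharding admin command, so the step above is roughly equivalent to running the command directly against the admin database (the response shown is typical, but its exact shape can vary by version):

  mongos> db.adminCommand({enableSharding: "shardDB"})
  { "ok" : 1 }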

The next step is to shard the collection itself. By default, all the data is kept on one shard and is not split across shards. For the split to be meaningful, the data should be distributed as evenly as possible so that, whenever we query based on the shard key, Mongo can easily determine which shard(s) to query. If a query doesn't contain the shard key, it is executed on all the shards and the results are collated by the mongos process before being returned to the client. Choosing the right shard key is therefore crucial.
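
As a quick illustration with the person collection from this recipe (sharded on the hashed name key), comparing the explain() output of the following two queries shows the difference between a targeted query and a scatter-gather query; an equality match on name can be routed to the shard owning that hash value, whereas a query on age alone is broadcast to every shard:

  mongos> db.person.find({name: "James Smith"}).explain()
  mongos> db.person.find({age: 30}).explain()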

Let's now look at how we sharded the collection, which we did by invoking sh.shardCollection("shardDB.person", {name: "hashed"}, false). There are three parameters here:

  • The first parameter is the fully qualified name of the collection in the <db name>.<collection name> format.

  • The second parameter specifies the field of the collection to shard on; this is the field that will be used to split the documents across the shards. One of the requirements of a good shard key is high cardinality (a large number of possible values). In our test data, the name field has very low cardinality and is therefore not a good shard key on its own, so we hash it by specifying the key as {name: "hashed"}.

  • The last parameter specifies whether the values of the shard key are unique. The name field is definitely not unique, so it is false here. If the field were, say, a person's social security number, it could be set to true; an SSN is also a good choice for a shard key because of its high cardinality (see the sketch after this list). Remember, though, that for a query to be routed efficiently, it has to include the shard key.
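
For instance, a collection keyed on such a unique field could be sharded as follows; the shardDB.employee collection and its ssn field are purely hypothetical and not part of this recipe's data (also note that the unique flag only works with a range-based key such as {ssn: 1}, not with a hashed key):

  mongos> sh.shardCollection("shardDB.employee", {ssn: 1}, true)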

The last step is to look at the execution plan for finding all the data. The intent of this operation is to see how the data is split across the two shards. With 300,000 documents, we expect around 150,000 documents on each shard. In the explain output, the shards attribute has one entry per shard in the cluster; in our case there are two, each giving the query plan for that shard. In each of them, the value of n is worth looking at, as it gives the number of documents that reside on that shard. The following code snippet is the relevant JSON document we see on the console. The number of documents on shards one and two is 164938 and 135062, respectively:

"shards" : {
  "localhost:27000" : [
    {
      "cursor" : "BasicCursor",
      "isMultiKey" : false,
      "n" : 164938,
      "nscannedObjects" : 164938,
      "nscanned" : 164938,
      "nscannedObjectsAllPlans" : 164938,
      "nscannedAllPlans" : 164938,
      "scanAndOrder" : false,
      "indexOnly" : false,
      "nYields" : 1,
      "nChunkSkips" : 0,
      "millis" : 974,
      "indexBounds" : {

      },
      "server" : "Amol-PC:27000"
    }
  ],
  "localhost:27001" : [
    {
      "cursor" : "BasicCursor",
      "isMultiKey" : false,
      "n" : 135062,
      "nscannedObjects" : 135062,
      "nscanned" : 135062,
      "nscannedObjectsAllPlans" : 135062,
      "nscannedAllPlans" : 135062,
      "scanAndOrder" : false,
      "indexOnly" : false,
      "nYields" : 0,
      "nChunkSkips" : 0,
      "millis" : 863,
      "indexBounds" : {

      },
      "server" : "Amol-PC:27001"
    }
  ]
}

There are a couple of additional things I recommend you do.

Connect to the individual shards from the Mongo shell and execute queries on the person collection (as sketched below). Verify that the counts in these collections are close to what we see in the preceding plan, and that no document exists on both shards at the same time.
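
A minimal way to do this check, assuming the two shards listen on ports 27000 and 27001 as shown in the preceding explain output:

  $ mongo --port 27000 shardDB --eval "db.person.count()"
  $ mongo --port 27001 shardDB --eval "db.person.count()"

The two counts should add up to 300,000 and match the n values in the plan.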

We briefly discussed how cardinality affects the way the data is split across shards. Let's do a simple exercise (see the sketch after this paragraph). First, drop the person collection and execute the shardCollection operation again, but this time with the {name: 1} shard key instead of {name: "hashed"}, so that the shard key is stored as is, without hashing. Then load the data using the JavaScript loop from step 5 and, once the data is loaded, execute explain() on the collection. Observe how the data is now split (or not) across the shards.
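
Put together, the exercise looks roughly like this from the mongos shell (reusing the names array from names.js loaded at the start of the recipe):

  mongos> use shardDB
  mongos> db.person.drop()
  mongos> sh.shardCollection("shardDB.person", {name: 1}, false)
  mongos> for(i = 1; i <= 300000 ; i++) {
  ... person = names[Math.round(Math.random() * 100) % 20]
  ... db.person.insert({_id:i, name:person.name, age:person.age})
  ... }
  mongos> db.person.find().explain()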

There's more…

A lot of questions might come up now, such as what the best practices are, what tips and tricks there are, and how MongoDB pulls off sharding behind the scenes in a way that is transparent to the end user.

This recipe only explained the basics. All these questions will be answered in Chapter 4, Administration.