Sunday, 9 March 2014

Primary Key in MongoDB

Today we will talk about the primary key in MongoDB. As in any other database management system, each document in MongoDB needs to be associated with a primary key. But unlike an RDBMS, MongoDB doesn't allow a document without a primary key (in an RDBMS, for example, you can create a table without a primary key, whether or not that is advisable).

Every document in MongoDB has a field called "_id", which is the primary key for that document. The "_id" value is unique within the collection. It can be of any data type (other than an array), but the default type is ObjectId. If you insert a document without an "_id" field, MongoDB creates the field automatically, and it will be of type ObjectId. Open the mongo shell and insert the following document (with the _id field):

> db.foo.insert({_id:1, i:3, j:4, k:5})


Then find the document from the collection.

> db.foo.find()

{ "_id" : 1, "i" : 3, "j" : 4, "k" : 5 }

Now insert another document without the _id field.

> db.foo.insert({i:4, j:5, k:6})

Again find documents from the collection:

> db.foo.find()

{ "_id" : 1, "i" : 3, "j" : 4, "k" : 5 }
{ "_id" : ObjectId("5319ccee7759cfb3b91c3bcf"), "i" : 4, "j" : 5, "k" : 6 }

As you can see, even if we don't supply "_id", it is created automatically, and its default type is ObjectId. MongoDB also rejects duplicate values of the primary key. If you try to insert a document with an _id value that already exists, you will get a duplicate key error.

> db.foo.insert({_id:1, i:4, j:5, k:6})
E11000 duplicate key error index: test.foo.$_id_  dup key: { : 1.0 }





 

ObjectId

ObjectId is a special data type that is lightweight, easy to generate, and used as the default primary key in a MongoDB collection. It is generated in a way designed to keep values unique in practically all circumstances, even across multiple machines and threads. Let's analyze this in detail.

An ObjectId uses 12 bytes of storage. This gives it a string representation of 24 hexadecimal digits, 2 digits for every byte. The bytes are distributed as follows:




Timestamp : The first 4 bytes represent a timestamp in seconds since the epoch. This provides uniqueness at one-second granularity. Having the timestamp at the start of the ObjectId gives a couple of additional advantages:
  • as the timestamp comes first, sorting on "_id" returns documents in roughly insertion order
  • many drivers can extract the creation time from these 4 bytes
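To make the second point concrete, here is a minimal sketch in plain JavaScript that recovers the creation time from the first 8 hexadecimal digits of an ObjectId string. The function name objectIdToDate is made up for this illustration; in the mongo shell itself you can get the same information by calling getTimestamp() on an ObjectId.

```javascript
// Sketch: recover the creation time from the first 4 bytes (8 hex digits)
// of an ObjectId's 24-digit hexadecimal string.
// objectIdToDate is a hypothetical helper name, not a MongoDB API.
function objectIdToDate(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected a 24-digit hexadecimal string");
  }
  // The first 8 hex digits hold the number of seconds since the epoch.
  var seconds = parseInt(hex.slice(0, 8), 16);
  // JavaScript dates are expressed in milliseconds.
  return new Date(seconds * 1000);
}

// The auto-generated _id from the find() output earlier in this post.
console.log(objectIdToDate("5319ccee7759cfb3b91c3bcf").toISOString());
```

The printed time is in UTC, so it may differ from your local clock by your timezone offset.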

Machine : The next 3 bytes identify the machine on which the ObjectId is generated. These 3 bytes are a hash of the machine hostname, which makes it very unlikely that different machines generate duplicate ObjectIds.

PID : To extend uniqueness to the process level, the next 2 bytes represent the process id (PID) of the generating process. This provides uniqueness across multiple processes (e.g. mongod) running on a single machine at the same time.

Increment : The first 9 bytes cover different machines, different processes and different seconds. Now think of a situation where multiple concurrent requests generate ObjectIds in a single process on a single machine within the same second. Uniqueness at this level is achieved by the last 3 bytes, which act as an incrementing counter. This allows up to 256^3 (16,777,216) unique ObjectIds to be generated per process in one second.
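The four fields can be put together in a few lines of plain JavaScript. The following is only a sketch of the layout described above, not the actual code of the shell or of any driver: makeIdGenerator and toHex are hypothetical names, and the machine bytes and PID are simply passed in by the caller instead of being derived from the hostname and the running process. The sample values are taken from the shell session shown later in this post.

```javascript
// Sketch of an ObjectId-style generator following the 4 + 3 + 2 + 3 byte
// layout described in this post. Not the real driver implementation.

// Left-pad a number's hexadecimal form with zeros to a fixed width.
function toHex(value, width) {
  var hex = value.toString(16);
  while (hex.length < width) {
    hex = "0" + hex;
  }
  return hex;
}

// machineHex: 6 hex digits (assumed to be a hash of the hostname),
// pid: process id, startCounter: initial value of the increment field.
function makeIdGenerator(machineHex, pid, startCounter) {
  var counter = startCounter || 0;
  return function nextId(nowSeconds) {
    // Default to the current time in whole seconds since the epoch.
    var seconds = nowSeconds === undefined
      ? Math.floor(Date.now() / 1000)
      : nowSeconds;
    var id = toHex(seconds, 8) +       // 4 bytes: timestamp
             machineHex +              // 3 bytes: machine
             toHex(pid & 0xffff, 4) +  // 2 bytes: process id
             toHex(counter, 6);        // 3 bytes: increment
    // The increment wraps around after 256^3 = 16,777,216 values.
    counter = (counter + 1) % 0x1000000;
    return id;
  };
}

var nextId = makeIdGenerator("ff2d9e", 0xf199, 0x28e379);
console.log(nextId(0x5315db03));
console.log(nextId(0x5315db03));
console.log(nextId(0x5315db38));
```

Passing the timestamp in explicitly keeps the example repeatable; calling nextId() with no argument uses the current time instead.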


Now we will see some examples. One important point is that an ObjectId can be generated at the server, but this is generally done on the client side (by the driver or by the shell). Open your mongo shell and type the following in succession to see the generation pattern of the ObjectId:

> new ObjectId()
ObjectId("5315db03ff2d9ef19928e379")
> new ObjectId()
ObjectId("5315db03ff2d9ef19928e37a")
> new ObjectId()
ObjectId("5315db38ff2d9ef19928e37b")

By now you should understand that the generation is happening purely on the client side, as we haven't inserted anything into the DB. If you analyze the pattern of the 24 hexadecimal digits, you will see that the only digits that change are the timestamp digits and the increment digits. If we split the strings, we get the following:



         Timestamp    Machine    PID     Increment
1st      5315db03     ff2d9e     f199    28e379
2nd      5315db03     ff2d9e     f199    28e37a
3rd      5315db38     ff2d9e     f199    28e37b




Across all three generations, the machine and PID do not change. For the first two, the timestamp is also the same, meaning these two were generated in quick succession within the same second. The increment field, however, changes on every generation, and that is how uniqueness is achieved.
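The same breakdown can be done programmatically. The sketch below, in plain JavaScript, splits the three ids from the shell session into the four fields and reports which fields stay constant. splitObjectId is a hypothetical helper written for this post, and the field boundaries follow the layout described here.

```javascript
// Split a 24-digit ObjectId hex string into the four fields described
// in this post. splitObjectId is a hypothetical helper, not a MongoDB API.
function splitObjectId(hex) {
  return {
    timestamp: hex.slice(0, 8),   // 4 bytes
    machine:   hex.slice(8, 14),  // 3 bytes
    pid:       hex.slice(14, 18), // 2 bytes
    increment: hex.slice(18, 24)  // 3 bytes
  };
}

// The three ids generated in the shell session above.
var ids = [
  "5315db03ff2d9ef19928e379",
  "5315db03ff2d9ef19928e37a",
  "5315db38ff2d9ef19928e37b"
];
var parts = ids.map(splitObjectId);

// For each field, report whether it stayed constant across all three ids.
Object.keys(parts[0]).forEach(function (field) {
  var constant = parts.every(function (p) {
    return p[field] === parts[0][field];
  });
  console.log(field + ": " + (constant ? "constant" : "changes"));
});

// Seconds elapsed between the second and the third id.
var gap = parseInt(parts[2].timestamp, 16) - parseInt(parts[1].timestamp, 16);
console.log("gap between 2nd and 3rd: " + gap + " seconds");
```

The machine and pid fields are reported as constant, while timestamp and increment change, which matches the table.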


As stated previously, if there is no "_id" key present when a document is inserted, one will be automatically added to the inserted document. This can be handled by the MongoDB server but will generally be done by the driver on the client side. The decision to generate them on the client side reflects an overall philosophy of MongoDB: work should be pushed out of the server and to the drivers whenever possible. This philosophy reflects the fact that, even with scalable databases like MongoDB, it is easier to scale out at the application layer than at the database layer. Moving work to the client side reduces the load on the database and, with it, the need to scale the database.
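Conceptually, the client-side step is very small. Here is a rough sketch in plain JavaScript of what "add an _id if it is missing" looks like before a document is sent to the server. Both withId and fakeId are hypothetical names invented for this illustration; a real driver would generate an ObjectId rather than the placeholder string used here.

```javascript
// Hypothetical helper: mimic a driver that fills in a missing _id before
// the document is sent to the server. generateId is any function that
// returns a new id.
function withId(doc, generateId) {
  if (Object.prototype.hasOwnProperty.call(doc, "_id")) {
    return doc; // the caller supplied a key, so leave the document alone
  }
  // Copy the fields into a new object that starts with the generated _id.
  var copy = { _id: generateId() };
  Object.keys(doc).forEach(function (key) {
    copy[key] = doc[key];
  });
  return copy;
}

// Stand-in generator for illustration only.
var counter = 0;
function fakeId() {
  counter += 1;
  return "generated-" + counter;
}

console.log(withId({ _id: 1, i: 3, j: 4, k: 5 }, fakeId)); // _id stays 1
console.log(withId({ i: 4, j: 5, k: 6 }, fakeId));         // _id is added
```

Because the id is attached on the client, the server only has to check it for uniqueness, which is the division of work described above.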


Today's discussion also showed something unique about the mongo shell: we can use it as a standard JavaScript shell, not only for DB operations. We will see some more interesting stuff in the next discussion.

