INVENTRIZ...: January 2014

Saturday, 25 January 2014

Features That Make the Difference

Hello friends... In our last discussion we covered how a distributed architecture gives better performance for all the BIG data solutions and NoSQL technologies. In this post we will cover the aspects of MongoDB that make the difference.

Ease of Use:

At the very core of its architecture, MongoDB is a document-oriented database, not a relational one. The main reason for moving away from a relational structure is to make scaling out easier, though it brings other advantages as well.

MongoDB replaces the concept of a "row" with the concept of a "document". A document is a JSON-style (JavaScript Object Notation) object with a key-value structure that allows child documents and arrays to be embedded. This makes it possible to represent a hierarchical relationship in a single record.

Think of a scenario where you want to find the full hierarchy of a structure, starting from any child node and traversing up to its root. In Oracle you would use a CONNECT BY PRIOR query, right? That query is itself a costly one. The same lookup can be much faster in MongoDB when the hierarchy is stored in a single document.
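To make the idea concrete, here is a minimal sketch in plain Python (not MongoDB itself). The org-chart data, the "reports" key and the path_to_root helper are all made up for illustration; the point is only that once the whole hierarchy sits in one nested record, walking from a child to the root needs no joins.

```python
# Hypothetical org-chart "document": each node embeds its children in an array.
org = {
    "name": "CEO",
    "reports": [
        {"name": "CTO", "reports": [
            {"name": "Dev Lead", "reports": []},
        ]},
        {"name": "CFO", "reports": []},
    ],
}

def path_to_root(node, target, trail=()):
    """Return the names from the target node up to the root, or None."""
    trail = trail + (node["name"],)
    if node["name"] == target:
        return list(reversed(trail))
    for child in node["reports"]:
        found = path_to_root(child, target, trail)
        if found is not None:
            return found
    return None

print(path_to_root(org, "Dev Lead"))  # ['Dev Lead', 'CTO', 'CEO']
```

The whole traversal happens in memory on a single record, which is roughly what reading one embedded document gives you.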

Schema Less:

With a relational database you define a schema before you start any development. Given today's development agility (where change is the only constant), we cannot avoid changes to the data model as we move through the development life cycle, and those changes obviously impact the application layer.

MongoDB is schema-less. You don't have to restrict yourself to a strict definition: a document's keys and values are not of fixed size or type. Adding and removing keys also becomes easier, which makes development faster.
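As a rough illustration, plain Python dictionaries behave much like schema-less documents. The names and fields below are invented, and a Python list stands in for a collection, so treat this as a sketch of the idea rather than MongoDB code.

```python
# Two hypothetical "documents" in the same collection; their keys differ.
users = [
    {"name": "Asha", "email": "asha@example.com"},
    {"name": "Ravi", "phones": ["555-0100", "555-0101"], "age": 31},
]

# Adding a new key later needs no schema change; other documents just lack it.
users[0]["city"] = "Pune"

def with_key(docs, key):
    """Return the names of the documents that carry the given key."""
    return [d["name"] for d in docs if key in d]

print(with_key(users, "city"))  # ['Asha']
print(with_key(users, "age"))   # ['Ravi']
```

The flip side is that the application has to cope with documents that lack a key, as with_key does here.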

Easy Scaling:

We have already discussed how scaling out provides better performance. MongoDB has been designed to scale out. The document-oriented data model makes it easier to split a collection's documents across multiple servers. The database automatically takes care of balancing the data and load across a cluster, redistributing documents and routing user requests to the correct machines, so developers can focus on programming the application, not scaling it. When a cluster needs more capacity, new machines can be added and MongoDB will figure out how the existing data should be spread across them.
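Here is a toy sketch of the routing idea in Python. It uses simple hash-modulo partitioning, which is an assumption made for brevity and not MongoDB's actual mechanism (MongoDB partitions data into chunks by shard key and moves chunks between shards). The key names and helper functions are hypothetical.

```python
import hashlib

def shard_for(key: str, shard_count: int) -> int:
    """Map a document key to a shard index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count

def distribute(keys, shard_count):
    """Group keys by the shard they would be routed to."""
    shards = {i: [] for i in range(shard_count)}
    for key in keys:
        shards[shard_for(key, shard_count)].append(key)
    return shards

keys = [f"user-{n}" for n in range(12)]
three = distribute(keys, 3)
four = distribute(keys, 4)  # "add a machine": the same keys over 4 shards

for shard, members in four.items():
    print(shard, members)
```

The application only ever asks for a key; deciding which machine holds it is the router's job. Note that naive modulo hashing reshuffles most keys when a shard is added, which is one reason real systems use smarter schemes.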

Indexing:

MongoDB is designed to support generic secondary indexes, allowing a variety of fast queries. It provides unique, compound, and full-text indexes. One of its distinctive features is support for geospatial indexes.

Aggregation:

MongoDB supports a pipeline concept that builds complex aggregations from simple pieces and allows the database to optimize them. This helps the user implement complex logic (filter, sort, skip, limit) in one query.
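The pipeline idea can be sketched in a few lines of plain Python. This mimics the concept only: the stage helpers (match, sort_by, skip, limit, run_pipeline) and the sample orders are invented here and are not MongoDB's API.

```python
def match(predicate):
    """Stage: keep only the documents that satisfy the predicate."""
    return lambda docs: [d for d in docs if predicate(d)]

def sort_by(key, descending=False):
    """Stage: order the documents by one key."""
    return lambda docs: sorted(docs, key=lambda d: d[key], reverse=descending)

def skip(n):
    """Stage: drop the first n documents."""
    return lambda docs: docs[n:]

def limit(n):
    """Stage: keep at most n documents."""
    return lambda docs: docs[:n]

def run_pipeline(docs, stages):
    """Feed the output of each stage into the next one."""
    for stage in stages:
        docs = stage(docs)
    return docs

orders = [
    {"item": "pen", "qty": 5},
    {"item": "book", "qty": 12},
    {"item": "lamp", "qty": 9},
    {"item": "desk", "qty": 20},
]

top = run_pipeline(orders, [
    match(lambda d: d["qty"] > 5),
    sort_by("qty", descending=True),
    skip(1),
    limit(2),
])
print([d["item"] for d in top])  # ['book', 'lamp']
```

Each stage is small and does one thing; the complex query is just the ordered list of stages.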

Special Collection:

What an RDBMS calls a table is known in MongoDB as a collection. MongoDB supports time-to-live collections for data that should expire at a certain time, such as sessions. Fixed-size collections, which are useful for holding recent data such as logs, are also supported.

Apart from these distinctive features, common features such as replication, backup and recovery, monitoring statistics, security, and user administration are also supported.

As we pointed out earlier, two main features of a relational database, namely joins and multi-document transactions, are not supported in MongoDB. There are provisions and recommendations on how to address these "limitations". The schema design of MongoDB plays a major role here, and it will answer your question.

Before going into schema design, we will quickly touch upon some MongoDB nomenclature and hands-on practice.


Sunday, 19 January 2014

A Deep Dive to Achieve Better Performance and Scalability

Welcome back friends.

First, let me thank you all for the overwhelming (and unexpected) response to the last post. We have received responses through multiple channels.

The main goal of this discussion is to elaborate on why NoSQL technologies offer much better performance and easier scalability than a traditional RDBMS.

Distributed File System Architecture:

The KEY lies in the distributed file system architecture and parallel processing of all the BIG data solutions. Let's discuss this with an example. I will ask you to do some math here.

Let's say there is 1 TB of data that we need to read from a disk. The disk has 4 I/O channels, each with an I/O speed of 100 MB/sec. Your assignment is to calculate how long it would take to read the whole 1 TB of data using those 4 I/O channels. (No scrolling before you calculate, please .. :))

Time to read through one channel is
t = 1,000,000,000,000 / (100 x 1,000,000) = 10,000 sec

Time to read through 4 channels is
t = 10,000 / 4 = 2,500 sec = 41.67 min

Pretty simple right!!!

Now let's distribute this data as 10 chunks across 10 different disks, each with the same configuration as before. What would be the total time to read the 1 TB of data this time?

t = 2,500 / 10 = 250 sec = 4.17 min

A straightaway 10x advantage in performance, cool!!! (In real life the math won't be this straightforward; some additional time is needed because of network latency, but there is still an advantage.)
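If you want to check the arithmetic above yourself, here is a small Python sketch. The read_time_seconds helper is my own, and it assumes the idealized case from the example: decimal units (1 TB = 10^12 bytes), data split evenly, every channel reading in parallel, and no network or seek overhead.

```python
def read_time_seconds(total_bytes, channel_bytes_per_sec, channels, disks=1):
    """Idealized read time: data split evenly, all channels read in parallel."""
    throughput = channel_bytes_per_sec * channels * disks
    return total_bytes / throughput

ONE_TB = 10**12               # decimal terabyte, as in the example
CHANNEL_SPEED = 100 * 10**6   # 100 MB/sec per I/O channel

single_disk = read_time_seconds(ONE_TB, CHANNEL_SPEED, channels=4)
ten_disks = read_time_seconds(ONE_TB, CHANNEL_SPEED, channels=4, disks=10)

print(single_disk, round(single_disk / 60, 2))  # 2500.0 41.67
print(ten_disks, round(ten_disks / 60, 2))      # 250.0 4.17
```

Changing the disks argument shows how the ideal read time shrinks in proportion to the number of disks.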

In a Real-Life Scenario:

You might ask why we cannot simply have a distributed architecture with a traditional RDBMS, and why we need to go for a totally new solution. That is the whole essence.

To build a system that is highly scalable and cost-effective at the same time, we have two approaches:
- Vertical Scalability (scaling up)
- Horizontal Scalability (scaling out)

Vertical scalability means upgrading the resources of the same system (RAM, processor, disk space, etc.). This is not a cost-effective solution: high-end servers are expensive, and at some point further upgrades become impractical.

Horizontal scalability means adding more machines to a cluster for parallel processing. With MongoDB, scaling out is easy to configure, and one can scale out an existing database with low-cost commodity hardware. To start with, a single-node cluster can be configured. As the data volume grows, more servers can be added to the cluster and the data partitioned across them (this technique is called sharding in MongoDB terminology) without affecting application development and with zero downtime.

The other aspect of MongoDB is the variety of data it supports. By variety of data we mean that structured, semi-structured, and unstructured data can all be accommodated in the design.

In our next discussion we will talk about some of the important features of MongoDB and the value those features bring compared with its counterparts.

See you there ...


Thursday, 16 January 2014

Why MongoDB

Hi friends, we are back again with the MongoDB mania.

In our last discussion we covered installing MongoDB on your system. Before going to the next level of detail, we will touch upon some background. In this post we will look at the circumstances in which MongoDB evolved and why we should use it at all. This is one of the basic questions one should ask before learning or using any new technology.

The BIG DATA Scenario:

Let's take a step back to 2006-2007. Industry leaders started feeling the impact of data growth, whose rate was increasing rapidly every year. A terabyte of data, once rarely heard of, has become a common scenario nowadays (we are getting 1 TB flash storage as well). An airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time. NYSE generates about one terabyte of new trade data per day, which is analyzed to determine trends. Facebook users spend 10.5 billion minutes (almost 20,000 years) online on the social network, and an average of 3.2 billion likes and comments are posted every day. Twitter has over 500 million registered users.
As far as global data volume is concerned, it has already crossed 2.7 zettabytes (1 ZB = 1 billion terabytes; please count the trailing zeros yourself).

With all these statistics in mind, it can safely be said that storing this enormous volume of data is NOT the challenge (the cost of storage keeps getting cheaper with advances in semiconductor technology). The problem comes when we start processing this BIG data. Storage capacity keeps increasing, but I/O speed is not increasing at a comparable rate, and that is what becomes the bottleneck.

Challenge with the traditional RDBMS:

In the last 10 years, the Internet has challenged relational databases in ways nobody could have foreseen. First you have a single server with a small data set. Then you find yourself setting up replication so you can scale out reads and deal with potential failures. And, before too long, you’ve added a caching layer, tuned all the queries, and thrown even more hardware at the problem. Eventually you arrive at the point when you need to shard the data across multiple clusters and rebuild a ton of application logic to deal with it. And soon after that you realize that you’re locked into the schema you modeled so many months before.

Why? Because there’s so much data in your clusters now that altering the schema will take a long time and involve a lot of precious DBA time. It’s easier just to work around it in code. This can keep a small team of developers busy for many months. In the end, you’ll always find yourself wondering if there’s a better way—or why more of these features are not built into the core database server.

Welcome to the NoSQL world:

Keeping all these challenges in mind, it was time to come up with an alternative. MongoDB, developed by 10gen, is a powerful, flexible, and scalable general-purpose database. It combines the ability to scale out with features such as secondary indexes, range queries, sorting, aggregations, and geospatial indexes.

The easy-to-use features of MongoDB enable agile developers to build their applications fast, with cost-effective scale-out capability and high performance.

There are, however, some compromises in features compared with relational databases. MongoDB doesn't support joins or multi-document transactions (if you don't know what a document is, don't worry, we will cover it in the next posts). This was a deliberate design decision to prioritize performance and scalability over those two features. That said, there is good guidance available on how to design your application around it.

Nice stuff to know!! We will now pick up the pace...
