Azure DocumentDB – Design Concepts

It was 2015 when I worked with Microsoft and delivered a small talk on Document DB at “Great Indian Developer Summit” along with my friend and mentor Wriju Da : ). We talked about this technology as part of polyglot persistence pattern and advocated use of multiple persistent stores and not just sticking to RDBMS every time.
It was a buzzword during those days. Since then I’ve worked with many customers and I’ve seen the trend and pace with which NOSQL grew. Today it is an actively sought after database approach and upward trend continues.

Today, I work as a Solution Architect in ComTec and even after NOSQL having gained wide popularity, every time I work with customers I see both the confusion and enthusiasm in the eyes of developers and Architects alike. So, I thought about sharing my bit.

This post is meant for anyone trying to understand NOSQL in general and specifically Azure Document DB from design and architecture perspective. Unlike traditional approach where we try to explain by keeping RDBMS as base (which I find confusing), I’m presenting it in some questions, answers and scenario basis. I’ve used examples from one of the official microsoft documentation instead of creating new ones.

Question:
What is document DB?

Answer:
It is managed (meaning, you don’t have to manage, Microsoft will) NOSQL DB Service from Microsoft.

When explaining NOSQL DB, most of the articles talk about the fact that they don’t have the concept of schemas and hence they are very flexible. While this is true, understanding document DB by virtue of being “schema free” only is not the correct approach (in my view) and often leads to confusion later. So, In order to explain document DB (or NOSQL in general) I prefer explaining following two pointers:

  1. Document DB stores your objects “as is” generally using JSON format (please read JSON format separately if you’re not familiar with it).
    The term “as is” is used loosely to signify the fact that there are no dedicated tables or any other kind of structured containers (of course they are within a container but they are not structured/schema bound). These stored objects in NOSQL databases are also known as “documents” (we will use this term going forward).
  2. There is absolutely no relationship between different objects from storage perspective (you may have that in your head or within app logic).

Question:
If documents are stored “as is” and if they don’t have any relationship between them then how do I perform queries that involve relationship scenarios (like 1:1, 1:may, many:many)? After all having such relations between two different entities is a reality.

Answer:
The first design technique that is used to address relationship challenge is “avoiding any relationships altogether”. Though you can technically store any kind of object from your app to document DB, it is important to design your objects in such a way that they are “self-contained” (not having any dependency on any other object).
This can be explained with a very simple example. Imagine you had “Person” and “Address” tables in RDBMS joined by “Person ID” field. If you create corresponding classes in application also (Person and Address class) and then try to store objects of these types separately in DocumentDB, then you’d be committing a mistake by thinking RDBMS way. Instead, you’d need to think the NOSQL way where you would still create classes called “Person” and “Address” but this time you’d have “Adresses” as one of the property of “Person” class (maybe of type list

). Now, when you create an object of “Person” it will be “self-contained”. The object will be serialized to JSON format and stored in a document DB collection thereby completely avoiding the need for any relationship (and hence joins). So, essentially what’s done here is that we “de-normalized” the data (by not storing objects of separate classes separately)… instead we “embedded” “Address” within “Person” object and they will actually be serialized and stored in document DB as a single document in JSON format with structure something similar to below:
{
“id”: “1”,
“firstName”: “ABC”,
“lastName”: “XYZ”,
“addresses”:
[
{
“street”: “My Street”,
“city”: “MyCity”,

}
],…
}

Remember:Embedding technique means we are de-normalizing and it works well with 1:1 relationships.

Question:
The above example looks good in a specific scenario. But we still have some scenarios where this may not be an option. For example, what if we want to store details about “orders”. Now if we embed orders within “Person” then how do I query based on “Orders” only without keeping “person” in mind? Also, if we follow this approach, we’d end up storing ALL of my entities inside “Person”. That just doesn’t make sense because (a) it will increase size of individual documents (stored objects) and (b) its like going back to forced hierarchical data design where everything stemmed out from a root entity (“person in this case).

Answer:
YES, if you end up creating very large objects (because you are forcing NOSQL way) then you certainly messed up the document DB design. Large objects are hard to move over wire and if you’re thinking about trimming them at your service level then you are opening a Pandora’s box. You’d definitely end up with a solution design that is over complicated and costly.

“Embedding” technique must be used only when you know that it will not lead to large object designs AND when you know that a particular piece of data is almost always required. I read somewhere… “In NoSQL, you don’t design your database based on the relationships between data entities. You design your database based on the queries you will run against it.” This statement clearly tells that if you have the frequent need to query “address” along with “Person” then you can embed otherwise not.

Another caution while using “embedding” technique is to think about cost associated with update operations. As an example below, we can embed “stock” that a user person invested in.

{
“id”: “1”,
“firstName”: “ABC”,
“lastName”: “XYZ”,
“investments”:
[
{
“numberHeld”: 100,
“stock”: { “symbol”: “myStock1”, “value”: 0.5 }
},

]
}

Now, it may seem that embedding is good here but imagine that myStock1 may be changing 100s of times in a day. And if we have 1000s of “Person” in our database who have investments in this stock then updating value of this stock in system will require you to update 1000s of documents.

On top of that, you have to do that from your application logic. This means if your app doesn’t handle update logic carefully you may end up in an “anamoly” scenario where price for same item may be shown different for different users. Document DB wouldn’t be able to help you to avoid such anomalies. Azure Document DB does provide you feature of stored procs or server side triggers to lessen the design burden but still they can’t match what we get with RDBMS.

In short, you are correct, embedding is not the answer to all scenarios.

Question:
So, what is the alternative? How do we handle such scenarios if embedding is not a preferred approach for such scenarios?

Answer:
When “embedding” doesn’t work (and that is dependent on your app scenario), you use another technique called “referencing”. In referencing technique, instead of actually embedding “investment” objects within “person”, we’d store just the ID of investment (which is stored separately). Wait… didn’t I just say that if I’m storing the objects separately and joining by IDs, then I’m committing the crime of thinking the RDBMS way? Yes, I said it. And I said that to actually point out the hard reality that even after coming to NOSQL DB, you may end up using relational features. There is actually NO clean way to handle these “real” problems in NOSQL DBs.

In fact, the problems doesn’t here… let’s say that for those “rare” scenarios you are ready to use “referencing”. Now, we have really opened the Pandora’s box because…

  1. Document DB doesn’t have any concept of “relationships” so this “referencing” technique is purely application logic driven. You need to be “careful” to design application in such a way that whenever you want to fetch data that need two objects consolidated, you’d have to do it programmatically. I am sure you do understand the implication of performing join in application memory vs at server side. I won’t go into those details. Though, Azure DocumentDB provides concept of stored procedures so that you don’t eat your app server’s resources but that support is limited by multiple factors including design decisions.
  2. We are back to normalization approach but this time without any help from underlying database.
  3. We also enter into problem of deciding which class should reference which should be referenced. For example, should we reference “book” within “publisher” or “publisher” within “book” class? Answer is: If the number of the books per publisher is small, then storing the book reference inside the publisher document may be useful otherwise vice versa. So, even referencing technique changes per application scenario (unlike RDBMS).

Question:
What if there is many:many relationship? For example, an Author has many books and a book has many authors. Which class will reference and which will be referenced?

Answer:
To solve this many:many problem, we use technique of creating “joining document”. In this technique neither author nor book class will reference each other. Instead, there will be yet another class, say, “authorBooks” which will contain author ID and IDs of all the books associated with it. But then the join will be very complex and we also created VERY RDBMS oriented data design. To avoid this, we “reference” author IDs in book and book IDs in author class. Of course you can always use mix of embedding and referencing.

Question:
How does Document DB infer schema of objects and how does indexing work?

Answer:
The answer lies in the fact that JSON is self-describing (i.e., has data and schema). Document DB creates a tree structure out of the schema of JSON object. Each property of JSON object represents a node in this tree. Then each value of this tree gets a definite path. Every path in document DB is indexed. Whenever we update any object, the index is updated. Both values and schema are treated equally when it comes to path. The index of paths itself is a tree structure. Indexes are updated synchronously… which means if you update a document, its result will not be returned until index tree is also updated.

Question:
What are performance implications?

Answer:

  • DocumentDB delivers predictable performance.
  • It does so by allocating request units to your DB collection.
  • Request unit represents the cost associated to process a request of reading a single 1KB JSON document having 10 unique properties.
  • Insert, replace, delete etc. are more costly than read operation.
  • Each query or update operation may have different cost requirements based on size of document and query.
  • Determining the approx number of RU that your application may need is difficult so Microsoft has created a RU calculator where you can upload sample JSON documents and specify read, create, update and delete operations /second needed by your app for that JSON document and calculator will give RUs that you should reserve.

Buzz me if you liked it.

-Rahul Gangwar

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s