Azure DocumentDB – Design Concepts

It was 2015 when I worked with Microsoft and delivered a small talk on Document DB at “Great Indian Developer Summit” along with my friend and mentor Wriju Da : ). We talked about this technology as part of polyglot persistence pattern and advocated use of multiple persistent stores and not just sticking to RDBMS every time.
It was a buzzword during those days. Since then I’ve worked with many customers and I’ve seen the trend and pace with which NOSQL grew. Today it is an actively sought after database approach and upward trend continues.

Today, I work as a Solution Architect in ComTec and even after NOSQL having gained wide popularity, every time I work with customers I see both the confusion and enthusiasm in the eyes of developers and Architects alike. So, I thought about sharing my bit.

This post is meant for anyone trying to understand NOSQL in general and specifically Azure Document DB from design and architecture perspective. Unlike traditional approach where we try to explain by keeping RDBMS as base (which I find confusing), I’m presenting it in some questions, answers and scenario basis. I’ve used examples from one of the official microsoft documentation instead of creating new ones.

Question:
What is document DB?

Answer:
It is managed (meaning, you don’t have to manage, Microsoft will) NOSQL DB Service from Microsoft.

When explaining NOSQL DB, most of the articles talk about the fact that they don’t have the concept of schemas and hence they are very flexible. While this is true, understanding document DB by virtue of being “schema free” only is not the correct approach (in my view) and often leads to confusion later. So, In order to explain document DB (or NOSQL in general) I prefer explaining following two pointers:

  1. Document DB stores your objects “as is” generally using JSON format (please read JSON format separately if you’re not familiar with it).
    The term “as is” is used loosely to signify the fact that there are no dedicated tables or any other kind of structured containers (of course they are within a container but they are not structured/schema bound). These stored objects in NOSQL databases are also known as “documents” (we will use this term going forward).
  2. There is absolutely no relationship between different objects from storage perspective (you may have that in your head or within app logic).

Question:
If documents are stored “as is” and if they don’t have any relationship between them then how do I perform queries that involve relationship scenarios (like 1:1, 1:may, many:many)? After all having such relations between two different entities is a reality.

Answer:
The first design technique that is used to address relationship challenge is “avoiding any relationships altogether”. Though you can technically store any kind of object from your app to document DB, it is important to design your objects in such a way that they are “self-contained” (not having any dependency on any other object).
This can be explained with a very simple example. Imagine you had “Person” and “Address” tables in RDBMS joined by “Person ID” field. If you create corresponding classes in application also (Person and Address class) and then try to store objects of these types separately in DocumentDB, then you’d be committing a mistake by thinking RDBMS way. Instead, you’d need to think the NOSQL way where you would still create classes called “Person” and “Address” but this time you’d have “Adresses” as one of the property of “Person” class (maybe of type list

). Now, when you create an object of “Person” it will be “self-contained”. The object will be serialized to JSON format and stored in a document DB collection thereby completely avoiding the need for any relationship (and hence joins). So, essentially what’s done here is that we “de-normalized” the data (by not storing objects of separate classes separately)… instead we “embedded” “Address” within “Person” object and they will actually be serialized and stored in document DB as a single document in JSON format with structure something similar to below:
{
“id”: “1”,
“firstName”: “ABC”,
“lastName”: “XYZ”,
“addresses”:
[
{
“street”: “My Street”,
“city”: “MyCity”,

}
],…
}

Remember:Embedding technique means we are de-normalizing and it works well with 1:1 relationships.

Question:
The above example looks good in a specific scenario. But we still have some scenarios where this may not be an option. For example, what if we want to store details about “orders”. Now if we embed orders within “Person” then how do I query based on “Orders” only without keeping “person” in mind? Also, if we follow this approach, we’d end up storing ALL of my entities inside “Person”. That just doesn’t make sense because (a) it will increase size of individual documents (stored objects) and (b) its like going back to forced hierarchical data design where everything stemmed out from a root entity (“person in this case).

Answer:
YES, if you end up creating very large objects (because you are forcing NOSQL way) then you certainly messed up the document DB design. Large objects are hard to move over wire and if you’re thinking about trimming them at your service level then you are opening a Pandora’s box. You’d definitely end up with a solution design that is over complicated and costly.

“Embedding” technique must be used only when you know that it will not lead to large object designs AND when you know that a particular piece of data is almost always required. I read somewhere… “In NoSQL, you don’t design your database based on the relationships between data entities. You design your database based on the queries you will run against it.” This statement clearly tells that if you have the frequent need to query “address” along with “Person” then you can embed otherwise not.

Another caution while using “embedding” technique is to think about cost associated with update operations. As an example below, we can embed “stock” that a user person invested in.

{
“id”: “1”,
“firstName”: “ABC”,
“lastName”: “XYZ”,
“investments”:
[
{
“numberHeld”: 100,
“stock”: { “symbol”: “myStock1”, “value”: 0.5 }
},

]
}

Now, it may seem that embedding is good here but imagine that myStock1 may be changing 100s of times in a day. And if we have 1000s of “Person” in our database who have investments in this stock then updating value of this stock in system will require you to update 1000s of documents.

On top of that, you have to do that from your application logic. This means if your app doesn’t handle update logic carefully you may end up in an “anamoly” scenario where price for same item may be shown different for different users. Document DB wouldn’t be able to help you to avoid such anomalies. Azure Document DB does provide you feature of stored procs or server side triggers to lessen the design burden but still they can’t match what we get with RDBMS.

In short, you are correct, embedding is not the answer to all scenarios.

Question:
So, what is the alternative? How do we handle such scenarios if embedding is not a preferred approach for such scenarios?

Answer:
When “embedding” doesn’t work (and that is dependent on your app scenario), you use another technique called “referencing”. In referencing technique, instead of actually embedding “investment” objects within “person”, we’d store just the ID of investment (which is stored separately). Wait… didn’t I just say that if I’m storing the objects separately and joining by IDs, then I’m committing the crime of thinking the RDBMS way? Yes, I said it. And I said that to actually point out the hard reality that even after coming to NOSQL DB, you may end up using relational features. There is actually NO clean way to handle these “real” problems in NOSQL DBs.

In fact, the problems doesn’t here… let’s say that for those “rare” scenarios you are ready to use “referencing”. Now, we have really opened the Pandora’s box because…

  1. Document DB doesn’t have any concept of “relationships” so this “referencing” technique is purely application logic driven. You need to be “careful” to design application in such a way that whenever you want to fetch data that need two objects consolidated, you’d have to do it programmatically. I am sure you do understand the implication of performing join in application memory vs at server side. I won’t go into those details. Though, Azure DocumentDB provides concept of stored procedures so that you don’t eat your app server’s resources but that support is limited by multiple factors including design decisions.
  2. We are back to normalization approach but this time without any help from underlying database.
  3. We also enter into problem of deciding which class should reference which should be referenced. For example, should we reference “book” within “publisher” or “publisher” within “book” class? Answer is: If the number of the books per publisher is small, then storing the book reference inside the publisher document may be useful otherwise vice versa. So, even referencing technique changes per application scenario (unlike RDBMS).

Question:
What if there is many:many relationship? For example, an Author has many books and a book has many authors. Which class will reference and which will be referenced?

Answer:
To solve this many:many problem, we use technique of creating “joining document”. In this technique neither author nor book class will reference each other. Instead, there will be yet another class, say, “authorBooks” which will contain author ID and IDs of all the books associated with it. But then the join will be very complex and we also created VERY RDBMS oriented data design. To avoid this, we “reference” author IDs in book and book IDs in author class. Of course you can always use mix of embedding and referencing.

Question:
How does Document DB infer schema of objects and how does indexing work?

Answer:
The answer lies in the fact that JSON is self-describing (i.e., has data and schema). Document DB creates a tree structure out of the schema of JSON object. Each property of JSON object represents a node in this tree. Then each value of this tree gets a definite path. Every path in document DB is indexed. Whenever we update any object, the index is updated. Both values and schema are treated equally when it comes to path. The index of paths itself is a tree structure. Indexes are updated synchronously… which means if you update a document, its result will not be returned until index tree is also updated.

Question:
What are performance implications?

Answer:

  • DocumentDB delivers predictable performance.
  • It does so by allocating request units to your DB collection.
  • Request unit represents the cost associated to process a request of reading a single 1KB JSON document having 10 unique properties.
  • Insert, replace, delete etc. are more costly than read operation.
  • Each query or update operation may have different cost requirements based on size of document and query.
  • Determining the approx number of RU that your application may need is difficult so Microsoft has created a RU calculator where you can upload sample JSON documents and specify read, create, update and delete operations /second needed by your app for that JSON document and calculator will give RUs that you should reserve.

Buzz me if you liked it.

-Rahul Gangwar

Advertisements

Azure SQL Database and Azure AD Authn Flow

SQL Azure DB now supports Azure Active Directory based authentication (preview) and this needs some detailing as the official documentation is very high level. Below I am presenting the flow of this authentication mechanism so that you can create all sorts of permutations and combinations and deduce the behavior yourself for any scenario. I am taking example of a typical ASP.NET based website (say developed using C#) that access SQL Azure DB using Azure AD authentication.

The candidates involved in this scenario are described here:

  • Client: Any one who tries to access SQL Azure DB. This could be SQL Management Studio or an ASP.NET web app running on IIS or a REST based Web API.
  • SQL DB: The resource that we are trying to access.
  • Azure AD: The Security Token Service whose tokens are needed by client to access SQL Azure DB

I will not go into the details of other well documented requirements like setting up contained DB, groups, admins,.NET 4.6 and above based client etc. You can read them here: https://azure.microsoft.com/en-in/documentation/articles/sql-database-aad-authentication/

The Scenario:

I am taking a little bit bigger scenario involving both Windows AD and Azure AD and looking at different options at different stages of flow. This will help you comprehend simple scenarios easily.

The flow:

Untitled

In above scenario, we will see how a typical web app access Azure SQL DB using different ways.

(1) User is authenticated with web application. User could have authenticated using claims authentication, Kerberos or any other mechanism… it doesn’t matter. One thing in this step must be clear is that user is actually a local AD user. IIS website may or may not be using ADFS for authentication (this too doesn’t matter for our scenario).

Now, before application could use SQL DB, it first need to obtain a valid token from Azure AD. Further steps explain that process.

(2) Before client/app can obtain token from Azure AD, it will obtain token from ADFS which is federated with Azure AD. This is performed using ws-trust protocol. This can be done using one of the below options depending on what the scenario is.

  • Option 1: IIS web app can prompt user for his/her credentials, collect credentials and send them to ADFS to obtain a Token (T1). This does not require IIS web server to be domain joined.
  • Option 2: If the user was authenticated with IIS web app using claims authentication then web app can use c2wts (impersonation) to authenticate with AD FS and obtain a Token (T1). This requires IIS server to be domain joined.
  • Option 3: If the IIS web app is configured to use integrated Windows authentication with impersonation, then it can use Kerberos (constrained delegation) to authenticate with ADFS and obtain a Token (T1). This requires IIS web server to be domain joined (unless we used NTLM and made solution ugly).
  • Option 4: IIS web app can use a fixed windows identity to access ADFS (irrespective of who the user is) and obtain a Token (T1) from ADFS. If you hardcode username/pwd then IIS web server doesn’t need to be domain joined, otherwise it would need to be domain joined and app pool running under that fixed AD user’s identity.

(3) No matter which option is used by IIS web app, ultimately IIS web app will get a Token (T1) from ADFS. This token T1 is meant for Azure AD and contains user’s attributes defined by claim rules in ADFS. The user could be either the end user or fixed identity (in case of option 4) depending on which option we choose as described above.

(4) Now, the IIS web app will send T1 to Azure AD. Azure AD will accept this token as there already is a federation between ADFS and Azure AD.

(5) Azure AD will verify the token T1 and look at user’s attributes and then create a new token (T2) for the same user to get another token (T2). This token (T2) will be returned to application.

(6) Then the IIS web app will use token T2 to access Azure SQL DB. Azure SQL DB trusts Azure AD issued tokens and allows access using that user’s identity (user of step 1).

Abrupt End!

-Rahul Gangwar
Cloud & Security Consultant

 

 

Azure IP Addresses

VIP –

  • Virtual IP means an IP that is visible externally but it may not be actually assigned to actual resource. So for example, if I have a web farm with two nodes, I don’t want IP of actual resources. Instead, I need an IP that can be used to identify my farm (both the nodes/VMs). Since it is not the IP that is actually allocated to the node, it is called virtual IP.
  • VIPs are always external facing. They are never internal and never assigned to the actual VM.
  • VIPs are always associated with CloudService/Load Balancer. Documentation uses the term CS and Load Balancer interchangeably. For example this article is about VIP on CS but inside it also says that it is associated with Load Balancer. On a side note, from IaaS perspective, load balancer will be more accurate term while from PaaS perspective CS will be better to visualize.

DIP –

  • Dynamic IP is the actual IP that a node/VM gets. These are actually assigned to the VM instance’s NIC card.
  • However, it is always internal. These IPs are assigned to the VM instance and are never available outside a CS/vNet within which the VM is present.
  • These are called Dynamic because by default, the IP that your VM gets can change if you deallocate.
  • In a typical scenario, let’s assume we have 10 VMs in a vNet, then each VM will get an internal IP (DIP) and “most of times” we don’t care even if it changes. Only in few scenarios we don’t want these internal IPs to change (say when your VM is hosting DNS). In that case, we would want this internal IP (DIP) to be static rather than dynamic. So, you could “reserve” these internal IPs as well. In a typical web farm scenario, you’d never care about reserving DIP.

ILPIP/PIP –

  • Public Instance-level IP (earlier called PIP) is an IP that is visible to external world AND is actually assigned to the VM instance.
  • So, they are same as VIP except that unlike VIP, they are actually assigned to VM instance.
  • If you access a VM using its PIP, the traffic never goes through load balancer. That is the reason why you never configure/add/delete any public endpoint (http/https etc.) on cloud service when you are using PIP because endpoints never come into picture.
  • In most of scenarios, you’d never want to use PIP. Some scenarios (like passive FTP) may be eligible candidate for using PIP.
  • You’d never use PIP for web farm scenarios as you will definitely have more than one VMs in your farm and you’d want traffic to be routed via load balancer. So for web farms, you’d want to use VIPs only.

Reserved/Stable IP:

  • These are one and the same. When you reserve an IP it becomes stable… doesn’t change irrespective of your deallocation of resources etc…. though deletion does affect even reserved IPs.
  • You can reserve both VIP and DIP but not PIP (again unless something changed).

-Rahul Gangwar
rg@outlook.in