Designing Networks – Part 1 – Performance/Latency

This article is part of a larger series of articles titled “Azure Networks for Architects”.

In this article, we will discuss when we should create networks and how to organize resources within them to meet an application’s non-functional requirements such as performance, security, and management.

Networks are created to meet the “communication” needs of a solution. Every design decision we take is guided by “some” goal, and based on that goal, our network design changes. The first step is to decide how many networks we need and which components of the solution will go where. There are two common approaches that most organizations use when creating networks.

  1. Create networks based on security and/or management requirements
  2. Create networks based on application design (generally for best performance)

To understand the above approaches, let us analyze some network designs. Every network design that we analyze in this series will be split into a separate article. In this part (part 1) we will analyze the network shown below and continue analyzing other network designs in upcoming parts.

Example 1:

[Figure 1: Resources grouped by type and security into storage, DMZ, corp, and app networks]

Key design aspects of the network shown in the figure above (and their relation to Azure) are given below:

  1. In the above figure, resources are grouped based on “type” and “security” needs. For example, all storage resources (DBs, file servers, etc.) are put within a single network (grouped by type). All internet-facing resources (web servers, proxy servers…) are placed in a dedicated network (commonly called a DMZ or perimeter network). All supporting infrastructure resources (DC, DNS, ADFS…) are grouped in a separate network. All non-public-facing, custom application resources (say, Windows services…) are placed in yet another network.

Azure Design Implication:

The above grouping can introduce performance issues. If you observe, most of the resources (servers) that are grouped together do not talk to each other; in fact, they need to talk to resources placed in other networks. For example, a web server placed in the DMZ will call the corresponding app server in the app network, which in turn will access the app DB (in the storage network) and use supporting infrastructure (like the DC) from the corp network. Being in separate networks can lead to high-latency issues because communication between networks is slower than communication within the same network.

On Azure, this can be solved by placing all resources within a single network and enforcing the separation boundaries via subnets (similar to what is shown below… security aspects omitted). On Azure, as long as two resources are within the same network (even if they belong to different subnets), there is no latency (performance) implication, because traffic between two subnets of the same network doesn’t go via any specific router. Instead, the Azure fabric does simple packet filtering (which it does anyway), so there is no extra overhead related to subnet separation. This is unlike on-premises scenarios, where your subnets can actually be separated by physical routers (which turn out to be bottlenecks in cross-subnet performance scenarios).

[Figure: All resources within a single network, separated into subnets]
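To make this concrete, here is a minimal Azure PowerShell (AzureRM module) sketch of such a single-network layout, with one subnet per former network. The names, address ranges, and resource group below are illustrative assumptions, not values taken from the figure:

```powershell
# Illustrative sketch only: one virtual network whose subnets replace the
# separate networks from Example 1. All names/addresses are assumed.
$dmz     = New-AzureRmVirtualNetworkSubnetConfig -Name "DMZ"     -AddressPrefix "10.0.1.0/24"
$app     = New-AzureRmVirtualNetworkSubnetConfig -Name "App"     -AddressPrefix "10.0.2.0/24"
$storage = New-AzureRmVirtualNetworkSubnetConfig -Name "Storage" -AddressPrefix "10.0.3.0/24"
$corp    = New-AzureRmVirtualNetworkSubnetConfig -Name "Corp"    -AddressPrefix "10.0.4.0/24"

# A single regional VNet: traffic between these subnets stays on the Azure
# fabric with no gateway hop, so the separation adds no latency penalty.
New-AzureRmVirtualNetwork -Name "AppVNet" -ResourceGroupName "rg-app" `
    -Location "Central US" -AddressPrefix "10.0.0.0/16" `
    -Subnet $dmz, $app, $storage, $corp
```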

This is also why, on Azure, there is practically no restriction on the number of subnets you can create. Always remember, limitations on the cloud are enforced when either (a) the feature consumes some resource or (b) the feature has design limitations. In the case of subnets, having no meaningful restriction on their number tells us that there is neither a design restriction nor any extra resource cost to Azure.

However, if you create separate networks on Azure (similar to the above design), there will be performance implications on Azure too, because (a) communication between two networks happens via dedicated gateways, which can become performance bottlenecks, and (b) the resources themselves may be physically far away from each other, adding cross-network communication latency.

2. The second thing you will notice in the above network design is that once resources have been organized into separate networks, security boundaries are enforced. For example, inbound internet traffic is allowed into the DMZ network (of course with inspection rules), but direct access to any other network from the internet is not allowed. The public-facing network (the DMZ) can access the application network but no other internal network (the app network provides a cushion between the DMZ and internal resources). The app network can access the internal networks.

Azure Design Implication:

After our discussion (in point 1) on separating resources via dedicated networks vs. subnets within the same network, I’m assuming we are going with the subnet model. Now, on Azure, this kind of communication restriction between subnets is generally implemented using two approaches:

Approach #1: Using Network Security Groups (NSG):

In this approach, we simply implement the rules of communication using NSGs and apply them to subnets. NSGs are simple packet-filtering rules applied at the network and transport layers; they cannot handle advanced threats that occur at higher layers of the network stack. For example, NSGs cannot detect that an intrusion has occurred or that some malicious code has already entered your system (for example, a bot trying to launch a DoS attack on your server from inside the network). These advanced scenarios are handled by security appliances that work at higher layers, analyzing packets that are closer to your application. At the lower layers you might not even see an issue, because all you analyze there are the IP addresses, ports, and layer-4 protocol (the famous 5-tuple inspection done by NSGs).
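As a rough sketch of this approach (reusing the assumed names and address prefixes from the earlier snippet; rule names and priorities are also assumptions), the following defines two 5-tuple rules, allowing HTTPS from the DMZ subnet into the App subnet and denying other inbound VNet traffic to it, and binds them to the App subnet:

```powershell
# Assumed rule set: allow DMZ -> App on TCP 443, deny other inbound VNet traffic.
$allowDmz = New-AzureRmNetworkSecurityRuleConfig -Name "Allow-DMZ-To-App-443" `
    -Access Allow -Direction Inbound -Priority 100 -Protocol Tcp `
    -SourceAddressPrefix "10.0.1.0/24" -SourcePortRange * `
    -DestinationAddressPrefix "10.0.2.0/24" -DestinationPortRange 443

$denyVnet = New-AzureRmNetworkSecurityRuleConfig -Name "Deny-VNet-To-App" `
    -Access Deny -Direction Inbound -Priority 200 -Protocol * `
    -SourceAddressPrefix "VirtualNetwork" -SourcePortRange * `
    -DestinationAddressPrefix "10.0.2.0/24" -DestinationPortRange *

# Create the NSG and associate it with the App subnet of the VNet created earlier.
$nsg = New-AzureRmNetworkSecurityGroup -Name "App-NSG" -ResourceGroupName "rg-app" `
    -Location "Central US" -SecurityRules $allowDmz, $denyVnet

$vnet = Get-AzureRmVirtualNetwork -Name "AppVNet" -ResourceGroupName "rg-app"
Set-AzureRmVirtualNetworkSubnetConfig -VirtualNetwork $vnet -Name "App" `
    -AddressPrefix "10.0.2.0/24" -NetworkSecurityGroup $nsg | Set-AzureRmVirtualNetwork
```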

Approach #2: Using Network Security Appliances (NSA) and User Defined Routes (UDR)

In this approach, we install special virtual devices called NSAs (also known as network virtual appliances). NSAs come from different vendors like Barracuda, Trend Micro, etc., and they come in different forms with different features: some are firewalls, some have Intrusion Detection (ID) and Intrusion Prevention (IP) capabilities, some are simple host antiviruses… so depending on your need, you would buy one. Unless you buy a host-based NSA (which gets installed on the same machine as your apps), you will end up using a dedicated resource (maybe a dedicated VM). As an example, let’s say you want to use a firewall configured with advanced rules for restricting your resources’ outside communication or inter-subnet communication. To make sure that this NSA can apply those rules, you need to make sure that all communication goes via this NSA. This is implemented on Azure using UDRs. So, in a nutshell, you do the following:

–> Buy and install a Network Security Appliance (NSA)
–> Configure the NSA for your needs
–> Create User Defined Routes (UDR) to make sure that all communication passes through the NSA
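As a hedged sketch of that last step, continuing with the assumed names from the earlier snippets (and an assumed NSA private IP of 10.0.5.4), a UDR forcing App-bound traffic from the DMZ subnet through the NSA could look like this:

```powershell
# Assumed route: send traffic destined for the App subnet to the NSA first.
$viaNsa = New-AzureRmRouteConfig -Name "DMZ-To-App-Via-NSA" `
    -AddressPrefix "10.0.2.0/24" -NextHopType VirtualAppliance `
    -NextHopIpAddress "10.0.5.4"    # assumed private IP of the NSA VM

$routeTable = New-AzureRmRouteTable -Name "DMZ-Routes" -ResourceGroupName "rg-app" `
    -Location "Central US" -Route $viaNsa

# Attach the route table to the DMZ subnet; every matching packet now funnels
# through the NSA, which is why the NSA's throughput must be load-tested.
$vnet = Get-AzureRmVirtualNetwork -Name "AppVNet" -ResourceGroupName "rg-app"
Set-AzureRmVirtualNetworkSubnetConfig -VirtualNetwork $vnet -Name "DMZ" `
    -AddressPrefix "10.0.1.0/24" -RouteTable $routeTable | Set-AzureRmVirtualNetwork
```

(Note that the NSA’s network interface also needs IP forwarding enabled so it can pass traffic that isn’t addressed to it.)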

However, be aware that when we implement NSAs, we are effectively funneling all communication through a single compute node. So, make sure to evaluate the performance criteria before implementing an NSA with UDRs. For example, make sure the NSA you buy has a scalability option, and make sure to perform load testing, especially for scenarios where inter-subnet communication is funneled through (and potentially limited by) the NSA.

Placing Resources Close Together on Azure

Let me go into a little more detail on Azure networks, subnets, the distance between resources, and the performance/latency implications. As I mentioned earlier, as long as your resources belong to the same network, communication between them will be fast. However, there are a few tricks you can use to further enhance communication between resources. Earlier, Azure offered the concept of “affinity groups”, which allowed us to tell Azure to place all resources within a single “scale unit”. A scale unit was generally considered to be a compute cluster whose machines are close together, so you might end up with all your resources within the same rack or even within the same host machine (if you did not use availability sets). Now that affinity groups are no longer recommended, it is important to understand what replaces that feature.

In scenarios where latency between two server resources is “ultra, super… important”, you still need a concept similar to affinity groups. Unfortunately, however, there is no direct replacement for that feature. Your resources can now end up farther away from each other even if they are within the same network. A network is “regional” now, which means you would typically select, say, “Central US” for your network and resources. However, within a region there can be multiple datacenters hundreds of miles apart, and even within a datacenter we may have to traverse multiple routers between two racks (generally the case when a datacenter is very large… you cannot have all the resources on the same physical network, or you end up with scenarios like packet flooding by a faulty node affecting the entire datacenter).

So, now that affinity groups are not available (at least I don’t see them in the Azure portal; they may still be available via PowerShell), what we can do is make sure that the VMs we create are of the same size and belong to the same region/network. Azure has compute units in its datacenters, and they are available in clusters at a single location. For example, if Microsoft introduces a new fancy VM size in the Central US region, it may be available only in a particular datacenter. Now, if you create two VMs, one a plain old mid-sized VM and the other the latest one Microsoft introduced, chances are higher that they will be placed far apart (though within the same region). So, if inter-resource communication is super important, stick to the same size for all the individual resources. Of course, this is not a guarantee, but it gives you a more favorable chance, and that will likely continue to be the case in the future. Microsoft itself tries to place compute units as close together as possible, but if we don’t give it the chance, we will end up with machines that are farther apart. There are many other factors that influence the datacenter Microsoft chooses for your machine, for example, the availability of resources of a particular architecture in a datacenter.

An Example Scenario:

Without naming the organization, I want to give an example where we actually used the above concepts. In one large-scale Azure implementation, the customer knew the load to expect on particular dates. The customer wanted to make sure that the machines were close together when we scaled out for the additional load. Now, if we had used the regular approach of dynamic scaling (increase compute units only when CPU% crosses some threshold), we would have ended up with compute units in different datacenters, as we knew that the datacenter closest to us was small. So, with the help of Microsoft Product Groups and Business Groups (since it was a very large scale deployment, we got extra support), we created VMs of a particular size (we were told the size that a specific datacenter supports) and created them a few days in advance. After that, we did not shut down the VMs, because if we shut them down, next time we might get a different datacenter. After the high-load period passed, we switched back to the regular auto-scaling approach. One could always ask: what if those VMs faced runtime issues and Azure needed to re-provision them? After all, this is what we are taught (that cloud is commodity hardware and we should be ready for compute failures, etc.). Well, the reality is a little different. In reality, all datacenters are not alike… some use commodity hardware while some don’t. And compute units failing is not as frequent as we are told, if we land in better datacenters. Even if a few compute units did fail, it would not have been catastrophic… we had actually overprovisioned relative to what we needed.

As mentioned earlier, I will continue analyzing other network design approaches in later parts; otherwise, a single article would become too long to read.

Key Takeaways:

  • Resources are organized into networks or subnets based on different business needs like performance, security, and management. It is important to evaluate the impact of your network design on these aspects.
  • If resources are too far away (geographically) or separated into different networks (via dedicated routers/gateways), then the performance of communication between them suffers.
  • On Azure, as long as two resources are within the same virtual network, the communication latency between them will be low (hence better performance).
  • On Azure, separating resources by subnets (in the same network) does not affect their communication performance. But be careful: the exception is using NSA appliances with UDRs, in which case communication performance is bounded by the throughput offered by the NSA devices.
  • When we connect PaaS components to IaaS components (for example, an Azure web app with an Azure vNet), this happens via VPN. Make sure that the PaaS component is in the same region as the IaaS component; otherwise there will be communication latency between the resources. One quick tip is to keep all related resources within the same resource group and deploy them to that group’s location (note that a resource group’s location is primarily metadata; resources inside it can technically live in other regions, so verify each resource’s region).
  • On Azure, our control over resource placement stops at the “region” level. Earlier, the “affinity group” concept provided control at the datacenter level, but it is deprecated now. If you want resources to be really close to each other, try to create them with similar configuration (size, features used, etc.). “Chances” are (not guaranteed) that your resources will be placed within the same scale unit, or at least within the same datacenter.
  • It is a common misconception that Azure NSGs are a “weaker” or less secure way to protect your resources on Azure. It is important to understand that NSGs offer a different capability set and operate at a different layer of the OSI model. If you need fine-grained protection, you need to intercept packets at higher layers, which NSAs can do. Each component has its own place and purpose, and they complement each other in the overall security picture (firewalls, antivirus, Security Center, UDR, NSG, NSA, etc.). Expecting NSGs to meet NSA needs is not a good approach.
  • One of the key aspects of improving communication performance is to keep resources close to each other and close to your audience. Keeping them within the same network, using a uniform size for resources, scaling in advance and one unit at a time, keeping them within the same subnet (when using an NSA) or increasing the throughput of NSAs when they are in different subnets, and selecting the same region (when using a VPN to connect to the network) are some of the design aspects that create a solid design with minimum communication latency from an infrastructure perspective. There are many other things we can do from an application design perspective (e.g., caching), but that would take us off topic.
  • Other tips to enhance communication performance are to use standard features offered by Azure, like high-performance gateways for your VPNs, ExpressRoute, bigger compute units for more bandwidth, SSDs, etc. (of course, they mean extra cost). But those are “features” rather than “design” aspects, so I’ve skipped them for now. They may come as part of my other mini articles in the future.

An ideal network on Azure incorporating the above design points may look like the figure below:

[Figure: Recommended Azure design with an app network (DMZ and app subnets), a corp network, and PaaS resources connected via VPN]

Here,

  • “Common” resources that require “small” and “less frequent” data transfers are put in a separate network (the corp network). In this scenario, these resources also needed protection, so we did not create any public endpoint; the network is inaccessible from the public internet.
  • All application-specific resources are put within the same network and the same resource group (so that they are in the same region… not shown in the figure). These resources are separated by subnets for security needs. User Defined Routes with security appliances are optional but can be used. NSGs are used to filter packets between subnets before they even reach the security appliance.
  • The corp network is connected to the application network via VPN to enhance security. PaaS resources also communicate with the Azure network via VPN. The public internet is restricted to accessing only specific resources in the DMZ subnet of the app network.

Important:

There may be scenarios where some “common/shared” resources require heavy data usage by the application. In such scenarios, the recommended option is to use application-level designs (like caching) to avoid hitting resources in another network. If that is not feasible, then we can use infrastructure solutions like replication, where we keep a copy of the shared resource in the app network and maintain a sync schedule based on business needs. Performing bandwidth-heavy, frequent operations against resources in another network is the last thing you want to do.

I hope you enjoyed this article on network design aspects for enhancing performance (reducing communication latency) between the different components of your network on Azure.

In the next article, we’ll take a look at network design from “security” perspective.

This series is also mirrored at blogs.comtecinfo.com. The links below will be updated as new articles are posted.

-Rahul Gangwar
https://www.linkedin.com/in/gangwar
