Manage Latency Changes Between Cloud Regions

After living in Tornado Alley, USA, for more than twenty years, I have learned one important lesson: bad things can happen at the edges of clouds. Cloud edges are where the destructive power of straight-line winds and tornadoes wreaks its havoc. In a multi-cloud IT migration strategy, cloud edges can likewise be very problematic. When you need to get data from one cloud region to another, the strategy that you employ has consequences. For example, how does the strategy allow you to manage latency changes between cloud regions? The consequences you will have to deal with can affect your stability as well as your cost profile.

When implementing a cloud migration strategy, most IT disciplines focus on the inside of a cloud region. As a messaging and integration architect, my focus is different. The sad part is that cross-region integration ends up getting neglected and, most likely, will be devoid of the engineering constructs necessary to ensure stability and security. By ignoring that integration, you could also end up spending more on computing resources at exactly the time you can least afford to increase your expenses. As a messaging architect for a multi-cloud deployment, my concern is moving data from one cloud region or one cloud provider to another. Moving data inside a region or zone is almost inconsequential by comparison, because the risk profile facing intra-region communication is vastly lower than that facing inter-region communication.

The Prevalent Approach

A prevalent approach in cloud deployments is to use HTTP as the transport to move messages from one cloud region to another. In its purest form, an application in region A executes a REST service provided by an application in region B. To ensure that the application in region A has sufficient capacity to handle its incoming requests, its capacity plan needs to account for not only the processing time of the application in region B, but also the round-trip network latency to get messages from region A to region B and back.

[Figure: Prevalent Approach without Messaging]

In a previous blog post, I discussed some of the ‘hard things’ that must be dealt with when moving mission-critical workloads to the cloud. One of those items is the networking between cloud regions. If your capacity plan for the application in region A assumes a 15 ms latency between regions A and B, what happens when the latency jumps from 15 ms to 60+ ms due to a network failover? In all likelihood, the application in region A will either experience thread exhaustion or spin up enough new instances to handle the workload while overcoming the increase in latency. What happens in either of those cases? With thread exhaustion, you are most likely in an outage, and depending upon your agreements with your customers, you may be paying a penalty. When additional application instances are created to handle the slowdown, you have to pay for those VMs. In both cases, expenses increase to deal with something that should have been factored into the technical architecture of your system. The question that needs to be asked is, “How can we manage changes in latency between cloud regions cost-effectively?”
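To see why this matters for capacity, here is a rough, back-of-the-envelope sizing model. The request rate and processing time below are hypothetical, and the model assumes each request in region A holds a thread for the full round trip:

\[
\text{threads needed} \approx \text{request rate} \times (\text{region B processing time} + \text{round-trip latency})
\]
\[
200/\text{s} \times (10\,\text{ms} + 15\,\text{ms}) = 5 \qquad \text{vs.} \qquad 200/\text{s} \times (10\,\text{ms} + 60\,\text{ms}) = 14
\]

Nearly three times the concurrency is needed just to hold throughput steady, even though region B is doing no additional business work.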

What’s Needed to Better Manage Latency Changes

One answer to that question involves adding another component to the system.  The intent behind adding this new component should include the following:

  1. Provide a mechanism to support nearly identical response times within a cloud region regardless of latency changes between cloud regions
  2. Provide a mechanism to compensate for changes in latency between cloud regions that is encapsulated so that applications running in each cloud region are not affected
  3. Provide a reusable component so that all applications needing inter-cloud region communication will not need to reinvent the wheel. 

So, with this new component, there needs to be a mechanism to overcome changes in latency. Since the slowdown occurs in the network layer, there isn’t a significant increase in demand for computing resources; spinning up new VMs, as you would to handle additional volume, is overkill. What typically needs to happen is additional parallel processing. That need for increased parallelism is what drives people to spin up new VMs when they haven’t properly planned their technical architecture for network slowdowns. To overcome a slowdown from 15 ms to 60 ms, 80 ms, and beyond, all that is needed is a handful of lightweight processing threads to handle the network traffic.
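As a rough rule of thumb (assuming the transfer is latency-bound, so each stream spends most of its time waiting on acknowledgements rather than on CPU), the amount of extra parallelism needed scales with the latency increase:

\[
\text{parallel streams needed} \approx \left\lceil \frac{\text{observed round-trip latency}}{\text{planned round-trip latency}} \right\rceil, \qquad \left\lceil \frac{60\,\text{ms}}{15\,\text{ms}} \right\rceil = 4
\]

A handful of extra threads or channel instances is a far cheaper way to restore throughput than a handful of extra VMs.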

A More Efficient Approach

One solution to this issue is to use IBM MQ as that additional architectural component. Having been around for a few decades, IBM MQ was designed to overcome the problems of unreliable, high-latency networks. In other words, it is a perfect fit for inter-cloud region communications.

With applications connecting to queue managers inside the cloud region in which they are running, the application-to-queue-manager latency will remain reasonably stable. When the application puts an outgoing message on a transmit queue, that application thread frees up either to continue processing or to post a read on the reply-to queue. Either way, those operations should complete in a reasonably consistent timeframe. Once the application has put the message on the transmit queue, it is up to IBM MQ to get the message to and from the remote cloud region in a reasonable timeframe.

For this illustration, I’ll focus on the functionality available with MQ Clustering, because of the dynamic reconfiguration that MQ Clustering provides. The figure below illustrates the addition of the new component to the architecture.

[Figure: Messaging Component Layer]

The queue managers in each region both participate in the same MQ cluster. With this configuration, you’d have one cluster receiver channel definition on each queue manager. You’d also have a cluster sender channel defined on each queue manager pointing to the full cluster repositories (I have not illustrated the full repositories, to keep the picture simple). When communication needs to be established between these queue managers, each queue manager’s cluster receiver channel definition is used to auto-create a cluster sender channel from the other queue manager back to it.
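As a sketch of what that could look like in MQSC on the region A queue manager (the queue manager names, full repository name, and connection names here are hypothetical; TO.QMA.CLUSTERA matches the channel name used in the next section):

```
* On QMA (region A): the cluster receiver that is advertised to the repositories
DEFINE CHANNEL(TO.QMA.CLUSTERA) CHLTYPE(CLUSRCVR) TRPTYPE(TCP) +
       CONNAME('qma.region-a.example(1414)') CLUSTER(CLUSTERA) +
       DESCR('Cluster receiver for QMA in CLUSTERA')

* On QMA: a manually defined cluster sender pointing at a full repository
DEFINE CHANNEL(TO.QMFR1.CLUSTERA) CHLTYPE(CLUSSDR) TRPTYPE(TCP) +
       CONNAME('qmfr1.example(1414)') CLUSTER(CLUSTERA) +
       DESCR('Manual cluster sender to full repository QMFR1')

* QMB in region B would have an equivalent pair (e.g., TO.QMB.CLUSTERA);
* the sender channels between QMA and QMB themselves are auto-defined.
```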

Running Parallel Channel Instances

This approach gives you two different methods of overcoming increased latency. The first is to create additional cluster receiver channels on each queue manager. Assume that the initial cluster receiver channel is defined with the name TO.QMA.CLUSTERA. You can copy that channel definition to create TO.QMA.CLUSTERA.1 and TO.QMA.CLUSTERA.2. As soon as they are created and replicated to the repositories, they are started and begin transferring data. With the appropriate monitoring in place, the definition of these new channels can be handled via automation. You could end up with a configuration similar to the figure below, followed by a sketch of the commands involved.

[Figure: Adaptable Messaging Layer]
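A sketch of the MQSC behind that figure, again with hypothetical names: the LIKE parameter copies the attributes of the original cluster receiver, so only the new channel names need to be supplied.

```
* On QMA: clone the existing cluster receiver to create parallel instances
DEFINE CHANNEL(TO.QMA.CLUSTERA.1) CHLTYPE(CLUSRCVR) LIKE(TO.QMA.CLUSTERA)
DEFINE CHANNEL(TO.QMA.CLUSTERA.2) CHLTYPE(CLUSRCVR) LIKE(TO.QMA.CLUSTERA)

* Repeat on QMB (e.g., TO.QMB.CLUSTERA.1, TO.QMB.CLUSTERA.2) if traffic flows
* in both directions; monitoring/automation can issue these commands when
* cross-region latency crosses a threshold.
```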

Compressing Message Payloads

The other option can be used either independently of, or in conjunction with, creating additional channel definitions: message compression. Compressing message payloads takes CPU time, so it is something you want to avoid unless it is needed. A drastic increase in latency, however, is usually a sign that compression may help. When selecting a compression algorithm, consider ZLIBFAST, because it provides a reasonably effective compression ratio at a reasonable cost. RLE is another option, but its effectiveness is highly dependent upon the data being compressed.
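A sketch of enabling compression with MQSC, using the channel names from the earlier illustration: COMPMSG takes an ordered preference list, and because auto-defined cluster senders inherit their attributes from the target queue manager’s cluster receiver, setting it on the cluster receivers should generally be sufficient.

```
* On QMA: prefer ZLIBFAST, fall back to no compression
ALTER CHANNEL(TO.QMA.CLUSTERA) CHLTYPE(CLUSRCVR) COMPMSG(ZLIBFAST,NONE)

* On QMB: mirror the setting
ALTER CHANNEL(TO.QMB.CLUSTERA) CHLTYPE(CLUSRCVR) COMPMSG(ZLIBFAST,NONE)

* The new setting takes effect the next time the channel instances start.
```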

With both of these options, the key takeaway is that by putting the right architecture in place, you will be better able to deal with the inevitable need to manage latency changes between cloud regions. Using IBM MQ to illustrate a solution to this issue was also intended to show that we already have solutions available for the daunting task of moving some of the remaining 80% of corporate workloads to the cloud.

If you would be interested in discussing your specific cloud migration project, please reach out via the form on my Services page. 
