Moving Workloads to the Cloud – Mission-Critical Systems

While attending IBM’s Think 2020 Digital Experience, I heard from both IBM and Accenture that companies had migrated only 20% of their workloads to the cloud. What I heard next caught my attention – “The hard stuff is what is next.”

The “hard stuff” they referred to is the mission-critical applications that keep companies up and running. One thing I learned during my tenure in IT Operations is that not all systems are created equal. While a severity-one outage with a mission-critical application captures the attention of the entire operations staff as well as several layers of management, no one may even notice that a lower-ranked system is down. I’ve heard of multiple incidents where a member of an infrastructure team had to inform a support team that their application had been down for weeks (or longer). With a mission-critical application, your customers will notify you that your system is down if you don’t already know. So, when IBM and Accenture say that “the hard stuff is what is next,” this is going to get interesting!

What is the “Hard Stuff” in Moving Workloads to the Cloud?

When designing a messaging solution for a mission-critical application, you need to address some fundamental concerns.  Those concerns include latency tolerance, data sequencing affinity, data loss tolerance, data security, throughput requirements, and data duplication tolerance.  When the complexity of a hybrid, multi-cloud migration gets added to the equation, the difficulty in resolving those factors increases dramatically. 
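It can help to capture those concerns as explicit, reviewable requirements for each application before any migration design work begins. The sketch below is one minimal way to do that in Python; the field names and sample values are illustrative assumptions, not figures from any particular system.

```python
from dataclasses import dataclass

@dataclass
class MessagingRequirements:
    """Fundamental messaging concerns for one mission-critical application."""
    max_latency_ms: int          # latency tolerance, end to end
    strict_ordering: bool        # data sequencing affinity
    may_lose_messages: bool      # data loss tolerance
    encryption_in_transit: bool  # data security
    peak_msgs_per_sec: int       # throughput requirement
    duplicates_tolerated: bool   # data duplication tolerance

# Example values are illustrative only.
payments = MessagingRequirements(
    max_latency_ms=50,
    strict_ordering=True,
    may_lose_messages=False,
    encryption_in_transit=True,
    peak_msgs_per_sec=2_000,
    duplicates_tolerated=False,
)
```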

Hard Stuff:  Increased number of dependencies

Once you leave the confines of a single data center deployment, new headaches emerge as you become more dependent upon other parties. For example, one company where I worked utilized two physical data centers with a 13-millisecond network latency between them. By adding that technical requirement, we not only added a dependency on the telco providing the networking service, but we also added a dependency on whichever construction company or government agency was working on the roadways or railways where the fiber had been laid. Whenever there was a fiber cut, we needed to switch over to the backup route, which had a latency of 80 milliseconds. Because TCP throughput degrades rapidly as latency increases, using the backup network route was almost the same as losing that second data center entirely.
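To see why the 80-millisecond route hurt so much, remember that a single TCP connection's throughput is roughly bounded by its window size divided by the round-trip time. The quick calculation below treats the quoted latencies as round-trip times and assumes a 64 KB window purely for illustration; real window sizes depend on OS tuning and window scaling.

```python
# Rough upper bound on single-connection TCP throughput: window / round-trip time.
# The 64 KB window is an assumption for illustration; real stacks can scale it higher.
WINDOW_BYTES = 64 * 1024

for rtt_ms in (13, 80):
    throughput_mb_s = WINDOW_BYTES / (rtt_ms / 1000) / 1_000_000
    print(f"RTT {rtt_ms:>2} ms -> about {throughput_mb_s:.1f} MB/s per connection")

# RTT 13 ms -> about 5.0 MB/s per connection
# RTT 80 ms -> about 0.8 MB/s per connection
```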

Hard Stuff:  New characteristics of the cloud

Now, when you add the dynamic nature of cloud providers to the equation, some of that “hard stuff” comes into clearer view. In the cloud, for example, virtual machines may simply go away. If that happened in a private data center, the compute support team would spend sleepless nights making sure it never happened again. When your compute is in the cloud and you lose a virtual machine, the most you are expected to say is, “Oh.”

Moving mission-critical workloads to the cloud while dealing with an intolerance for increased latency, ensuring that data sequence is maintained, and keeping data secure from eavesdropping, yes, that's hard, but it isn't impossible.

While cloud-based integration is different from on-premise, monolithic application integration, some of the concepts we use in messaging for on-premise deployments remain key to a successful deployment in the cloud. While addressing the new aspects of cloud-based integration, I suggest you rely on those concepts as your anchor points. Those concepts are service orientation, dynamic routing, and continuous availability.

Service Orientation versus Location Orientation

You are most likely in trouble when you find yourself discussing which specific cloud region an application should direct messages to for its downstream services. The reason is that the upstream application shouldn't know where its dependencies are currently running. If each upstream application has that awareness, failing a service over from one region to another would require a significant amount of coordination. Alternatively, a centralized operations control staff would need to know the primary, secondary, and tertiary locations for each service so that, in the event of an outage at the primary site, they could redirect the message flow. Neither of those options is ideal because both most likely involve configuration changes.

Ideally, the location of the primary, secondary, and tertiary sites for each service is isolated from the application calling the service. Also, while a centralized operations staff should have visibility into where services are running and be able to take specific services down when necessary, their involvement should be minimal to avoid redirecting messages to the wrong location. When using a transport technology such as IBM MQ, the sending application writes a message to a queue. Where the message ends up is irrelevant as long as it gets processed according to the Operational Level Agreement. The architectural goal is to keep existing configurations in place while having a method of redirecting traffic away from specific instances or regions.
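As a minimal sketch of that decoupling, the snippet below uses the pymqi client to put a message on a logical queue name. The queue manager, channel, connection details, and queue names are hypothetical; the point is that the application names only a destination queue and its local connection point, never a cloud region, and the MQ topology (queue aliases or cluster queues) decides where the message actually lands.

```python
import pymqi  # IBM MQ client library for Python

# Hypothetical connection details: a local or nearby gateway queue manager.
QMGR_NAME = "GATEWAY.QM"
CHANNEL = "APP.SVRCONN"
CONN_INFO = "mq-gateway.internal(1414)"

qmgr = pymqi.connect(QMGR_NAME, CHANNEL, CONN_INFO)

# The application only knows a logical queue name; where the consuming
# service runs (which region, which instance) is resolved by the MQ topology.
queue = pymqi.Queue(qmgr, "PAYMENTS.REQUEST")
queue.put(b'{"payment_id": 1234, "amount": "19.99"}')

queue.close()
qmgr.disconnect()
```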

Dynamic Message Routing

Once a requesting application sends a message, there needs to be a dynamic routing capability to get the message to the correct service wherever that service is currently running. The driver for this can take many forms. Due to data sovereignty laws, the infrastructure may need to route certain data to specific locations. Or, because a company may segment its customer base across different cloud regions for latency reasons, the data may need to pass through a content-based routing solution of the kind typically included in an enterprise service bus. Those are the kinds of issues that surface on a day-in, day-out basis. Beyond them, the infrastructure also needs to re-route messages dynamically to support regular operations and to handle outages of either a single service or an entire cloud region.
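A content-based router can be as simple as a lookup from a field in the message to a per-region destination. The sketch below shows the idea; the country codes, region names, and queue-naming convention are assumptions for illustration, not any specific product's configuration.

```python
# Minimal content-based routing sketch: the destination is chosen from the
# message's content, not from anything configured in the sending application.
ROUTING_RULES = {
    "DE": "eu-central",   # data sovereignty: keep German customer data in the EU
    "FR": "eu-central",
    "US": "us-east",
    "CA": "us-east",
}
DEFAULT_REGION = "us-east"

def route(message: dict) -> str:
    """Return the logical destination queue for this message."""
    region = ROUTING_RULES.get(message.get("customer_country", ""), DEFAULT_REGION)
    return f"{region}.ORDERS.REQUEST"

if __name__ == "__main__":
    print(route({"customer_country": "DE", "order_id": 42}))  # eu-central.ORDERS.REQUEST
    print(route({"customer_country": "BR", "order_id": 43}))  # us-east.ORDERS.REQUEST (default)
```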

Continuous Availability

While integrating legacy, monolithic applications typically involves designing them to be highly available and, at times, having disaster recovery plans so that support teams can restore a service within days, the same concept exists in the cloud, but it has been taken up a notch (or several notches). There is an expectation that applications running in the cloud are continuously available. Fulfilling that expectation can take a few different forms. In the first, several instances of the application run in parallel in multiple cloud regions; the infrastructure then needs to distribute the transaction load across all running instances. In another, an application instance in one specific region takes all the transaction load, and instances in the other regions serve as standby/failover instances.

Redundancy Approaches

The first case is probably the one most familiar to people. There is a load balancer in front of application instances running in different cloud regions. As a message comes into the load balancer, the load balancer determines the next application instance to send the message to, and off it goes. That's fine as long as the application's processing time is relatively uniform. Otherwise, the load balancer may forward a message to an instance that doesn't have any threads available. Most companies wouldn't necessarily have that issue, but for companies with very high transaction counts and a wide variance in processing times, it can be a problem.
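One common way around that problem is to track in-flight work per instance and send each new message to the least-loaded instance rather than simply rotating. The sketch below illustrates the idea; the instance names are hypothetical, and a real load balancer would also handle health checks, timeouts, and concurrency.

```python
class LeastOutstandingBalancer:
    """Pick the instance with the fewest in-flight messages, so a slow
    instance does not keep receiving work it has no free threads for."""

    def __init__(self, instances):
        self.in_flight = {name: 0 for name in instances}

    def dispatch(self) -> str:
        target = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[target] += 1
        return target

    def complete(self, instance: str) -> None:
        self.in_flight[instance] -= 1

if __name__ == "__main__":
    lb = LeastOutstandingBalancer(["us-east-1a", "us-east-1b", "eu-central-1a"])
    first, second = lb.dispatch(), lb.dispatch()  # spread across idle instances
    lb.complete(first)                            # first instance finishes its work
    print(lb.dispatch())                          # the freed instance gets the next message
```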

The most challenging situation involves applications with message affinities that run primary in one cloud region while standby instances run in the other regions. Two things need to occur. First, messages need to be routable to both the primary and secondary locations within a reasonable amount of latency, so that the infrastructure can ensure messages reach the appropriate endpoint regardless of which one is active. Second, the infrastructure needs to replicate messages delivered to the primary instance over to the secondary instance. By taking those two steps, the infrastructure can honor the affinities.
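The sketch below illustrates the second step under simplified assumptions: the transport call, region names, and the customer-to-queue affinity rule are all hypothetical stand-ins, and real replication would normally happen inside the messaging layer (for example, replicated queues) rather than in application code.

```python
ACTIVE_REGION = "us-east"
STANDBY_REGION = "eu-central"

def send(region: str, queue: str, message: dict) -> None:
    # Placeholder for the real transport, e.g. a put to that region's queue manager.
    print(f"[{region}] {queue} <- {message}")

def deliver_with_affinity(message: dict) -> None:
    """Keep one customer's messages on one queue (the affinity) and replicate
    each message to the standby region so a failover can resume the sequence."""
    queue = f"ORDERS.{message['customer_id'] % 4}"
    send(ACTIVE_REGION, queue, message)
    send(STANDBY_REGION, queue, message)

if __name__ == "__main__":
    deliver_with_affinity({"customer_id": 17, "seq": 1, "body": "create order"})
    deliver_with_affinity({"customer_id": 17, "seq": 2, "body": "add item"})
```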

Redundancy Levels

With both of these scenarios, there should be two levels of redundancy. The first is that sufficient capacity should exist within a specific region to tolerate some instability. From a messaging standpoint, there should be enough IBM MQ queue managers within a particular region to allow some of them to be down for maintenance, and the applications using those queue managers should not be affected by that state. The second level is when an administrative action is issued to take down an entire region; at least one other region should be available to process the transaction load. The case where there are message affinities is more sensitive and needs to be handled in a coordinated effort, ideally via automation. In both cases, though, the intent should be to perform those region-level failovers without impacting your customers.
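The selection logic below sketches those two levels under assumed names: several queue managers per region provide in-region redundancy, and an administratively disabled region forces traffic to the next region in the preference order. The queue manager names, region names, and health check are all hypothetical.

```python
# Sketch of the two redundancy levels; queue manager and region names are hypothetical.
REGIONS = {                        # preference order: the first region is primary
    "us-east":    ["QMUSE01", "QMUSE02", "QMUSE03"],   # several queue managers per region
    "eu-central": ["QMEUC01", "QMEUC02", "QMEUC03"],
}
DISABLED_REGIONS = set()           # populated by an administrative action, e.g. {"us-east"}

def pick_queue_manager(is_up) -> str:
    """Level 1: skip queue managers that are down for maintenance within a region.
       Level 2: skip a whole region that an administrator has taken out of service."""
    for region, qmgrs in REGIONS.items():
        if region in DISABLED_REGIONS:
            continue
        for qmgr in qmgrs:
            if is_up(qmgr):
                return qmgr
    raise RuntimeError("no queue manager available in any region")

if __name__ == "__main__":
    print(pick_queue_manager(lambda q: q != "QMUSE01"))  # QMUSE02: in-region redundancy
    DISABLED_REGIONS.add("us-east")
    print(pick_queue_manager(lambda q: True))            # QMEUC01: region-level failover
```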

Working with Application Architects when moving workloads to the cloud

While application architects can be very knowledgeable about the technical architecture they have worked with on-premise, that knowledge may not translate directly to cloud-based deployments when moving workloads to the cloud. While I focused this article on messaging, there are other aspects application architects must consider. To assist them in designing their overall approaches, you, as a messaging architect, should allocate time to educate application architects on how the technical architecture changes as part of a cloud migration.

Moving a company’s “bread and butter” systems to the cloud is something that shouldn’t be taken lightly.  However, it isn’t impossible.  By building upon experience gained in performing integration on-premise and then adding the lessons learned in moving non-critical workloads to the cloud, you have a good foundation.