IBM Leapfrogs Other Public Cloud Providers with Targeted Ecosystem

While AWS and Azure are recognized as the most prominent players in the public cloud wars, IBM has established a solid foothold in the public cloud space.  AWS and Azure have created environments in which companies can build scalable, robust infrastructure.  IBM’s recent tactic has been to create an environment in which companies can do much more: participate in an industry ecosystem.

In creating the financial services ecosystem within its cloud environment, IBM has established something that the other cloud providers have yet to offer: context.  Each industry has its own set of nuances, challenges, and forces that act upon it.  Those issues affect every company that participates in that industry; no single company is immune.  By creating that industry context, IBM has also given ISVs a motive to build their capabilities on the IBM Cloud.

The benefit of an ecosystem 

IBM has a triple win.  It has established an ecosystem that banks are motivated to adopt because of its security and regulatory functionality.  ISVs are encouraged to offer products in the ecosystem because of the number of customers already there.  And with many ISVs providing services, more financial services firms are likely to join.  Soon the IBM Cloud for Financial Services could benefit from the network effect of having a large footprint of clients.

The critical aspect of IBM’s cloud offering is that it has created an environment where the interdependent components of an industry, in this case financial services, can work together.  By deploying this ecosystem on a cloud computing platform, IBM has allowed that interdependent behavior to be consumed on a ‘pay-as-you-go’ basis.

The IBM Financial Services ecosystem must attract a wide variety of stakeholders to be successful.  The foundation of the ecosystem is a set of established companies.  However, just attracting established companies is not sufficient. The ecosystem must also possess a circle of life.  The circle of life, in this case, is the creation of new businesses that provide updated functionality to the industry.  Therefore, the ecosystem members must include universities, the developer community, accelerators, and startup companies.  Each member of the ecosystem becomes reliant upon the ecosystem for its continued viability.  

Universities need platforms for teaching their students and research.  Developer communities need application programming interfaces.  Accelerators need tools and capabilities. To sell their products, startup companies need a market.  Established companies need new functionality provided by startups, market opportunities for growth, and workers from the universities and developer communities.  

Why cloud computing isn’t enough

The strength of cloud computing is in its flexibility.  Cloud computing allows a company to scale up and scale down its resources as necessary based upon traffic demand.  Handling traffic demand is a technology problem.  It isn’t a business problem.  

When Harvard Business School professor Michael Porter presented the value chain concept, he listed technology as a supporting activity to the primary activities: inbound and outbound logistics, operations, marketing and sales, and service.  Companies participating in an industry ecosystem gain access to the external processes that drive growth in that industry, as well as to tools, customized for that ecosystem, that increase business productivity inside the company.

When a company migrates its technical infrastructure to the cloud, it gains quite a few efficiencies.  However, the basic process largely remains the same.  The technical service owner identifies the need for a new server, requests the server, and requests that software be installed on it; then the server is introduced into the production environment.  A business can introduce a cloud-based server into production in a fraction of the time it takes to introduce a server into a legacy data center.

It is a locally optimized solution.  

However, compared to participating in an ecosystem, merely leveraging cloud computing will likely not be enough to compete, especially once all industry participants have migrated to the cloud.  As an example, a business can shorten a process’ cycle time through automation, but automating a single function can, at best, only eliminate that function’s share of the cycle time.  By reengineering the entire process, however, a business can improve cycle time by orders of magnitude.

Ecosystems provide that orders-of-magnitude impact where cloud computing by itself does not.

What to watch for

Companies are working toward maturing their enterprise architectures to increase their ability to execute on their strategy.  To date, companies have been focused internally on that maturation process.  However, there will come a time when companies will need to turn their focus externally.  The goal of a company looking beyond itself will be to have seamless integration with external business partners.  With that seamless integration, the core company will have the ability to swap in and out entire companies that would comprise a virtual corporation.  

That level of integration is only possible when an ecosystem is built upon a common object and interaction model.  If that comes to pass, the flexibility that established companies would gain would be extraordinary.  Companies could introduce new functionality quickly.  Startup companies would have fewer barriers to marketing and selling their products.  The introduction of innovation could be constant.

Disclaimer:  While not an employee of IBM, the author is a 2020 IBM Cloud Champion.  

Manage Latency Changes Between Cloud Regions

After living over twenty years in Tornado Alley, USA, I have learned one important lesson.  Bad things can happen at the edges of clouds.  Cloud edges are where the destructive power of straight-line winds and tornadoes wreaks its havoc.  In an IT multi-cloud migration strategy, those cloud edges can likewise be very problematic.  When you need to get data from one cloud region to another, the strategy you employ has consequences.  For example, how does the strategy allow you to manage latency changes between cloud regions?  The consequences you will need to deal with can affect your stability as well as your cost profile.

When implementing a cloud migration strategy, most IT disciplines focus on the inside of a cloud region.  For a messaging and integration architect, the focus is different.  The sad part is that cross-region integration often ends up neglected and, most likely, devoid of the engineering constructs necessary to ensure stability and security.  By ignoring that integration, you could also end up spending more on computing resources when you can least afford to increase your expenses.  As a messaging architect for a multi-cloud deployment, my concern is moving data from one cloud region, or one cloud provider, to another.  Moving data inside a region or zone is almost inconsequential by comparison, because the risk profile of intra-region communication is vastly lower than that of inter-region communication.

The Prevalent Approach

A prevalent approach in cloud deployments is to leverage HTTP as the transport to move messages from one cloud region to another.  In its purest form, an application in region A executes a REST service provided by an application in region B.  To ensure that the application in region A has sufficient capacity to handle its incoming requests, the capacity plan for the application in region A needs to account not only for the processing time of the application in region B, but also for the round-trip network latency to get the messages from region A to region B and back.
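
As a rough illustration of this approach, the sketch below makes a synchronous, cross-region REST call with Python’s requests library.  The URL and timeout are assumptions for illustration, not values from any real deployment; the point is that the calling thread is tied up for region B’s processing time plus the full round-trip latency.

    import requests

    def call_region_b(order_id: int) -> dict:
        # The region A thread is blocked for region B's service time plus the
        # round-trip network latency between the regions.
        response = requests.get(
            f"https://api.region-b.example.com/orders/{order_id}",  # hypothetical endpoint
            timeout=2.0,  # fail fast instead of holding the thread indefinitely
        )
        response.raise_for_status()
        return response.json()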

[Figure: Prevalent Approach without Messaging]

In a previous blog post, I discussed some of the ‘hard things’ that must be dealt with when moving mission-critical workloads to the cloud.  One of those items is the networking between cloud regions.  If your capacity plan for the application in region A assumes a 15 ms latency between regions A and B, what happens when the latency jumps from 15 ms to 60+ ms due to a network failover?  In all likelihood, the application in region A will either experience thread exhaustion or spin up enough new instances to handle the workload while overcoming the increase in latency.  What happens in either of those cases?  With thread exhaustion, you most likely will be in an outage and, depending upon your agreements with your customers, you may be paying a penalty.  In the case where additional application instances are created to handle the slowdown, you pay for those VMs.  So, in both cases, expenses increase to deal with something that should have been factored into the technical architecture of your system.  The question that needs to be asked is, “How can we manage changes in latency between cloud regions cost-effectively?”
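
To see why a latency jump translates into thread exhaustion or extra instances, a back-of-the-envelope calculation using Little’s law (concurrent threads ≈ arrival rate × response time) is enough.  The 200 requests per second and 50 ms service time below are assumptions for illustration only.

    # Little's law: busy threads ~= arrival_rate * response_time.
    arrival_rate = 200      # requests per second arriving in region A (assumed)
    service_time = 0.050    # seconds spent by the region B service per request (assumed)

    for rtt in (0.015, 0.060, 0.080):   # inter-region round-trip latency in seconds
        response_time = service_time + rtt
        threads_needed = arrival_rate * response_time
        print(f"RTT {rtt * 1000:>4.0f} ms -> ~{threads_needed:.0f} busy threads")

    # 15 ms RTT keeps ~13 threads busy; at 60-80 ms the same traffic ties up ~22-26
    # threads, which is exactly what drives thread exhaustion or extra instances.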

What’s Needed to Better Manage Latency Changes

One answer to that question involves adding another component to the system.  The intent behind adding this new component should include the following:

  1. Provide a mechanism to support nearly identical response times within a cloud region regardless of latency changes between cloud regions
  2. Provide an encapsulated mechanism to compensate for changes in latency between cloud regions, so that applications running in each cloud region are not affected
  3. Provide a reusable component so that all applications needing inter-cloud region communication will not need to reinvent the wheel. 

So, with this new component, there needs to be a mechanism to overcome changes in latency.  Since the slowdown occurs in the network layer, there isn’t a significant increase in demand for computing resources.  Therefore, spinning up new VMs, as you would to handle additional volume, is overkill.  What typically needs to happen is that additional parallel processing must occur.  That need for increased parallel processing is what drives people to spin up new VMs when they haven’t properly planned their technical architecture for network slowdowns.  When trying to overcome network slowdowns from 15 ms to 60 ms, 80 ms, and so on, all that is needed is a few additional lightweight processing threads to handle the network traffic.

A More Efficient Approach

One solution to this issue is to deploy IBM MQ into that additional technical architecture component.  Having been around for a few decades, IBM MQ was designed to help overcome the problems of unreliable, high-latency networks.  In other words, it’s a perfect fit for inter-cloud-region communications.

With applications connecting to queue managers inside the cloud region in which they are running, the application-to-queue-manager latency will remain reasonably stable.  When the application puts the outgoing message on a transmit queue, that application thread frees up either to continue processing or to post a read on the reply-to queue.  Either way, those operations should all complete in a reasonably consistent timeframe.  Once the application gets that message on the transmit queue, it is up to IBM MQ to get the message to and from the remote cloud region in a reasonable timeframe.
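
A minimal request/reply sketch of that pattern, using the pymqi client for IBM MQ, is shown below.  The queue manager, channel, and queue names are assumptions for illustration; the key point is that the put returns as soon as the local queue manager has the message, so the application thread is not held for the inter-region round trip.

    import pymqi

    # Connect to a queue manager in the same region as the application (names assumed).
    qmgr = pymqi.connect('QMA', 'APP.SVRCONN', 'qma.region-a.example.com(1414)')

    md = pymqi.MD()
    md.MsgType = pymqi.CMQC.MQMT_REQUEST
    md.ReplyToQ = b'PAYMENTS.REPLY'            # where the remote service should answer

    request_q = pymqi.Queue(qmgr, 'PAYMENTS.REQUEST')
    request_q.put(b'{"orderId": 42}', md)      # returns once the local queue manager has the message
    request_q.close()

    # The thread is now free; here it simply waits (up to 5 seconds) for the reply.
    gmo = pymqi.GMO(Options=pymqi.CMQC.MQGMO_WAIT | pymqi.CMQC.MQGMO_FAIL_IF_QUIESCING,
                    WaitInterval=5000)
    reply_q = pymqi.Queue(qmgr, 'PAYMENTS.REPLY')
    reply = reply_q.get(None, pymqi.MD(), gmo)
    reply_q.close()
    qmgr.disconnect()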

For this illustration, I’ll be focusing my attention on the functionality that would be available with MQ Clustering.  The reason for that is the dynamic nature of changes that MQ Clustering provides.  The figure below illustrates the addition of the new component in the architecture.

[Figure: Messaging Component Layer]

The queue managers in each region both participate in the same MQ cluster.  With this configuration, you’d have one cluster receiver channel definition for each queue manager.  You’d also have a cluster sender channel defined on each queue manager that points to the full cluster repositories (I have not illustrated the full repositories to keep the picture simple).  When communication needs to be established between those queue managers, each queue manager’s cluster receiver channel definition is used to auto-create a cluster sender channel from one of these queue managers to the other.
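
For those who want to see what such definitions might look like, here is a sketch of the MQSC commands piped to runmqsc from Python.  The queue manager, cluster, channel names, and hosts are assumptions for illustration, and the script needs to run with MQ administrative authority on the machine hosting QMA.

    import subprocess

    # Cluster receiver for QMA plus a manual cluster sender to a full repository (names assumed).
    MQSC_FOR_QMA = """
    DEFINE CHANNEL(TO.QMA.CLUSTERA) CHLTYPE(CLUSRCVR) TRPTYPE(TCP) +
           CONNAME('qma.region-a.example.com(1414)') CLUSTER(CLUSTERA) REPLACE
    DEFINE CHANNEL(TO.QMREPOS1.CLUSTERA) CHLTYPE(CLUSSDR) TRPTYPE(TCP) +
           CONNAME('repos1.example.com(1414)') CLUSTER(CLUSTERA) REPLACE
    """

    subprocess.run(["runmqsc", "QMA"], input=MQSC_FOR_QMA, text=True, check=True)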

Running Parallel Channel Instances

This approach provides you with two different methods to overcome increased latency.  The first is to create additional cluster receiver channels on each queue manager.  Assume that the initial cluster receiver channel is defined with the name TO.QMA.CLUSTERA.  You can copy that channel definition to create TO.QMA.CLUSTERA.1 and TO.QMA.CLUSTERA.2.  As soon as they are created and replicated to the repositories, they are started and begin transferring data.  With the appropriate monitoring in place, the definition of these new channels can be handled via automation.  You could end up with a configuration similar to the figure below.
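
As a sketch of what that automation could look like, the snippet below defines extra copies of the cluster receiver channel once a monitored inter-region latency crosses a threshold.  The channel name, connection name, cluster name, and threshold are assumptions; the latency figure is expected to come from whatever monitoring you already have in place.

    import subprocess

    BASE_CHANNEL = "TO.QMA.CLUSTERA"
    CONNAME = "qma.region-a.example.com(1414)"   # assumed listener address
    CLUSTER = "CLUSTERA"
    LATENCY_THRESHOLD_MS = 40.0                  # assumed trigger point

    def define_parallel_channels(count: int) -> None:
        # Copy the base cluster receiver definition as .1, .2, ... so additional
        # channel instances run in parallel between the regions.
        commands = "\n".join(
            f"DEFINE CHANNEL({BASE_CHANNEL}.{i}) CHLTYPE(CLUSRCVR) TRPTYPE(TCP) "
            f"CONNAME('{CONNAME}') CLUSTER({CLUSTER}) REPLACE"
            for i in range(1, count + 1)
        )
        subprocess.run(["runmqsc", "QMA"], input=commands, text=True, check=True)

    def react_to_latency(current_latency_ms: float) -> None:
        # current_latency_ms would be supplied by your monitoring tooling.
        if current_latency_ms > LATENCY_THRESHOLD_MS:
            define_parallel_channels(2)   # creates TO.QMA.CLUSTERA.1 and .2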

[Figure: Adaptable Messaging Layer]

Compressing Message Payloads

The second option can be used either independently of, or in conjunction with, creating additional channel definitions: message compression.  Compressing message payloads takes time, so it is something you want to avoid unless it is needed.  When latency increases drastically, that is usually a sign that compression may help.  When selecting a compression algorithm, consider ZLIBFAST because it provides a reasonably effective compression ratio at a reasonable cost.  RLE is another option, but its effectiveness is highly dependent upon the data being compressed.
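
A sketch of enabling that on the cluster receiver channel is below; because auto-defined cluster sender channels inherit their attributes from the matching cluster receiver definition, setting it there is sufficient.  The names are the same illustrative assumptions as before, and the change only applies to channel instances started after the alteration.

    import subprocess

    # Offer ZLIBFAST message compression on the cluster receiver channel (names assumed).
    MQSC = "ALTER CHANNEL(TO.QMA.CLUSTERA) CHLTYPE(CLUSRCVR) COMPMSG(ZLIBFAST)"

    subprocess.run(["runmqsc", "QMA"], input=MQSC, text=True, check=True)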

With both of these options, the key takeaway is that by putting the right architecture in place, you are better able to deal with the inevitable need to manage latency changes between cloud regions.  The use of IBM MQ as an illustration for solving the issue discussed in this article was also intended to show that solutions are already available for the daunting task of moving the remaining 80% of corporate workloads to the cloud.

If you would be interested in discussing your specific cloud migration project, please reach out via the form on my Services page. 

Workloads To The Cloud – Mission Critical Systems

While attending IBM’s Think 2020 Digital Experience, I heard from both IBM and Accenture that companies had only migrated 20% of workloads to the cloud.  What I then heard from them caught my attention – “The hard stuff is what is next.” 

The “hard stuff” that they referred to is the mission-critical applications that keep companies up and running.  One thing that I learned during my tenure in IT Operations is that not all systems are created equal.  While a severity one outage with a mission-critical application captures the attention of the entire operations staff as well as several layers of management, no one may even notice that a lower-ranked system is down.  I’ve heard of multiple incidents where a member of an infrastructure team had to inform a support team that their application had been down for multiple weeks (or longer).  With a mission-critical application, your customers notify you that your system is down if you don’t already know.  So, when IBM and Accenture say that “the hard stuff is what is next,” this is going to get interesting!

What is the “Hard Stuff” in moving workloads to the cloud

When designing a messaging solution for a mission-critical application, you need to address some fundamental concerns.  Those concerns include latency tolerance, data sequencing affinity, data loss tolerance, data security, throughput requirements, and data duplication tolerance.  When the complexity of a hybrid, multi-cloud migration gets added to the equation, the difficulty in resolving those factors increases dramatically. 

Hard Stuff:  Increased number of dependencies

Once you leave the confines of a single data center deployment, new headaches emerge as you become more dependent upon other parties.  For example, one company where I worked utilized two physical data centers with a 13-millisecond network latency between them.  By adding that technical requirement, we not only added a dependency on the telco providing the networking service, but also on whichever construction company or governmental agency was working on the roadways or railways where the fiber had been laid.  Whenever there was a fiber cut, we needed to switch over to the backup route, which had a latency of 80 milliseconds.  With TCP performance degrading rapidly as latency increases, utilizing the backup network route was similar to losing that second data center entirely.
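
A rough calculation shows why.  With a fixed TCP window, per-connection throughput is bounded by the window size divided by the round-trip time; the 64 KB window below is an assumption for illustration (no window scaling or tuning).

    WINDOW_BYTES = 64 * 1024        # assumed, unscaled TCP window

    for rtt_ms in (13, 80):
        throughput_mb_s = WINDOW_BYTES / (rtt_ms / 1000) / 1_000_000
        print(f"RTT {rtt_ms:>2} ms -> at most ~{throughput_mb_s:.1f} MB/s per connection")

    # Roughly 5.0 MB/s at 13 ms versus 0.8 MB/s at 80 ms: a sixfold drop per connection.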

Hard Stuff:  New characteristics of the cloud

Now, when you add the dynamic nature of cloud providers to the equation, some of that “hard stuff” comes into clearer view.  In the cloud, for example, virtual machines may simply go away.  If something like that happened in a private data center, the compute support team would spend sleepless nights making sure it never happened again.  When your compute is in the cloud and you lose a virtual machine, the most you are expected to say is, “Oh.”

Moving mission-critical workloads to the cloud while dealing with an intolerance for increased latency, ensuring that data sequence is maintained, and keeping data secure from eavesdropping, yes, that’s hard.  But it isn’t impossible.

While cloud-based integration is different from on-premises, monolithic application integration, some of the messaging concepts we use in on-premises deployments remain key to a successful deployment in the cloud.  While addressing the new aspects of cloud-based integration, I suggest you rely upon those concepts as the anchor point from which to work.  Those concepts are service orientation, dynamic routing, and continuous availability.

Service Orientation versus Location Orientation

You are most likely in trouble when you are having discussions about which specific cloud region an application needs to direct messages to for its downstream services.  The reason is that the upstream application shouldn’t know where its dependencies are currently running.  If each upstream application has that awareness, failing a service over from one region to another would require a significant amount of coordination.  Alternatively, a centralized operations control staff would need to be aware of the primary, secondary, and tertiary locations for each service so that, in the event of an outage in the primary site, they can redirect the message flow.  Neither of those options is ideal because they most likely involve configuration changes.

Ideally, the location of the primary, secondary, and tertiary sites for each service is isolated from the application calling the service.  Also, while a centralized operations staff should have visibility into where services are running and be able to take specific services down when necessary, their involvement should be minimal to avoid redirecting messages to the incorrect location.  When using a transport technology such as IBM MQ, the sending application writes a message to a queue.  Where the message ends up is irrelevant as long as it gets processed according to the Operational Level Agreement.  The architectural goal is to maintain existing configurations but have a method of redirecting traffic away from specific instances or regions.

Dynamic Message Routing

Once a requesting application sends a message, there needs to be a dynamic routing capability to get the messages to the correct service wherever that service is currently running.  The driver for this may take many different forms.  Due to data sovereignty laws, the infrastructure may need to route certain data to specific locations.  Or, because a company may segment its customer base among different cloud regions for latency reasons, the data may need to pass through a content-based routing solution of the kind typically included in an enterprise service bus.  Those are the kinds of issues that surface on a day-in, day-out basis.  There is also the need to dynamically re-route messages to support regular operations, and to handle outages of either a single service or an entire cloud region.
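
As a minimal sketch of content-based routing, the snippet below picks a destination queue from a data-residency field on the message.  The field name and queue names are assumptions for illustration; a real enterprise service bus would make this decision with far richer rules.

    from typing import Dict

    ROUTES = {
        "EU": "PAYMENTS.REQUEST.EU.WEST",   # keep EU customer data in an EU region
        "US": "PAYMENTS.REQUEST.US.EAST",
    }
    DEFAULT_ROUTE = "PAYMENTS.REQUEST.US.EAST"

    def destination_queue(message: Dict) -> str:
        # Route on message content rather than on a hard-coded region.
        return ROUTES.get(message.get("residency", ""), DEFAULT_ROUTE)

    print(destination_queue({"orderId": 42, "residency": "EU"}))   # PAYMENTS.REQUEST.EU.WEST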

Continuous Availability

Integrating legacy, monolithic applications typically involves designing applications to be highly available and, at times, having disaster recovery plans so that support teams can restore a service within days.  The concept still exists in the cloud, but it has been taken up a notch (or several notches).  There is an expectation that applications running in the cloud are continuously available.  Fulfilling that expectation can take a few different forms.  In the first, several instances of the application run in parallel in multiple cloud regions, and the infrastructure distributes the transaction load across all running instances.  In another, an application instance in one specific region takes all the transaction load, and instances in the other regions serve as standby/failover instances.

Redundancy Approaches

The first case is the one that is probably most familiar to people.  There is a load balancer in front of different application instances running in different cloud regions.  As a message comes into the load balancer, the load balancer determines the next application instance to send the message to, and off it goes.  That’s fine as long as the application’s processing time is relatively uniform.  Otherwise, the load balancer may forward a message to an instance that doesn’t have any threads available.  Most companies wouldn’t necessarily have that issue, but for companies with very high transaction counts and a wide variance in processing times, it can be a problem.

The most challenging situation involves applications with message affinities that are primary in one cloud region but have standby instances running in other regions.  Two things need to occur.  First, messages need to be routable to both the primary and secondary locations within a reasonable amount of latency, so the infrastructure can ensure that messages are received at the appropriate endpoint regardless of which one is active.  Second, the infrastructure needs to replicate messages delivered to the primary instance to the secondary instance.  By taking those two steps, the infrastructure can resolve the affinities.

Redundancy Levels

With both of these scenarios, there should be two levels of redundancy.  The first is that sufficient capacity should exist within a specific region to tolerate some instability.  From a messaging standpoint, there should be sufficient IBM MQ queue managers within a particular region to allow for some of them to be down for maintenance, and the applications using those queue managers should not be alarmed by that state.  The second level is when an administrative action is issued to take down an entire region.  There should be at least one other entire region available to process the transaction load.  The case where there are message affinities is more sensitive and needs to be done in a coordinated effort – hopefully via automation.  In both cases, though, the intent should be to perform those region-level failovers without impact to your customers. 

Working with Application Architects when moving workloads to the cloud

While application architects can be very knowledgeable about the technical architecture they have worked with on-premises, that knowledge may not translate directly to cloud-based deployments.  While I focused this article on messaging, there are other aspects application architects must consider.  To assist them in designing their overall approaches, you, as a messaging architect, should allocate time to educate the application architects on how the technical architecture changes as part of a cloud migration.

Moving a company’s “bread and butter” systems to the cloud is something that shouldn’t be taken lightly.  However, it isn’t impossible.  By building upon experience gained performing integration on-premises and then adding the lessons learned from moving non-critical workloads to the cloud, you have a good foundation.