Your Next-gen Fail-safe Plan: Enterprise Multi-clusters


Meeting business and user expectations for the performance and availability of modern cloud-native apps is challenging. To do so, businesses should embrace cooperative multi-clusters for their ability to reduce risk and keep applications running through a disaster.

Redundancy reduces risk. This is a long-standing tenet of the business world: we’ve known for a long time that it’s best to avoid single points of failure for critical business functions. Critical services are replicated. Critical roles have backups. Critical infrastructure has a disaster recovery plan. As organizations move into the cloud-native world, it’s becoming common for the fail-safe pattern to encompass not just individual microservices (which are easy to replicate in Kubernetes) but entire clusters, often spanning multiple availability zones (AZs). 

This is a natural extension of the long trend of redundancy. Instead of having clusters be completely independent, why not structure things to take advantage of excess capacity in one cluster to help the other? Even better: instead of replicating the same application services in each cluster, think about clusters the way we used to think of namespaces to provide isolation boundaries between microservices in the same application.

As enterprise multi-cluster usage changes, infrastructure projects need to change to support these new patterns. Looking at the history behind this trend and some of the resulting considerations may just convince your team that multi-clusters are your next backup plan.

See also: 3 Ways Kubernetes Adoption Fosters Resiliency

How did we get to multi-clusters?

To understand how we got here, we must look back in history to the pre-cloud era. Back then, you would have your big application server and a big database server, and whenever one of them would crash and your site or service would go down, there would be this mad scramble to try to get it up and running again.

It didn’t take long for someone to suggest having a second server sitting around so that DNS records could be manually swung over if something did go wrong. Soon after that, the question became, “Why is my second server just sitting there idle unless the first one crashes? That’s a waste of money. We should split the load across the two of them, and then, if one goes down, the other can take over the load.” This marked the industry shifting from an active-standby model to an active-active model. Again, that worked pretty well for quite a while, at the cost of still needing manual work during a failure and additional complexity in capacity planning.

Over time, the microservices architecture let us extend this idea to making the workloads the element of replication, allowing finer-grained, more automated control: instead of duplicating a whole machine, we started running multiple replicas of individual workloads in a cluster. Kubernetes makes that easy, including automatically handling load shifting when a replica goes down.
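As a sketch of what that replication looks like in practice (the workload name and image here are illustrative), a Kubernetes Deployment needs only a `replicas` field to run multiple copies of a workload, and the control plane replaces any replica that fails:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout          # illustrative workload name
spec:
  replicas: 3             # run three copies; Kubernetes reschedules any that die
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
      - name: checkout
        image: example.com/checkout:1.2.3   # hypothetical image
        ports:
        - containerPort: 8080
```

A Service placed in front of these pods load-balances across the healthy replicas, which is the automatic load shifting described above.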

It’s a fairly natural extension from there to the idea of replicating entire clusters the way we used to replicate individual servers. More interestingly, we’re starting to see teams treat the entire cluster the way we used to treat individual services, viewing the cluster as a smaller unit within a larger application. Clusters are now used the way we once used namespaces: a team can spin up an entire cluster to play in instead of spinning up a namespace in the big production cluster, and the clusters work cooperatively in a single application. This raises some interesting questions about managing the network; Kubernetes can answer some of them, and service meshes can answer even more.

Beyond that, cloud computing itself pushed us toward this model: once clusters became cheap to create and destroy, having several of them work cooperatively in a single application became the natural next step.

Pets vs. cattle

In general, these are all changes that go along with a change widely described in the cloud-native world as moving from viewing our systems like pets to viewing them like cattle. Pre-cloud, we treated computers like pets: when one got sick, we made a great effort to try to “heal” the sick system. In the cloud, we’ve come around to viewing Kubernetes pods and nodes more like cattle: you don’t get attached to them, and if one gets sick, you discard it and let the system spin up a new one for you! The logical extension of this idea is to treat clusters, as well, like cattle.

The self-service mindset

There’s another aspect to all of this: we, as an industry, have learned over time that the only way to truly develop at scale is to use self-service workflows. Developers need to be able to handle most operations tasks themselves to truly work at full speed—but, of course, this gives rise to a certain conflict between permitting development speed and preserving repeatability and controllability for operations. This is probably the single greatest challenge across all of cloud-native computing.

One critical tool for mitigating these control issues is frameworks that permit developers to have full control in some areas but require them to work through additional process steps in others—for example, developers might have full control within a given namespace but need to go through operations gates for anything that affects the entire cluster. A cooperative multi-cluster can make this much more graceful: give the developers an entire cluster, and when they need staging, they can just spin up a new cluster on their laptop. It’s a way that development can have more control over how many environments they have and how they use them, while ops retains more control over a smaller number of production clusters.
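For example, the namespace-scoped gate described above is typically expressed with Kubernetes RBAC: a Role grants broad rights inside one namespace, while anything cluster-wide would need a separate, ops-controlled ClusterRole. The namespace, role, and group names below are hypothetical:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: team-a-dev        # developers get full control here...
  name: team-a-developer
rules:
- apiGroups: ["", "apps", "batch"]
  resources: ["*"]
  verbs: ["*"]                 # ...but only within this namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: team-a-dev
  name: team-a-developer-binding
subjects:
- kind: Group
  name: team-a                 # hypothetical group from your identity provider
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-a-developer
  apiGroup: rbac.authorization.k8s.io
```

In the cooperative multi-cluster model, the same boundary moves up a level: the development cluster as a whole plays the part of the namespace.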

Of course, managing multiple clusters introduces new challenges. The main technical challenge is communications: while Kubernetes automatically handles communications within a cluster, it needs outside help to manage multi-cluster communications. The most straightforward way to provide that help is with a service mesh: any of the major meshes can protect communications between clusters, whether they travel over the public Internet or over dedicated circuits, while also adding the reliability and observability capabilities that meshes always offer. A service mesh helps on the control side, too, since it can enforce access control across environments.
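As one concrete sketch of mesh-managed multi-cluster communications, Linkerd's multicluster extension links two clusters and then mirrors explicitly exported services between them. The cluster context names (`east`, `west`) and the `demo` namespace are assumptions for illustration:

```shell
# Install the multicluster extension in both clusters
linkerd --context=east multicluster install | kubectl --context=east apply -f -
linkerd --context=west multicluster install | kubectl --context=west apply -f -

# Link "west" into "east": generates the credentials and gateway config
# that "east" needs to reach services in "west"
linkerd --context=west multicluster link --cluster-name west \
  | kubectl --context=east apply -f -

# Export a service from "west"; Linkerd mirrors it into "east" as frontend-west
kubectl --context=west -n demo label svc/frontend mirror.linkerd.io/exported=true

# Verify the link and gateway health
linkerd --context=east multicluster check
```

Traffic from `east` to the mirrored service is then mTLS-protected by the mesh, regardless of whether the clusters are joined by the public Internet or a private circuit.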

There are organizational challenges as well. The simplest is that you’ll have more clusters to keep track of, but a more significant one is around cost. You’ll get maximum benefit in terms of reliability with clusters—or at least nodes—in multiple availability zones, but cross-zone traffic tends to be significantly more expensive than traffic within a single zone. Here, new features like Linkerd’s High Availability Zonal Load Balancing (HAZL) can help you: HAZL can understand where zone boundaries are and automatically keep communications within a single zone unless cross-zone traffic is needed for performance or reliability, dramatically reducing costs.

A final thought on multi-clusters

Redundancy reduces risk, so we should embrace cooperative multi-clusters for their ability to keep things running when disaster strikes. Like every other transition, this one brings challenges and risks of its own, but thankfully, in the cloud-native world, the service mesh can make it relatively pain-free.


About Flynn

Flynn is a technology evangelist at Buoyant, spreading the good word and educating developers about the Linkerd service mesh, Kubernetes, and cloud-native development in general. He has spent four decades in software engineering - from the kernel up through distributed applications, with a common thread of communications and security throughout - and is the original author and a maintainer of the Emissary-ingress API gateway. He can most easily be found as @flynn on the CNCF or Linkerd Slacks.
