Massive cloud provider outages of the past year show the importance of having a DRaaS plan for your cloud-native applications.
No matter where a business is in its digital transformation, whether it runs a makeshift data center in the broom closet or operates fully in the cloud with Kubernetes and Azure, every business can agree that data safety is everything. But for those on the latter end of that spectrum, the always-on, multi-tenancy promise from every cloud provider might be lulling IT leadership into thinking their data is safe in one place. Could disaster recovery as a service (DRaaS) find new importance in the cloud-native age?
In the on-premises past, disaster recovery plans meant moving tape backups to an offsite location to protect from natural disasters like hurricanes, floods, wildfires, and earthquakes. Or making sure there’s a failover plan when the building’s power goes out. And more recently, there’s been increased focus on small or medium-sized businesses (SMBs) being able to adequately recover from cyberattacks like ransomware.
But in the cloud-native age, where each cloud provider promises built-in backups, and Kubernetes manifests attempt to simplify the process of spinning up your application in a new cloud in case of disaster, DRaaS can seem obsolete. The reality is very much the opposite, and those who don’t take disaster recovery into their own hands will find themselves facing the same disastrous end.
How disaster recovery works in the cloud-native age
DRaaS works in fundamentally the same way no matter what your infrastructure looks like. In the case of Kubernetes, microservices, and containers, a DRaaS solution replicates and synchronizes each component of your infrastructure to one or more additional public clouds, such as AWS, GCP, Azure, IBM Cloud, or Oracle Cloud.
Let’s say you use AWS as your primary public cloud provider, and a massive outage (like the ones that happened back in December 2021) takes your application offline entirely. Instead of twiddling your thumbs or trying to manually launch your cluster elsewhere, you could fail over to your DRaaS-replicated cluster and get back to serving your users.
DRaaS providers let you manage multiple virtual private clouds (VPCs) in one place, encompassing both the data and the underlying infrastructure. You can back up, replicate, or migrate whole clusters or just specific parts, including resources such as DaemonSets, ConfigMaps, persistent volumes, pods, and secrets, along with the relationships between them. Scheduled backup policies ensure that your whole disaster recovery plan doesn’t hinge on a single IT employee.
Some Kubernetes-friendly DRaaS platforms also help you proactively detect whether the integrity of your cluster has been violated, which means you can respond faster, isolate the affected infrastructure, switch over to the backup infrastructure, and restore from the latest unaffected backup, reducing data loss. And for your end users, a disaster on your end might result in just a few seconds of latency.
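How a given platform detects an integrity violation is vendor-specific and not described here. As a rough illustration of the general idea only, the sketch below fingerprints each resource manifest from a known-good backup and compares it with the live cluster's manifests. The function names (`fingerprint`, `detect_drift`) and the resource keys are hypothetical, and the manifests are assumed to be already loaded as plain Python dictionaries.

```python
import hashlib
import json


def fingerprint(manifest):
    """Return a SHA-256 hex digest of a manifest's canonical JSON form.

    Sorting keys makes the digest independent of dictionary ordering.
    """
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_drift(baseline, current):
    """Compare two {resource_name: fingerprint} mappings.

    Returns the names that were added, removed, or changed since the
    baseline was taken. (Hypothetical helper, not a vendor API.)
    """
    base_keys, cur_keys = set(baseline), set(current)
    return {
        "added": sorted(cur_keys - base_keys),
        "removed": sorted(base_keys - cur_keys),
        "changed": sorted(
            name for name in base_keys & cur_keys
            if baseline[name] != current[name]
        ),
    }


# Illustrative data: a known-good backup versus the live cluster state.
known_good = {
    "configmap/app": {"data": {"mode": "prod"}},
    "secret/db": {"data": {"password": "redacted"}},
}
live = {
    "configmap/app": {"data": {"mode": "debug"}},
    "daemonset/unknown": {"spec": {}},
}

baseline = {name: fingerprint(m) for name, m in known_good.items()}
current = {name: fingerprint(m) for name, m in live.items()}
report = detect_drift(baseline, current)
print(report)
```

Any non-empty entry in the report is a cue to investigate before restoring. A real platform would also have to account for fields that change legitimately, such as status and timestamps, which this sketch ignores.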
And they let you do all this while charging only for what you use—or offering a barebones deployment at minimal cost until you need to flip the switch during a disaster—to keep costs low.
There’s no doubt that despite its configuration complexities, Kubernetes delivers a far more reliable and fault-tolerant infrastructure than bare-metal servers or VMs in a cloud data center. The same goes for the availability of public cloud offerings like AWS or GCP over on-premises servers. But simply pairing the two together doesn’t mean your infrastructure is prepared for a disaster.
Nine-nines of uptime doesn’t matter if a cyberattacker figures out how to run kubectl commands against your cluster. Promises of data redundancy are meaningless if you can’t restore your cluster because you don’t have any viable snapshots, and everything in your persistent volumes is unusable.
In fact, Kubernetes likely requires even more disaster recovery diligence than previous infrastructure solutions. Because of its configuration complexity, and the fact that a cluster can spread across public, private, and hybrid cloud infrastructure, taming what should be a simple project—like a data backup—can be an enormously complex manual process that’s prone to security flaws and misconfigurations when trying to go live again.
Beyond concerns about immediate recovery from disaster, many cloud-native companies are looking into DRaaS to improve their Recovery Point Objective (RPO) under duress from cyberattacks, whether that’s a social engineering effort or a bot-led DDoS attack. RPO is expressed as an amount of time representing the maximum acceptable window of data loss after an unplanned incident. Because the RPO also defines how old a backup can be before it’s considered unreliable, it’s an essential goalpost for IT teams that need to ensure their Kubernetes workloads have the best availability possible.
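To make the RPO definition concrete, here is a minimal sketch that checks whether the newest backup is still within a given RPO and reports the largest gap between backups. The function names and the sample timestamps are illustrative assumptions, not part of any DRaaS product.

```python
from datetime import datetime, timedelta, timezone


def latest_backup_within_rpo(backup_times, rpo, now):
    """Return True if the newest backup is no older than the RPO."""
    if not backup_times:
        # With no backups, any incident loses more data than the RPO allows.
        return False
    return now - max(backup_times) <= rpo


def largest_backup_gap(backup_times, now):
    """Return the longest interval between consecutive backups.

    The interval from the newest backup to `now` is included, because an
    incident right now would lose everything written since that backup.
    """
    if not backup_times:
        raise ValueError("at least one backup is required")
    points = sorted(backup_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


# Illustrative schedule: backups taken 26, 14, and 2 hours ago.
now = datetime(2022, 1, 1, 12, 0, tzinfo=timezone.utc)
backups = [now - timedelta(hours=h) for h in (26, 14, 2)]

print(latest_backup_within_rpo(backups, timedelta(hours=4), now))  # True
print(largest_backup_gap(backups, now))  # 12:00:00
```

In this example the newest backup satisfies a four-hour RPO at this moment, but the 12-hour gaps in the schedule show that an incident at a less fortunate time would have exceeded it. That is why the backup interval, not just the age of the latest backup, has to be shorter than the RPO.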
Whether you’ve looked into DRaaS solutions in your pre-Kubernetes era, or are looking for continuous replication solutions for the first time, the questions you should ask of any potential provider aren’t all that different from the VM days, only with new verbiage:
- Can I use this to back up critical cluster resources and their relationships?
- Can I restore specific resources instead of an entire infrastructure?
- How easily can I migrate workloads from one public cloud to another?
- Am I limited in how often I can snapshot persistent volumes?
- Do I only have to pay for what I use in the event of a disaster?
- Does the failover process include downtime?
An essential part of any disaster recovery plan is acknowledging that no part of your infrastructure is faultless, cloud-native included. By pairing developer- and DevOps-friendly tooling with the newest in DRaaS, you might be able to walk the line between moving fast with the latest tech and making sure it doesn’t crumble in your hands at the first sign of disaster.