Kubernetes disaster recovery: Five key questions


Kubernetes deployments offer plenty of advantages to enterprises that want to update their infrastructure and move to a cloud-native architecture.

But a lot of what makes Kubernetes attractive to developers and CIOs also creates potential problems when it comes to backup and disaster recovery (DR).

Conventional monolithic applications and even virtual machines (VMs) are relatively easy to back up and to build into a disaster recovery plan, as long as it’s done with care. But Kubernetes and containers – with their interconnected microservices, stateless applications, and persistent storage – work in a different way. Disaster recovery needs to allow for this.

Why do we need DR for Kubernetes?

Increasingly, containers and Kubernetes are used in production. As a result, Kubernetes is more likely to handle vital data and key business processes.

Organisations need to protect the data and the various microservices that make up a Kubernetes-based application and ensure that they can recover them accurately and in a timely manner.

And IT teams need to ensure all the critical parts of a Kubernetes-based deployment are covered by the disaster recovery plan. It is not just a question of protecting persistent storage with standard and immutable backups: organisations need to protect the entire cluster, its components and its data so they can restore it seamlessly. All this also needs testing.

What are the challenges of DR for Kubernetes?

DR for Kubernetes clusters means identifying and protecting cluster components and their configuration.

Then there are data volumes. Increasingly, Kubernetes data is on persistent storage, which makes the disaster recovery team’s task somewhat easier. But DR specialists need to be aware of where Kubernetes-based applications store data, as they can run across local, cloud and hybrid storage.

The good news, according to Gartner analyst Tony Iams, is that container applications have features that lend themselves to disaster recovery and business continuity, even if granular backup is trickier.

“The inherent portability and immutability of containers make it easier to replicate a complete application stack consistently at multiple locations,” he says. “Using continuous integration/continuous deployment [CI/CD] processes, containerised applications can easily be rebuilt and delivered where and when they are needed, either at a secondary site, or to reconstruct a primary site after a failure occurs.”

What are the risks to Kubernetes environments that need to be mitigated by DR?

The risks to Kubernetes are the same as those faced by any other enterprise technology operating environment: hardware failure, software problems – including in the underlying Linux OS – power or network failures, physical disasters and, of course, cyber attacks including ransomware.

However, containers’ flexibility and distributed nature can make applications vulnerable to single points of failure; distributed architectures can magnify the impact of hardware outages.

An enterprise could, for example, replicate an entire virtual machine, or create an immutable snapshot, and be reasonably confident it has captured everything needed to run an application or business process. With Kubernetes, there are more dependencies.

Iams identifies the way containerised applications handle storage as a specific risk. Unlike conventional applications, which use the host operating system’s file system, “containers persist data using volumes that write data to storage outside the container’s own local file system”, he says.

If containers are in Kubernetes clusters, then IT teams need to ensure that manifests and other policy configurations are backed up, and that containers can reattach to their storage after a restore.
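As a rough illustration of the reattachment check, the Python sketch below compares the persistent volume claims that restored pods ask for against the claims actually present in the restored manifests. It assumes the manifests have already been parsed into dictionaries; the function name `missing_claims` and the example workload names are hypothetical, and a real check would normally be run by a backup tool or against the live cluster.

```python
def missing_claims(pods, pvcs):
    """Return {pod name: [claim names the pod needs but the restore lacks]}.

    pods and pvcs are lists of manifests already parsed into dicts.
    """
    # Claims are namespaced, so match on (namespace, name).
    available = {
        (pvc["metadata"].get("namespace", "default"), pvc["metadata"]["name"])
        for pvc in pvcs
    }
    problems = {}
    for pod in pods:
        namespace = pod["metadata"].get("namespace", "default")
        missing = []
        for volume in pod.get("spec", {}).get("volumes", []):
            claim = volume.get("persistentVolumeClaim")
            if claim and (namespace, claim["claimName"]) not in available:
                missing.append(claim["claimName"])
        if missing:
            problems[pod["metadata"]["name"]] = missing
    return problems


if __name__ == "__main__":
    # Hypothetical restored manifests: the pod expects a claim nobody restored.
    pods = [{
        "metadata": {"name": "orders-api", "namespace": "shop"},
        "spec": {"volumes": [
            {"name": "data", "persistentVolumeClaim": {"claimName": "orders-data"}},
        ]},
    }]
    pvcs = []
    print(missing_claims(pods, pvcs))
```

An empty result suggests every restored pod can find its claim; it does not prove the underlying volume data is intact, which still needs a restore test.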

What key points would a DR plan for a Kubernetes environment contain?

Successful disaster recovery for a Kubernetes environment will typically be more granular than a recovery plan for conventional applications.

Firms can reduce downtime and data loss, provided they can recover specific parts of the Kubernetes system rather than whole clusters. Each part of a Kubernetes environment could have its own recovery point and recovery time objectives (RPO/RTO).
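To make per-component objectives concrete, here is a minimal Python sketch that flags components whose most recent backup is older than their recovery point objective. The component names, the RPO values and the function `rpo_breaches` are illustrative assumptions, not figures from any vendor or from Gartner.

```python
from datetime import datetime, timedelta, timezone


def rpo_breaches(last_backup, rpo, now):
    """Return names of components whose newest backup is older than their RPO.

    last_backup: {component: datetime of newest backup}
    rpo:         {component: timedelta of tolerated data loss}
    A component with no recorded backup always counts as a breach.
    """
    return sorted(
        name
        for name, limit in rpo.items()
        if name not in last_backup or now - last_backup[name] > limit
    )


if __name__ == "__main__":
    now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    # Hypothetical objectives: tight for the database, looser for manifests.
    rpo = {
        "orders-db": timedelta(minutes=15),
        "cluster-manifests": timedelta(hours=24),
    }
    last_backup = {
        "orders-db": now - timedelta(minutes=40),
        "cluster-manifests": now - timedelta(hours=3),
    }
    print(rpo_breaches(last_backup, rpo, now))  # ['orders-db']
```

The same comparison could be applied to recovery time objectives by recording how long each component took to restore in the last DR test.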

This, however, requires IT teams to have a comprehensive and up-to-date picture of their Kubernetes components and the business processes they support.

As with a DR plan for conventional environments, one approach is to prioritise the services that need to be restored first.

Here it’s useful to ask two linked questions:

  • Which Kubernetes-based applications are most critical to business operations, and so need to be back online first?
  • Which (Kubernetes) services and dependencies will bring those containers back most quickly?

Done well, this could allow an organisation to bring its applications online, perhaps with reduced functionality, more quickly than if it relied on restoring an entire cluster.
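The two questions above can be combined into a restore order: take applications in order of business priority, and bring up each one's dependencies before the application itself. The Python sketch below is one possible way to work that out, assuming the team maintains a dependency map by hand or from its service catalogue; `restore_order` and the service names are hypothetical.

```python
def restore_order(critical_apps, depends_on):
    """Return services in the order they should be restored.

    critical_apps: application names, most business-critical first.
    depends_on:    {service: set of services it needs running first}.
    """
    order, done = [], set()

    def restore(service, path=()):
        if service in done:
            return
        if service in path:
            raise ValueError(f"dependency cycle at {service}")
        # Dependencies come back before the service that needs them.
        for dependency in sorted(depends_on.get(service, ())):
            restore(dependency, path + (service,))
        done.add(service)
        order.append(service)

    for app in critical_apps:
        restore(app)
    return order


if __name__ == "__main__":
    # Hypothetical estate: payments matters most, reporting can wait.
    depends_on = {
        "payments": {"postgres", "ingress"},
        "reporting": {"postgres", "warehouse"},
    }
    print(restore_order(["payments", "reporting"], depends_on))
    # ['ingress', 'postgres', 'payments', 'warehouse', 'reporting']
```

Shared dependencies, such as the database here, are restored once, the first time a higher-priority application needs them.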

The exact approach will likely depend on the organisation’s maturity and approach to risk.

“At this stage, cloud-native and traditional infrastructure engineers have different views on how to best approach the problem,” says Iams.

“Cloud-native engineers prioritise redeployment methods via CI/CD workflows, while traditional approaches rely on backup and recovery tools for Kubernetes applications and data protection.” The analyst firm recommends an application-centric approach if the organisation is mature enough.

What are the infrastructure requirements of DR for Kubernetes?

When it comes to physical infrastructure, Kubernetes’ flexibility should make it easier to recover an application. This could be from on-premises hardware to the cloud, or even by moving between cloud providers.

DR specialists need to ensure the required resources are available. This includes the compute requirements to run the Kubernetes clusters and the storage space to recover persistent volumes. Suitable network resources are essential too.

For recovery of applications, if IT teams have used an application-centric GitOps approach, they can use Argo CD or Flux CD for recovery.

Otherwise, the best approach is likely to be a tool from a vendor that specialises in Kubernetes, such as Kasten, Trilio, CloudCasa, or Cohesity (which now also owns Veritas’ data protection business). Vendors such as Commvault and Rubrik also support containers and cloud-native applications.

These are “Kubernetes-aware” tools that deploy on clusters and understand how clusters make up an application – and how to restore them if there is an outage.


