Kubernetes at 10: Persistent storage matures, helped by Operators


Kubernetes is 10! Mid-2024 sees the 10th birthday of the market-leading container orchestration platform.

The decade began with containers emerging as a novel way of virtualising applications, but with storage and data protection practically non-existent. Now, Kubernetes offers a mature container platform for cloud-native applications with all the functionality required for the storage of stateful data.

We mark the first decade of Kubernetes with a series of interviews with engineers who have helped develop Kubernetes and tackle challenges in storage and data protection – including the use of Kubernetes Operators – as we look forward to a future characterised by artificial intelligence (AI) workloads.

Here, Saad Ali, senior staff software engineer at Google – which designed Kubernetes – talks about early storage and data protection challenges, his involvement in container storage interface (CSI) drivers, solving the conundrum of stateful storage for an elastic and portable environment, and how Kubernetes Operators were a massive leap forward in breaking that logjam.

What was the market like when Kubernetes first launched?

Saad Ali: When Kubernetes first launched, developers and vendors were very interested in containers but had no idea where to begin. Docker had energised the industry and made development easier, and there was lots of interest in figuring out how to make it useful for deployments at scale.

Many storage vendors were waiting on the sidelines trying to decide where to invest. A few jumped in early and wrote the original in-tree volume plugins for Kubernetes, which was a daunting process that required checking code directly into the core Kubernetes repository. With all the uncertainty, most storage vendors took a wait-and-see approach. If we had never got past that hesitation, Kubernetes would not be what it is today.

How did you get involved in storage for Kubernetes?

Ali: I got involved with the storage side of Kubernetes by fixing a bunch of storage-related issues. Eventually, I was tasked with fixing the race conditions that had existed since 1.0 in the storage stack. I ended up pushing a large re-architecting of the volume lifecycle stack into Kubernetes 1.3, including extracting attach/detach logic from the kubelet and moving it to a central controller to reduce race conditions. That, along with many additional fixes from many developers over multiple subsequent releases, helped improve the stability of the overall volume sub-system.

I became Kubernetes SIG [Special Interest Group] Storage co-lead, and one of my main areas of focus over subsequent years was figuring out how to make Kubernetes storage more extensible. I helped start and develop the container storage interface.

When did you realise Kubernetes was in the leading position in the market?

Ali: For Kubernetes in general, I remember hearing that Pokémon Go was running on top of Kubernetes. That was an incredible moment and a realisation that Kubernetes was taking off.

For Kubernetes storage, when we first started developing CSI, I went to a meetup in San Francisco to talk about it and one of the audience members asked: “What makes this effort different from so many past efforts to standardise storage the OpenStack community has tried? What makes you think this time will be different?”

That was a reminder that success was not guaranteed, so we made an active effort with CSI to work tightly with the storage community and multiple container orchestrators, build very methodically and continuously iterate on CSI, and not call it 1.0 until we had multiple working drivers/integrations.

This made storage vendors comfortable enough to invest in building plugins, and led to a virtuous cycle – more vendors developing for Kubernetes made Kubernetes more useful, increased Kubernetes adoption and led to more vendors developing for it. It was when the list of CSI drivers grew beyond 100 drivers that I realised we’d achieved something special.

When you looked at Kubernetes, how did you approach data and storage?

Ali: Beyond the extensibility of the Kubernetes platform (with integrations like CSI), in my opinion, the magic of Kubernetes storage is that it decouples block/file storage from compute – “separation of concerns” – and makes stateful workloads as elastic as non-stateful workloads. When stateful workloads no longer had to be anchored to the node they were provisioned on, it became easier to self-heal without human intervention.

Even Kubernetes 1.0 included a basic “in-tree volume plugin” system that allowed Kubernetes to automatically attach/format/mount arbitrary block/file storage volumes to pods and unmount/detach them as pods moved across nodes. This was critical in enabling scheduling elasticity for stateful workloads. Even when something went wrong with a compute node, your data could automatically be made available on another node without human intervention.

What issues first came up around data and storage with Kubernetes for you?

Ali: The Kubernetes storage sub-stack is incredibly complicated. We originally dealt with lots of race conditions. One of the biggest challenges with the storage stack is that it has to handle the worst-case trade-off between automatic recovery and potential data loss and corruption. Neither is an acceptable outcome, so Kubernetes SIG Storage has worked incredibly hard to identify these edge cases and ensure graceful recovery from them.

What had to change?

Ali: Originally, the Kubernetes storage stack assumed a storage admin would pre-provision a pool of block/file volumes of varying sizes for application operators to use. This led to inefficient usage of available storage capacity.

SIG Storage introduced the concept of dynamic provisioning in Kubernetes 1.6. This was a game-changer for the usability of block/file storage at scale and completed the automation of the storage volume lifecycle because it provisioned storage on-demand as needed by workloads, eliminated human intervention and enabled efficient usage of storage capacity.
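The capacity argument can be illustrated with a toy model. This is a minimal sketch, not Kubernetes code: the `StaticPool` and `DynamicProvisioner` classes, the volume sizes and the request sizes are all hypothetical, chosen only to show why binding claims to pre-made volumes tends to strand capacity while on-demand provisioning does not.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class StaticPool:
    """Hypothetical pool of fixed-size volumes (GiB) created ahead of time by an admin."""
    free_volumes: list[int]
    allocated: int = 0  # capacity handed out to workloads, in GiB
    requested: int = 0  # capacity workloads actually asked for, in GiB

    def claim(self, size_gib: int) -> int | None:
        # Bind the smallest free volume that is large enough, if one exists.
        candidates = [v for v in self.free_volumes if v >= size_gib]
        if not candidates:
            return None  # the claim waits until an admin adds a suitable volume
        chosen = min(candidates)
        self.free_volumes.remove(chosen)
        self.allocated += chosen
        self.requested += size_gib
        return chosen


@dataclass
class DynamicProvisioner:
    """Hypothetical provisioner that creates a volume of exactly the requested size."""
    allocated: int = 0
    requested: int = 0

    def claim(self, size_gib: int) -> int:
        self.allocated += size_gib
        self.requested += size_gib
        return size_gib


def wasted_gib(provider: StaticPool | DynamicProvisioner) -> int:
    """Capacity handed out beyond what workloads asked for."""
    return provider.allocated - provider.requested


if __name__ == "__main__":
    requests = [8, 30, 60]  # illustrative claim sizes in GiB
    pool = StaticPool(free_volumes=[10, 50, 100, 100])
    dynamic = DynamicProvisioner()
    for size in requests:
        pool.claim(size)
        dynamic.claim(size)
    print("static pool wasted GiB:", wasted_gib(pool))
    print("dynamic wasted GiB:", wasted_gib(dynamic))
```

In this made-up scenario the static pool hands out 160GiB to satisfy 98GiB of requests, while the on-demand model hands out exactly what was asked for. Real clusters differ in many ways, but the shape of the inefficiency is the same.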

What happened around Kubernetes Operators that made them a success for data and storage?

Ali: As the building blocks fell into place – StorageClass, PersistentVolumeClaim (PVC) and PersistentVolume (PV) interfaces, CSI, dynamic provisioning – the orchestration of higher-level stateful primitives became a focus.
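To make those building blocks concrete, here is a rough sketch of how a claim points at a storage class. It builds the two objects as plain Python dictionaries and prints them as JSON. The field names follow the standard StorageClass and PersistentVolumeClaim schemas, but the provisioner name `csi.example.com`, the object names and the 10Gi size are placeholders rather than anything from the interview.

```python
import json


def build_storage_class(name: str, provisioner: str) -> dict:
    """Describe a class of storage and the CSI driver that provisions it."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,  # placeholder driver name, not a real product
        "reclaimPolicy": "Delete",
        "volumeBindingMode": "WaitForFirstConsumer",
    }


def build_pvc(name: str, storage_class_name: str, size: str) -> dict:
    """Describe a workload's request for storage from a given class."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class_name,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    storage_class = build_storage_class("fast-example", "csi.example.com")
    claim = build_pvc("db-data", storage_class["metadata"]["name"], "10Gi")
    print(json.dumps([storage_class, claim], indent=2))
```

The application only names a class and a size. With dynamic provisioning, the PersistentVolume that satisfies the claim is created by the driver behind that class, so nobody has to write a PV object by hand.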

Kubernetes offered StatefulSets as a building block for stateful workloads. But it became apparent that many complex stateful applications required further careful orchestration of their data to enable things like replication, scaling, and so on.

This is where Kubernetes Operators came in. They offered an easy way to extend Kubernetes to enable application-aware operations like sharding of data to ensure high availability, preventing all replicas from becoming unavailable at once, and so on.
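The "not all replicas at once" idea can be sketched as a small reconcile loop. This is a toy in plain Python, not a real Operator and not built on any Operator framework: the `Replica` type, the `reconcile` and `roll_out` functions and the `min_available` parameter are all hypothetical names used for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    version: str
    ready: bool = True


def reconcile(replicas: list[Replica], desired_version: str, min_available: int) -> str | None:
    """One reconcile pass: return the name of the replica to restart next, or None."""
    outdated = [r for r in replicas if r.version != desired_version]
    if not outdated:
        return None  # observed state already matches desired state
    # Replicas that are already down can be replaced without costing availability.
    for replica in outdated:
        if not replica.ready:
            return replica.name
    # Taking a ready replica down is only safe if enough others stay ready.
    ready_count = sum(1 for r in replicas if r.ready)
    if ready_count - 1 >= min_available:
        return outdated[0].name
    return None  # hold off until more replicas are ready


def roll_out(replicas: list[Replica], desired_version: str, min_available: int) -> list[str]:
    """Apply reconcile repeatedly, simulating each restart completing before the next pass."""
    order: list[str] = []
    while (name := reconcile(replicas, desired_version, min_available)) is not None:
        target = next(r for r in replicas if r.name == name)
        target.ready = False              # replica goes down for the upgrade
        target.version = desired_version  # upgraded copy starts
        target.ready = True               # and reports ready again
        order.append(name)
    return order


if __name__ == "__main__":
    cluster = [Replica("db-0", "v1"), Replica("db-1", "v1"), Replica("db-2", "v1")]
    print(roll_out(cluster, desired_version="v2", min_available=2))
```

The point of the sketch is the guard in `reconcile`: the loop knows something about the application (how many replicas must stay up) and refuses to act when acting would break it. A production Operator would watch the cluster API and handle failures, which this toy leaves out.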

How did this support more cloud-native approaches? What were the consequences?

Ali: The Kubernetes storage stack, along with Kubernetes Operators, enables truly elastic use of available compute resources. This, in my opinion, is the heart of what it means to be built in the cloud – having a large elastic pool of compute resources that can be used by all your workloads, on-demand, scaling in and out without human intervention to maximise availability and reduce costs.

Kubernetes is now 10. How do you think about it today?

Ali: At 10 years old, Kubernetes is beginning to feel like Linux for distributed computing. It’s a powerful and extensible open source tool that has become widely adopted and a key building block for modern distributed compute infrastructure.

What problems still exist around Kubernetes when it comes to data and storage?

Ali: Kubernetes has made it a lot easier to develop portable, scalable stateful applications that leverage block and file storage. But storage is still hard for many developers who don’t want to worry about how different databases work, the differences between asynchronous and synchronous replication, the trade-offs between various storage redundancy profiles, and so on.

It may not be Kubernetes’ problem to solve, but, as an industry, we have to make it easier to build stateful applications for all types of developers with varying scale, redundancy and performance requirements while maintaining vendor-agnostic portability.

Any other anecdotes or information to share?

Ali: Open source projects like Kubernetes are possible thanks to lots of amazing contributors from around the world. They need continued maintenance and improvement. I’d encourage anyone interested to come join us. If you’re interested in storage, join Kubernetes SIG Storage. If you’re interested in data, join the Data on Kubernetes community. Get involved and help us drive the next generation of improvements.


