Google launches Parallelstore file storage aimed at cloud AI training


Google Cloud Platform (GCP) has gone live with its Parallelstore managed parallel file storage service, which is aimed at intensive input/output (I/O) for artificial intelligence (AI) applications. It is based on the Distributed Asynchronous Object Storage (DAOS) architecture, an open source project originally developed by Intel. Intel initially intended DAOS to run on its Optane persistent memory, but that sub-brand is now defunct.

Parallelstore, which was previously in private preview, consists of a parallel file system deployed across numerous storage nodes, backed by a metadata store held in persistent memory. It distributes files across the maximum possible number of nodes to allow parallel access with the lowest possible latency for customers developing AI applications.

Despite the demise of Optane persistent memory – which formed part of the storage class memory technology space – DAOS still rests on some Intel intellectual property.

That includes its communications protocol, Intel Omni-Path, which is similar to InfiniBand and is deployed via Intel cards in compute nodes. These cards interrogate metadata servers to find the location of a file during read/write operations, then communicate with the storage node in block mode via RDMA over Converged Ethernet (RoCE).
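As a rough illustration of that two-step access pattern, a metadata lookup followed by a direct block-mode transfer, the Python sketch below is purely conceptual: the class and function names are hypothetical and do not correspond to any real DAOS or Parallelstore API.

```python
# Conceptual sketch of the two-phase read path described above.
# All names are hypothetical; this is not the DAOS or Parallelstore API.

from dataclasses import dataclass

@dataclass
class FileLocation:
    node: str        # storage node holding the data
    offset: int      # block offset on that node
    length: int      # extent length in bytes

class MetadataServer:
    """Stands in for the persistent-memory metadata store."""
    def __init__(self, index: dict):
        self._index = index

    def locate(self, path: str) -> FileLocation:
        # Phase 1: the client card interrogates the metadata server.
        return self._index[path]

def rdma_read(node: str, offset: int, length: int) -> bytes:
    # Placeholder for the RDMA over Converged Ethernet transfer.
    return b"\x00" * length

def read_file(meta: MetadataServer, path: str) -> bytes:
    loc = meta.locate(path)
    # Phase 2: talk to the storage node directly in block mode.
    return rdma_read(loc.node, loc.offset, loc.length)

meta = MetadataServer({"/data/shard-0": FileLocation("node-17", 4096, 1 << 20)})
print(len(read_file(meta, "/data/shard-0")))  # 1048576
```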

Saturate server bandwidth

“This efficient data delivery maximises goodput to GPUs [graphics processing units] and TPUs [tensor processing units], a critical factor for optimising AI workload costs,” said GCP product director Barak Epstein in a blog post. “Parallelstore can also provide continuous read/write access to thousands of VMs [virtual machines], GPUs and TPUs, satisfying modest-to-massive AI and high-performance computing workload requirements.” 

He added that at the maximum Parallelstore deployment size of 100TB (terabytes), throughput can scale to around 115GBps, with three million read IOPS, one million write IOPS and latency as low as around 0.3 milliseconds.

“This means that Parallelstore is also a good platform for small files and random, distributed access across a large number of clients,” said Epstein.

According to Epstein, AI model training times can be nearly four times faster than with other machine learning data loaders.

GCP’s idea is that customers first put their data in Google Cloud Storage, which can serve all use cases on GCP, whether accessed from software-as-a-service applications or from virtual machines. From there, the customer can select the data best suited to AI processing via Parallelstore from among all its data. To help here, GCP offers its Storage Insights Dataset service, part of its Gemini AI offering, which helps customers assess their data.
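As a minimal sketch of that first staging step, the snippet below uses the google-cloud-storage Python client to enumerate objects under a prefix that might be earmarked as training data. The bucket name, prefix and size filter are illustrative placeholders, not anything Storage Insights actually computes.

```python
# Minimal sketch: enumerate candidate training objects in Cloud Storage.
# Bucket name, prefix and the size filter are illustrative placeholders.

from google.cloud import storage

def list_training_candidates(bucket_name: str, prefix: str, max_bytes: int):
    client = storage.Client()  # uses application default credentials
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.size is not None and blob.size <= max_bytes:
            yield blob.name, blob.size

for name, size in list_training_candidates(
    "my-ai-bucket", "datasets/images/", 32 * 1024 * 1024
):
    print(f"{name}: {size} bytes")
```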

Once data is selected as training data, its transfer to Parallelstore can take place at 20GBps. If files are small – less than 32MB, for example – it’s possible to achieve a transfer rate of 5,000 files per second.
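To put those figures in perspective, here is a back-of-the-envelope calculation; the rates are those quoted by GCP, while the dataset sizes are purely illustrative.

```python
# Back-of-the-envelope transfer times using the rates quoted above.
# Dataset sizes are hypothetical; rates are the figures cited by GCP.

BULK_RATE_GBPS = 20           # GB per second for large-file transfer
SMALL_FILE_RATE = 5_000       # files per second for files under ~32MB

dataset_tb = 10                                # hypothetical 10TB dataset
seconds = dataset_tb * 1_000 / BULK_RATE_GBPS  # 10,000GB / 20GBps
print(f"{dataset_tb}TB bulk transfer: ~{seconds:.0f}s (~{seconds / 60:.1f} min)")

n_small_files = 1_000_000                      # hypothetical small-file corpus
print(f"{n_small_files:,} small files: ~{n_small_files / SMALL_FILE_RATE:.0f}s")
```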

Beyond the AI training use cases targeted by GCP, Parallelstore will also be accessible to Kubernetes clusters – such as those run on GCP’s Google Kubernetes Engine (GKE) – through dedicated CSI drivers. In practice, administrators will be able to manage Parallelstore volumes like any other storage attached to GKE.
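To illustrate what that looks like from the administrator’s side, the sketch below uses the official Kubernetes Python client to define a persistent volume claim against a Parallelstore-backed storage class. The storage class name and the capacity request are assumptions for illustration, not details confirmed by GCP.

```python
# Sketch: a PersistentVolumeClaim against a Parallelstore-backed storage
# class, built with the official Kubernetes Python client. The storage
# class name and capacity are assumptions, not confirmed details.

from kubernetes import client

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="training-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],       # many pods reading in parallel
        storage_class_name="parallelstore",   # hypothetical class name
        resources=client.V1ResourceRequirements(
            requests={"storage": "12Ti"}      # illustrative capacity request
        ),
    ),
)

# To apply against a live cluster:
#   from kubernetes import config
#   config.load_kube_config()
#   client.CoreV1Api().create_namespaced_persistent_volume_claim("default", pvc)
print(pvc.metadata.name, pvc.spec.resources.requests)
```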

DAOS is an open source object storage system that decouples the data and control planes while segregating I/O metadata and indexing workloads from bulk storage.

DAOS stores metadata on fast persistent memory and bulk data on non-volatile memory express (NVMe) solid-state drives (SSDs). According to Intel, DAOS read/write I/O performance scales almost linearly as the number of client I/O requests rises – up to approximately 32 to 64 remote clients – making it well suited to the cloud and other shared environments.
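A purely conceptual sketch of that tiering follows; the class names are made up and bear no relation to the real DAOS code base, which the sketch simplifies to in-memory dictionaries.

```python
# Conceptual tiering sketch: small metadata records go to a fast
# persistent-memory tier, bulk payloads to NVMe. All names are made up.

class PMemTier:
    def __init__(self):
        self.records = {}   # low-latency metadata and index store

class NVMeTier:
    def __init__(self):
        self.extents = {}   # high-capacity bulk data store

class ObjectStore:
    def __init__(self):
        self.meta, self.bulk = PMemTier(), NVMeTier()

    def put(self, key: str, payload: bytes) -> None:
        # Metadata and indexing stay off the bulk-data path.
        self.meta.records[key] = {"size": len(payload)}
        self.bulk.extents[key] = payload

    def get(self, key: str) -> bytes:
        assert key in self.meta.records      # index lookup first
        return self.bulk.extents[key]        # then the bulk read

store = ObjectStore()
store.put("obj-1", b"x" * 1024)
print(store.meta.records["obj-1"])           # {'size': 1024}
```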


