Here’s a number that should bother anyone running AI workloads at scale: large-model training jobs burn 20% to 40% of their accelerator time doing nothing. Not training. Not inferencing. Just waiting for data. At $3 to $10 per GPU hour, that idle fraction is real money, and it shows up on the GPU bill rather than the storage bill.
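To make that concrete, here is a back-of-the-envelope sketch of the idle spend. The idle fractions and hourly rates come from the ranges above; the job size (256 GPUs for seven days) and the `idle_cost` helper are hypothetical, chosen only for illustration.

```python
def idle_cost(gpu_count, hours, rate_per_gpu_hour, idle_fraction):
    """Return the dollars spent on accelerator time that did no useful work."""
    if not 0.0 <= idle_fraction <= 1.0:
        raise ValueError("idle_fraction must be between 0 and 1")
    return gpu_count * hours * rate_per_gpu_hour * idle_fraction


# Hypothetical job: 256 GPUs running for 7 days (168 hours).
GPU_COUNT = 256
HOURS = 7 * 24

# Best case from the ranges above: $3 per GPU hour, 20% idle.
low = idle_cost(GPU_COUNT, HOURS, 3.0, 0.20)
# Worst case from the ranges above: $10 per GPU hour, 40% idle.
high = idle_cost(GPU_COUNT, HOURS, 10.0, 0.40)

print(f"Idle spend for this job: ${low:,.0f} to ${high:,.0f}")
```

Swap in your own fleet size, run length, and negotiated rate to see where your jobs land in that range.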
That misattribution is the root of the problem. Teams assume the bottleneck is in the training code, the model architecture, or the batch configuration. But if GPU utilization sits below 85% during active training, storage is almost always the culprit. Adding more GPUs doesn’t help when they’re waiting on the same slow storage. The fix requires looking somewhere most ML teams don’t look first.
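One way to check whether storage is the bottleneck is to time how long the training loop waits for each batch versus how long it spends computing. The sketch below is a minimal, framework-agnostic illustration: `data_wait_fraction`, `slow_batches`, and the simulated delays are hypothetical names and values, not part of any library. Frameworks that launch GPU work asynchronously would need a synchronization point inside `train_step` for the compute timing to be meaningful.

```python
import time


def data_wait_fraction(batches, train_step, clock=time.perf_counter):
    """Return the share of loop time spent waiting for the next batch."""
    wait = 0.0
    compute = 0.0
    iterator = iter(batches)
    while True:
        t0 = clock()
        try:
            batch = next(iterator)  # time blocked on the data pipeline
        except StopIteration:
            break
        t1 = clock()
        train_step(batch)  # time spent doing useful work
        t2 = clock()
        wait += t1 - t0
        compute += t2 - t1
    total = wait + compute
    return wait / total if total > 0 else 0.0


def slow_batches(count, delay_seconds):
    """Simulated data loader that stalls before yielding each batch."""
    for index in range(count):
        time.sleep(delay_seconds)
        yield index


# Simulated run: each batch takes 20 ms to arrive and 10 ms to process.
fraction = data_wait_fraction(slow_batches(5, 0.02), lambda batch: time.sleep(0.01))
print(f"{fraction:.0%} of loop time was spent waiting on data")
```

If the reported fraction is large on a real job, the accelerators are starved for input, and tuning the model or batch size is unlikely to recover that time.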
What Google Built for This
Hyperdisk ML is a storage type on Google Cloud built specifically for loading large model weights onto accelerators. Its defining feature is multi-attach: a single volume connects simultaneously to up to 2,500 GPU or TPU instances in read-only mode. Instead of copying weights to every inference replica separately, you provision one volume and mount it fleet-wide. A 70-billion-parameter model that previously required hundreds of separate copies now requires one. Storage costs drop, and cold-start time for fleet-wide updates drops with it.
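A rough sketch of the storage arithmetic behind that claim follows. It assumes 16-bit weights (2 bytes per parameter) and a hypothetical fleet of 300 inference replicas; neither figure comes from Google, and the calculation ignores volume provisioning minimums and pricing.

```python
def weights_size_gb(parameters, bytes_per_parameter=2):
    """Approximate size of a model's weights in decimal gigabytes."""
    return parameters * bytes_per_parameter / 1e9


# 70 billion parameters stored as 16-bit values (assumption).
per_copy_gb = weights_size_gb(70e9)

# Hypothetical inference fleet size.
replicas = 300

# One private copy per replica versus one shared read-only volume.
copied_total_gb = per_copy_gb * replicas
shared_total_gb = per_copy_gb

print(f"Per-replica copies: {copied_total_gb:,.0f} GB")
print(f"Single shared volume: {shared_total_gb:,.0f} GB")
```

The ratio between the two totals is simply the replica count, so the saving grows linearly with the size of the fleet.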
Rapid Storage handles the training data ingestion problem differently. It delivers sub-millisecond read latency and roughly 20x faster data loading than standard cloud object storage. Co-located with accelerator pods, it keeps GPUs fed continuously rather than idle between batches. InstaDeep used Google Cloud TPUs and high-performance storage to train genomics models with over 20 billion parameters, achieving near-linear scaling where traditional storage would have become a hard ceiling.
Parallelstore rounds out the picture for distributed training. Built on the DAOS architecture, it delivers approximately 115 GiB/s of throughput and latency under 0.3 milliseconds at scale. In Google Cloud’s benchmarking, it enabled 3.9x faster training times and 3.7x higher throughput compared to native ML framework data loaders. In practice, a four-day training run might complete in roughly one. The hourly accelerator rate stays the same; the output per dollar improves substantially.
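The four-days-to-one estimate is just the benchmark speedup applied to a baseline run. Here is that arithmetic as a small sketch; the 128-accelerator job size and the `accelerated_run` helper are hypothetical, and a real workload may not see the full benchmarked speedup.

```python
def accelerated_run(baseline_days, speedup, accelerators):
    """Return (new duration in days, accelerator-hours freed) for one run."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    new_days = baseline_days / speedup
    hours_freed = (baseline_days - new_days) * 24 * accelerators
    return new_days, hours_freed


# Four-day baseline, 3.9x benchmark speedup, hypothetical 128 accelerators.
new_days, hours_freed = accelerated_run(4.0, 3.9, 128)
print(f"New duration: {new_days:.2f} days")
print(f"Accelerator-hours freed per run: {hours_freed:,.0f}")
```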
The Competitive Gap
Neither AWS nor Azure offers a managed equivalent to Hyperdisk ML’s multi-attach read-only semantics for inference fleet weight sharing. AWS has high-throughput HPC storage, but it requires manually provisioning and managing distributed file systems. That’s engineering overhead with no relation to your AI roadmap. The managed, purpose-built approach is the differentiator, and as inference fleets scale, that operational simplicity compounds quickly.
One thing worth flagging for teams planning long-horizon investments: Parallelstore is scheduled for deprecation in October 2026, with Google Cloud Managed Lustre as its successor. The migration path is defined. More importantly, the direction is clear: Google is deepening its investment in purpose-built AI storage, not stepping back from it.
Why This Matters to ISVs
The internal case is straightforward. If your team trains or fine-tunes models, faster storage means your accelerator budget produces more useful work. Faster iteration cycles lead to better models shipped sooner. That compounding advantage is especially meaningful during competitive product development cycles where shipping first matters.
The product-side case is more interesting. An ISV building a model-serving or MLOps platform can use Hyperdisk ML multi-attach to offer customers a real, architecturally grounded SLA on model deployment speed. Specifically, fast cold-start times and fleet-wide weight propagation become promises you can actually keep, because the storage architecture supports them. That’s the kind of infrastructure advantage you can put in a sales deck without having to hedge it.
Beyond inference serving, ISVs building AI training platforms on Google Cloud can use Parallelstore and Rapid Storage to differentiate on training efficiency. Customers who run large training jobs regularly care deeply about accelerator utilization. They’re paying for it by the hour. Consequently, a platform that demonstrably improves that utilization has a value proposition that shows up directly on the next training invoice, which is a much easier conversation than abstract performance claims.
The underlying question is worth asking: how much of your current GPU spend is generating actual training work, and how much is idle time while storage catches up? For most teams, the answer is uncomfortable. The fix, however, is available now: the performance numbers are publicly documented, and the managed services handle the operational complexity. The barrier to starting is low relative to the potential upside.
Want to go deeper?
- Google Cloud AI Hypercomputer Blog: Hyperdisk ML specs, multi-attach architecture, and Rapid Storage announcement.
- Hyperdisk ML Official Documentation: Multi-attach limits and model weight loading performance details.
- Google Cloud Parallelstore: High-performance parallel file system for AI training and HPC workloads.
- Rapid Storage Overview: Sub-millisecond object storage for AI training workloads co-located with accelerator pods.
