Cloud Run GCP: GPU Inference Without the Cluster

Serverless computing and GPUs have historically been a difficult pair. Serverless is about scaling to zero and paying only for what you use. GPUs are expensive, high-demand hardware that usually requires a persistent cluster to be economically viable. Until recently, getting a GPU usually meant provisioning a GKE cluster or a set of VMs and keeping them running even when they weren’t actively processing requests.

Cloud Run GPUs, which hit general availability in mid-2025, change that. This is how it’s supposed to work: you tell the platform how many GPUs you need, they’re reserved while your requests are being served, and the service scales to zero when you’re done.

The Infrastructure Gap

For ISVs, the challenge of GPU inference has always been cost vs. readiness. If you want to offer an AI-powered feature to your customers, you have two bad options. Option one is a warm pool of GPU instances that you pay for 24/7, even if your customers only use the feature twice a day. Option two is scaling from zero on a traditional cluster, which means your customer waits three minutes for a node to spin up and drivers to load, only for the request to fail on a timeout before the model ever responds.

Cloud Run GPUs solve this by making the GPU a first-class serverless resource. Google manages the hardware, the drivers, and the container orchestration. You just bring your containerized model and a standard Cloud Run configuration. The cold-start times are remarkably low because Google is optimizing the underlying infrastructure specifically for fast GPU attachment.
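To make "bring your containerized model" concrete, here is a minimal sketch of the kind of HTTP entry point such a container might run. It uses only the Python standard library and assumes the one contract Cloud Run imposes on any container: listen on the port given in the PORT environment variable. The names run_model, handle_inference, and InferenceHandler are hypothetical, and run_model is a stub standing in for a real GPU-backed model call.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def run_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (a GPU forward pass in practice)."""
    return prompt.upper()


def handle_inference(body: bytes) -> dict:
    """Parse a JSON request body and return a JSON-serializable response."""
    payload = json.loads(body or b"{}")
    return {"output": run_model(payload.get("prompt", ""))}


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the client declared.
        length = int(self.headers.get("Content-Length", 0))
        response = json.dumps(handle_inference(self.rfile.read(length))).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def main() -> None:
    # Cloud Run tells the container which port to listen on via PORT.
    port = int(os.environ.get("PORT", "8080"))
    ThreadingHTTPServer(("0.0.0.0", port), InferenceHandler).serve_forever()
```

The container's entry point would call main(). Nothing in this code mentions the GPU: attaching one is part of the service configuration, which is why the application side can stay this small.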

What It Means for Your P&L

The economic shift is significant. If you are building an image generation tool, a video analysis agent, or a custom LLM inference engine, your infrastructure costs now track your request volume, and therefore your revenue, far more closely. No more over-provisioning for peak and paying for idle. If you have zero requests, you pay zero.
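A rough back-of-the-envelope model shows the size of the gap. The sketch below compares an always-on GPU instance with a scale-to-zero service for the "twice a day" usage pattern. Every price in it is a placeholder chosen for round numbers, not a Google Cloud list price, and the function names are hypothetical. The billable seconds per request should include startup and any idle time before the instance shuts down, not just the time spent computing.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def always_on_monthly_cost(hourly_rate: float) -> float:
    """Cost of keeping one GPU instance warm for the whole month."""
    return hourly_rate * HOURS_PER_MONTH


def scale_to_zero_monthly_cost(
    per_second_rate: float,
    requests_per_month: int,
    billable_seconds_per_request: float,
) -> float:
    """Cost when the GPU is billed only while an instance is alive for a request."""
    return per_second_rate * requests_per_month * billable_seconds_per_request


def break_even_requests(
    hourly_rate: float,
    per_second_rate: float,
    billable_seconds_per_request: float,
) -> float:
    """Monthly request count at which both options cost the same."""
    cost_per_request = per_second_rate * billable_seconds_per_request
    return always_on_monthly_cost(hourly_rate) / cost_per_request


# Placeholder prices for illustration only.
HOURLY_RATE = 1.00        # USD per hour for a warm GPU instance
PER_SECOND_RATE = 0.0004  # USD per billable second on a scale-to-zero service

warm = always_on_monthly_cost(HOURLY_RATE)
on_demand = scale_to_zero_monthly_cost(PER_SECOND_RATE, 60, 60.0)  # 2 requests/day, 60 s each
print(f"always-on: ${warm:.2f}/month, scale-to-zero: ${on_demand:.2f}/month")
print(f"break-even: {break_even_requests(HOURLY_RATE, PER_SECOND_RATE, 60.0):,.0f} requests/month")
```

The break-even figure is the useful output: below it, scale-to-zero is cheaper, and above it a warm instance starts to pay for itself. Plugging in real prices and measured request durations gives the number that matters for a specific product.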

This also simplifies the engineering side. Scaling a GPU cluster based on request volume is notoriously difficult. Handling driver updates across a fleet of machines is a full-time job. Cloud Run GPUs remove that entire layer of operational complexity. Your team spends time on the model and the product, not on the NVIDIA driver compatibility matrix.

Where It Fits (And Where It Doesn’t)

Cloud Run GPUs are designed for inference and lightweight fine-tuning. If you are doing massive, multi-week model training across a cluster of 512 H100s, this isn’t the tool for you; you still want GKE or Vertex AI for that.

But for the vast majority of ISV use cases, such as serving a fine-tuned model, running a specialized computer vision task, or powering an agentic workflow that needs GPU acceleration intermittently, Cloud Run GPUs are the new default. They turn a capital-intensive infrastructure hurdle into a predictable, request-based operating expense.
