Load balancers do one thing well: spread requests evenly across servers. Round-robin, least-connections, weighted routing: pick your flavor. For stateless web traffic, this works fine. For large language model inference, it quietly causes real problems, the kind that feels like a mystery until you open your cloud bill and the mystery solves itself.
The Mismatch Nobody Talks About
Here’s the thing about web servers: they forget. A request comes in, gets handled, and the server moves on with no memory of it ever happening. That statelessness is actually a feature. It’s what makes horizontal scaling simple.
LLMs are built differently. When a model works through a prompt, it builds up a rich internal picture of everything it has read. Think of it as working memory. In infrastructure terms, that working memory lives in something called a KV cache, and it lives on a specific server. If the next message from that same user gets routed somewhere else, that memory is gone. The new server starts over, recomputing everything from scratch, burning time and money for no reason other than the routing layer had no idea what it was dealing with.
A traditional load balancer looks at things like connection count or how busy a server appears to be. It has no way of knowing which server holds which conversation context. So it guesses, routes blind, and your users feel it as sluggish responses while your infrastructure costs quietly climb.
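To make the contrast concrete, here is a toy sketch of both strategies side by side. The server names and the hashing scheme are invented for illustration; a real LLM-aware gateway also weighs load and cache occupancy, while this shows only the session-affinity idea:

```python
import hashlib
from itertools import cycle

SERVERS = ["gpu-0", "gpu-1", "gpu-2"]  # hypothetical backend pool

# Round-robin: ignores which server holds a session's KV cache,
# so follow-up messages usually land on a cold server.
_rr = cycle(SERVERS)
def route_round_robin(session_id: str) -> str:
    return next(_rr)

# Cache-aware: pin each session to one server so its KV cache
# is reused on every follow-up message.
_session_map: dict[str, str] = {}
def route_cache_aware(session_id: str) -> str:
    if session_id not in _session_map:
        digest = hashlib.sha256(session_id.encode()).hexdigest()
        _session_map[session_id] = SERVERS[int(digest, 16) % len(SERVERS)]
    return _session_map[session_id]
```

Three consecutive round-robin calls for the same conversation visit three different servers; the cache-aware router returns the same one every time.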
What GKE Inference Gateway Changes
GKE Inference Gateway was built to solve exactly this. Rather than routing by generic server health, it routes based on which server already holds the relevant context for an incoming request. Conversations stay on the servers that know them. No unnecessary recomputation, no cold starts mid-session.
Google runs Vertex AI on GKE Inference Gateway, so the performance numbers come from real production traffic rather than a controlled test. The results: 96% faster time-to-first-token on workloads like coding assistants where prompts share a lot of common context, and 40% higher throughput from the same hardware. That second number is the one that matters most for the monthly bill.
It also handles something called disaggregated serving, which recognizes that reading a long prompt and generating a response are fundamentally different computational jobs that benefit from different hardware. Forcing both onto the same servers is like hiring one person to be your architect, your contractor, and your building inspector. Technically possible. Deeply suboptimal. Someone is going to have a bad time.
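The split can be sketched as two separate server pools, one per phase. The pool names and the `Request` shape below are made up for illustration; the point is only that prefill and decode traffic fan out to hardware sized for each job:

```python
import zlib
from dataclasses import dataclass

# Hypothetical pools; a real deployment discovers these from the cluster.
PREFILL_POOL = ["prefill-0", "prefill-1"]            # compute-bound: reading long prompts
DECODE_POOL = ["decode-0", "decode-1", "decode-2"]   # bandwidth-bound: generating tokens

@dataclass
class Request:
    session_id: str
    phase: str  # "prefill" or "decode"

def route_disaggregated(req: Request) -> str:
    # Pick the pool by phase, then stay sticky within it via a stable hash
    # of the session id so a conversation keeps hitting the same server.
    pool = PREFILL_POOL if req.phase == "prefill" else DECODE_POOL
    return pool[zlib.crc32(req.session_id.encode()) % len(pool)]
```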
Request prioritization rounds out the picture. You can tell the system which traffic matters most, so fraud detection calls always get through while background batch jobs wait their turn. This sounds obvious until you’ve lived through a background job deciding that 2pm on a Tuesday is a great time to summarize your entire document corpus, while your actual users start refreshing the page and questioning their life choices.
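Under the hood, this kind of scheduling boils down to a priority queue. A minimal Python sketch with made-up request names (real systems layer on preemption and fairness):

```python
import heapq
import itertools

_counter = itertools.count()  # tie-breaker preserves FIFO order within a priority
_queue: list = []

def submit(priority: int, name: str) -> None:
    # Lower number = more urgent.
    heapq.heappush(_queue, (priority, next(_counter), name))

def next_request() -> str:
    return heapq.heappop(_queue)[2]

submit(9, "batch-summarize-corpus")
submit(0, "fraud-detection-call")
submit(5, "chat-turn")
# The fraud-detection call is served first even though the batch job arrived earlier.
```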
What This Means for Your Business
For engineering teams watching GPU costs climb month over month, the instinct is to provision more capacity. Inference Gateway reframes that decision. Before adding hardware, it’s worth asking whether the hardware you already have is being used well. A 40% throughput improvement means serving the same traffic with significantly less GPU spend, or handling meaningfully more traffic for the same cost. Over time, that compounds.
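The arithmetic is worth making explicit. With illustrative numbers (the fleet size below is invented, not from the article), a 40% per-GPU throughput gain shrinks the hardware needed for the same traffic like so:

```python
# Back-of-envelope: if each GPU serves 40% more requests,
# how many GPUs does the same traffic need?
baseline_gpus = 100       # hypothetical current fleet
throughput_gain = 0.40    # 40% more throughput per GPU

needed_gpus = baseline_gpus / (1 + throughput_gain)
savings_pct = (1 - needed_gpus / baseline_gpus) * 100
print(f"{needed_gpus:.0f} GPUs (~{savings_pct:.0f}% fewer)")  # → 71 GPUs (~29% fewer)
```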
For ISVs, there’s also a product quality angle. The workloads that benefit most from smarter routing (coding assistants, document Q&A, long-context summarization) are exactly the workloads where response speed is most visible to end users. Faster, more consistent responses aren’t just a cost optimization. They’re the difference between a feature that feels polished and one that feels like it’s thinking too hard.
How the Alternatives Stack Up
AWS and Azure have both put work into LLM serving, but both have focused their optimizations at the model server level. What neither has done is build a managed routing layer that understands the specific way LLM conversations work and routes traffic accordingly. That’s not a knock on either platform. It’s just a gap that exists, and it’s architectural rather than something that gets patched in a future release.
If your team is already running a solid inference server, the model serving layer is probably in decent shape. The honest question is what’s happening in the layer above it. For most teams, the routing layer is a standard ingress controller that knows nothing about the traffic passing through it. It’s doing its best with information it was never designed to use. That’s the gap GKE Inference Gateway was built to close.
Want to go deeper?
- GKE Inference Gateway GA announcement: benchmark numbers, architecture overview, and supported inference backends.
- GKE and Kubernetes at KubeCon 2025 (Google Cloud blog): Google’s own coverage of Inference Gateway capabilities and production performance data.
- "Google Debuts GKE Inference Gateway at KubeCon" (The New Stack): coverage of disaggregated serving, LLM-aware routing, and the KubeCon debut.
