Google Research just published a way to cut AI serving costs by 50% with zero accuracy loss. The interesting part is what happens to the ISVs who figure this out first.
Apigee Got a New Job: The Control Plane for Your AI.
Apigee has evolved from an API gateway into a control plane for LLM traffic, agent actions, and MCP tools. Here is why that matters for anyone building AI features at scale.
GPU Inference Without the Cluster. Cloud Run Finally Makes That Real.
Cloud Run now supports GPUs with scale-to-zero billing. For AI inference workloads that are bursty, sporadic, or just getting started, that changes the math entirely.
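A minimal sketch of what such a deployment looks like. The service name, image, project, and region below are placeholders, and the exact GPU flags, release track, and CPU/memory minimums vary by region and release, so verify against the current Cloud Run documentation:

```shell
# Sketch: deploy a container with one NVIDIA L4 GPU on Cloud Run.
# Service name, image, and region are placeholders; Cloud Run GPU
# pairs the GPU with minimum CPU/memory allocations.
gcloud run deploy my-inference-service \
  --image=us-docker.pkg.dev/my-project/my-repo/inference:latest \
  --region=us-central1 \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --cpu=4 \
  --memory=16Gi \
  --no-cpu-throttling \
  --min-instances=0   # scale to zero: no GPU held, no charge, while idle
```

With `--min-instances=0`, an idle service accrues no GPU charges at all; the trade-off is a cold start when the next request arrives, which is the math that favors bursty and sporadic inference workloads.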
LLM Traffic Is Weird. Your Infrastructure Needs to Know That.
Standard load balancers treat LLM inference like any other HTTP traffic, even though per-request cost varies by orders of magnitude with prompt length, output length, and KV-cache state. That is expensive and slow. GKE Inference Gateway knows the difference.
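As a sketch of what "knowing the difference" means in practice: GKE Inference Gateway builds on the Gateway API Inference Extension, which groups model-server pods into a pool and picks endpoints using inference-aware signals (such as queue depth and KV-cache utilization) instead of round-robin. The resource names, API version, and field names below are illustrative assumptions and should be checked against the current CRDs in your cluster:

```shell
# Illustrative only: register model-server pods as an InferencePool and
# route a Gateway listener to it. API versions and fields may differ.
kubectl apply -f - <<'EOF'
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm                      # model-server pods to balance across
  extensionRef:
    name: vllm-endpoint-picker     # hypothetical endpoint-picker service
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
    - name: inference-gateway      # hypothetical Gateway name
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: vllm-pool
EOF
```

The design point is that the backend of the route is a pool with an endpoint-picking extension, not a plain Service, which is what lets routing decisions see inference load rather than just connection counts.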



