Google Built a TPU for the Age of Inference. Meet Ironwood.

For most of the history of AI infrastructure, training got all the attention. Faster model checkpoints, bigger parameter counts, splashier benchmarks. That’s the glamorous side of the stack, and it’s where the research energy has gone.

Inference is where the money goes. Serving a model to real users continuously is the cost that actually compounds as AI products grow up. Google built Ironwood for that reality, and it shows in the design choices.

Built for Serving, Not Just Training

Ironwood is Google’s 7th-generation custom AI chip, announced in April 2025. Apart from the original inference-only TPU, earlier generations were built to handle training as well as serving. Ironwood is the first one Google describes as designed specifically around inference at scale: high volume, low latency, efficient.

The headline numbers are large enough to be meaningless without context, so start with the two that matter most for inference workloads: 192 GB of memory per chip, with 7.37 terabytes per second of memory bandwidth. The capacity is six times that of the previous generation. The reason this matters is simple: large models need to fit in memory to run efficiently. Some models that previously had to be split across multiple chips, with all the coordination overhead that creates, can now fit on one. Fewer chips per request means lower latency and lower cost per response.
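To make the capacity argument concrete, here is a back-of-the-envelope sketch in Python that counts how many chips are needed just to hold a model's weights. The 192 GB figure comes from the numbers above, and the previous-generation capacity is derived from the "six times" claim. The 70 GB model is hypothetical, and the estimate deliberately ignores KV cache, activations, and runtime overhead, so real deployments need more headroom than this suggests.

```python
import math

IRONWOOD_HBM_GB = 192                      # per-chip memory, from the article
PREVIOUS_GEN_HBM_GB = IRONWOOD_HBM_GB / 6  # derived from the "six times" claim


def chips_for_weights(weights_gb: float, hbm_gb: float) -> int:
    """Minimum number of chips needed to hold the weights alone.

    Simplification: ignores KV cache, activations, and runtime overhead.
    """
    return math.ceil(weights_gb / hbm_gb)


# Hypothetical model with 70 GB of weights (illustrative, not a benchmark).
weights_gb = 70
old_chips = chips_for_weights(weights_gb, PREVIOUS_GEN_HBM_GB)
new_chips = chips_for_weights(weights_gb, IRONWOOD_HBM_GB)
print(f"previous generation: {old_chips} chips, Ironwood: {new_chips} chip")
```

Under these assumptions the same weights that span three previous-generation chips sit on a single Ironwood chip, which is the case where cross-chip coordination disappears entirely.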

The raw compute numbers are also worth stating. Each Ironwood chip delivers 4,614 TFLOPs of peak FP8 performance. At pod scale, up to 9,216 chips can be configured together, producing 42.5 exaflops of total peak compute. Google puts that at more than 24 times the compute of El Capitan, the world’s fastest supercomputer at the time of the announcement, though the comparison sets Ironwood’s FP8 peak against El Capitan’s FP64 figure, so it is not like-for-like. For most inference workloads, you won’t use a pod of that scale. But the architecture that makes pod scaling possible is the same architecture that makes single-chip inference efficient.
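The pod figure follows directly from the per-chip number, and the arithmetic is easy to check. The short Python sketch below uses only the two figures quoted above; the variable names are illustrative. Keep in mind these are peak values, so sustained throughput on a real workload will be lower.

```python
CHIPS_PER_POD = 9_216            # maximum pod size, from the article
TFLOPS_PER_CHIP = 4_614          # peak FP8 TFLOPs per chip, from the article
TFLOPS_PER_EXAFLOP = 1_000_000   # 1 exaflop = 10^18 FLOP/s = 10^6 TFLOPs

pod_tflops = CHIPS_PER_POD * TFLOPS_PER_CHIP
pod_exaflops = pod_tflops / TFLOPS_PER_EXAFLOP
print(f"{pod_exaflops:.1f} exaflops")  # prints "42.5 exaflops"
```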

Ironwood also introduces native support for FP8, an 8-bit floating-point format widely used for quantized inference. Running models at FP8 instead of a 16-bit format halves the memory needed for weights and improves throughput, usually without significantly degrading output quality. Previous TPU generations required software workarounds to reach FP8. Ironwood handles it in hardware, which removes overhead and simplifies deployment.
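A rough way to see why precision matters for serving: weights stored in FP8 take one byte per parameter instead of two for BF16, so there is half as much data to hold and half as much to stream from memory. The Python sketch below estimates both effects. The bandwidth figure is the one quoted earlier; the 70-billion-parameter model is hypothetical, and the timing is only a lower bound for a memory-bound decode step that ignores compute time, KV cache traffic, and batching.

```python
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}  # storage width of each format
HBM_BANDWIDTH_TB_S = 7.37                # per-chip bandwidth, from the article


def weights_gb(params_billions: float, fmt: str) -> float:
    """Weight footprint in GB (1 GB = 10^9 bytes)."""
    return params_billions * BYTES_PER_PARAM[fmt]


def weight_read_ms(gb: float) -> float:
    """Time to stream the weights from memory once, in milliseconds.

    Rough lower bound for one memory-bound decode step; ignores compute
    time, KV cache traffic, and batching.
    """
    return gb / HBM_BANDWIDTH_TB_S  # GB / (TB/s) works out to milliseconds


# Hypothetical 70-billion-parameter model (illustrative only).
for fmt in ("bf16", "fp8"):
    gb = weights_gb(70, fmt)
    print(f"{fmt}: {gb:.0f} GB of weights, {weight_read_ms(gb):.1f} ms per full read")
```

The point of the sketch is the ratio rather than the absolute numbers: halving the bytes per parameter halves both the footprint and the minimum time to read the weights.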

The Economics Are the Point

Ironwood delivers twice the performance per watt of its predecessor, Trillium (also known as TPU v6e), and more than 4x the performance per chip. At the scale where inference costs actually matter, millions of requests per day, a 2x improvement in efficiency is a meaningful reduction in infrastructure spend. It is the kind of improvement that changes whether an AI-powered feature is profitable at a given price point.
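Performance per watt is work per joule, so doubling it halves the energy spent on each request. The Python sketch below shows how that flows through to a daily energy bill. Every input here is a hypothetical placeholder (request volume, joules per request, electricity price), and energy is only one component of serving cost, so treat this as an illustration of the scaling rather than a pricing model.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules


def energy_cost_per_day(requests_per_day: float,
                        joules_per_request: float,
                        usd_per_kwh: float) -> float:
    """Daily energy cost in USD for a given request volume."""
    kwh = requests_per_day * joules_per_request / JOULES_PER_KWH
    return kwh * usd_per_kwh


# Hypothetical inputs, chosen only to illustrate the scaling.
requests_per_day = 10_000_000
baseline_joules = 500.0                 # assumed energy per request today
improved_joules = baseline_joules / 2   # 2x performance per watt
usd_per_kwh = 0.10

baseline = energy_cost_per_day(requests_per_day, baseline_joules, usd_per_kwh)
improved = energy_cost_per_day(requests_per_day, improved_joules, usd_per_kwh)
print(f"baseline: ${baseline:.2f}/day, with 2x perf per watt: ${improved:.2f}/day")
```

Whatever the real inputs are, the energy line scales linearly with joules per request, which is why a 2x efficiency gain matters more the higher the request volume goes.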

There is also a supply angle worth mentioning. NVIDIA GPU procurement at volume has involved allocation queues stretching 6 to 18 months for recent generations. Ironwood is Google-designed and Google-owned. It is available through Vertex AI without navigating third-party allocation constraints. For teams planning infrastructure scaling, that certainty has real operational value regardless of raw performance comparisons.

The SparseCore accelerator built into Ironwood is worth calling out separately. It handles embedding-heavy workloads, such as recommendation systems, search ranking, and retrieval, more efficiently than general-purpose compute. If your AI product involves any kind of ranking, matching, or retrieval at scale, SparseCore is doing meaningful work that would otherwise require more chips or more latency.

Who This Actually Affects

Ironwood matters most to teams running high-volume AI inference in production: enough volume that cost per token is a line item someone watches. At prototype scale, compute choice barely matters. At production scale, it determines product margin.

It is also relevant for anyone working with very large models or long-context processing. The memory headroom on a single Ironwood chip changes what is feasible without multi-chip distribution complexity. A model that previously required four chips to run efficiently might run on one. That is not just a cost reduction. It is a latency reduction, because coordinating across chips adds overhead that single-chip serving eliminates.

Ironwood powers Google’s own AI workloads, including Gemini serving. When you access it through Vertex AI, you are running on the same infrastructure class Google uses internally. That’s not a marketing claim. It’s a practical consequence of Google building hardware for its own needs and making it available to customers.

The questions worth sitting with: What proportion of your AI infrastructure spend today is inference versus training? If your cost per token dropped by half, which features currently off your roadmap would become economically viable? And if GPU procurement delays have affected your scaling plans, is on-demand access to Google-owned inference hardware worth a closer look?
