Cartesia Sonic 3: Redefining Voice AI Speed

Voice AI has a latency problem. Humans notice conversational delays once they pass roughly 200 milliseconds. Most AI voice systems, until recently, couldn’t get anywhere close. The result was that AI-powered voice products felt stilted, unnatural, and frustrating to use in real time. You could always tell you were talking to a machine, not because the voice sounded synthetic, but because the pause gave it away.

Cartesia set out to solve that. The company builds real-time voice AI infrastructure, and its flagship model, Sonic, delivers audio responses with sub-90ms latency. That’s fast enough for genuinely natural conversation. More than 50,000 companies are now using it. One competitor benchmarked Sonic as outperforming “its next best alternative by a factor of four.”

What Cartesia Sonic 3 Actually Does

Cartesia Sonic 3 goes well beyond fast. Speed was the foundation, but the new version adds something harder to engineer: emotional authenticity. Sonic 3 can laugh. It can express excitement, concern, or warmth contextually, not as a static tone setting you configure in advance. It handles acronyms and initialisms intelligently, reading “NASA” as a word and “FBI” as letters without needing any special configuration.

That matters more than it sounds. A voice agent that mispronounces common abbreviations immediately loses credibility. A voice agent that sounds flat during a stressful customer support call makes a bad situation worse. Sonic-3 is built around the idea that naturalness is not a nice-to-have: it’s the product.

The model supports 40+ languages covering 95% of the world’s population, including nine Indian languages with exceptional Hindi support. Enterprises building for international markets get native-quality voices rather than translated-sounding approximations. For ISVs targeting global enterprise deployments, that coverage removes a real barrier to expansion.

The Line Platform: From Model to Full Voice Agent Stack

Cartesia isn’t just selling a TTS API. The company has built Line, a complete voice agent development platform that combines Sonic (text-to-speech), Ink (Cartesia’s own speech-to-text model), an SDK, deployment tooling, observability, and built-in evaluation frameworks into a single integrated stack.

The pitch is that voice agent development is currently fragmented across too many vendors and too many integration points. You pick a TTS provider, a STT provider, an LLM, an orchestration layer, and then spend months stitching them together and debugging latency across the seams. Line removes most of that. The entire stack is co-located and optimized end-to-end, which is why latency numbers stay low even under production load.
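To make the "latency across the seams" argument concrete, here is a minimal Python sketch of how per-stage processing time and hand-off time add up in a voice pipeline. The stage names, the Stage type, and every number are hypothetical placeholders chosen for illustration, not measurements of Cartesia or any other vendor. Treating a co-located hand-off as zero milliseconds is also a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step in a voice pipeline (hypothetical model, illustrative only)."""
    name: str
    processing_ms: float   # time the stage itself spends working
    network_hop_ms: float  # time to hand the result to the next stage


def end_to_end_ms(stages):
    """Total latency is the sum of every stage's work plus every hand-off."""
    return sum(s.processing_ms + s.network_hop_ms for s in stages)


def colocate(stages):
    """Model a co-located stack by assuming hand-offs cost nothing.

    This is a simplification: real co-location shrinks hand-off time
    rather than eliminating it.
    """
    return [Stage(s.name, s.processing_ms, 0.0) for s in stages]


# Placeholder numbers, not benchmarks.
fragmented = [
    Stage("stt", 100.0, 30.0),
    Stage("llm", 200.0, 30.0),
    Stage("tts", 80.0, 30.0),
]

print(end_to_end_ms(fragmented))            # work plus three hand-offs
print(end_to_end_ms(colocate(fragmented)))  # work only
```

The point of the sketch is only that hand-off cost scales with the number of seams, so removing seams lowers the total even when no individual model gets faster.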

For ISVs, this is a meaningful architectural decision. Building on a fragmented stack means every component update is a potential regression. Building on Line means Cartesia owns the performance contract across the full pipeline, and that’s a different kind of reliability to offer enterprise customers.

Line is also code-first, which is the right call for enterprise teams. You can start from a single prompt and get to a deployable agent in under 30 seconds, but everything compiles down to SDK code you can version, test, and extend. Background agents can listen, analyze, and write to external systems in parallel during a live call. Tool calling gives agents access to live knowledge bases and external actions. Multi-prompt configuration allows for more sophisticated reasoning than a single system prompt can support.

Cartesia Sonic 3 is Built on Google Cloud

To hit sub-90ms latency at scale, Cartesia needed infrastructure that could keep up. The company built on Google Cloud, using its GPU infrastructure and global network to run inference fast enough that the latency stays imperceptible. When you’re serving real-time voice to tens of thousands of companies, the compute and networking underneath it matter enormously.

The Google Cloud case study describes Cartesia as having built “the world’s fastest voice AI” on Google Cloud infrastructure, with Sonic reaching production quality in human evaluations. That’s not a marketing claim. It’s a constraint. At sub-90ms time-to-first-audio, there is almost no margin for infrastructure variance. Cartesia picked Google Cloud because the GPU availability and network performance could hold that number consistently from P50 to P99, across geographies.
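As a rough illustration of what holding a number "from P50 to P99" means, the sketch below computes nearest-rank percentiles over a list of time-to-first-audio samples and checks the tail against a 90ms budget. The function names and the sample values are assumptions made for this example. It does not describe how Cartesia or Google Cloud actually measure latency.

```python
def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it. pct is an integer 1..100."""
    if not samples_ms:
        raise ValueError("samples_ms must not be empty")
    if not isinstance(pct, int) or not 1 <= pct <= 100:
        raise ValueError("pct must be an integer from 1 to 100")
    ordered = sorted(samples_ms)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_report(samples_ms, budget_ms=90.0):
    """Summarize median and tail latency against a budget (hypothetical helper)."""
    p50 = percentile(samples_ms, 50)
    p99 = percentile(samples_ms, 99)
    return {"p50": p50, "p99": p99, "within_budget": p99 < budget_ms}


# Made-up samples: mostly fast, with a couple of slow outliers.
samples = [70.0] * 98 + [85.0, 120.0]
print(latency_report(samples))
```

A median well under budget says little on its own. The P99 figure is what shows whether the slowest one percent of requests still feel instantaneous.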

For GCP sellers, this is the kind of case study worth internalizing. Cartesia didn’t choose Google Cloud because of a preferred pricing arrangement or legacy contracts. They chose it because the latency requirement was non-negotiable and Google Cloud was where they could hit it. That’s a technical proof point worth bringing into conversations with ISVs building anything that requires real-time AI inference.

Who Is Already Using This

The customer roster tells you what Sonic is good for in production. ServiceNow is using it for enterprise AI voice agents. Quora’s Poe platform uses it to deliver high-quality, human-like voices across multiple languages. Daily.co, which builds real-time communication infrastructure, calls Sonic “the best voice model today for real-time multimodal use cases.” Tavus, which builds personalized AI video, describes the 90ms latency as a “game-changer” for immersive real-time conversation.

The pattern across all of them: real-time, customer-facing voice interactions where latency is a product-quality issue, not just a performance metric. If the voice hesitates, the user disengages. These companies can’t afford that.

What This Looks Like in Practice

Cartesia is a good example of what the ISV opportunity around GCP AI actually looks like at its best. The company built a product that wouldn’t exist without Google Cloud’s AI infrastructure, and that product is now embedded in tens of thousands of other products.

Every company using Cartesia’s API to build a voice assistant, an AI phone agent, or a real-time translation tool is running on top of infrastructure Cartesia built on Google Cloud. That’s the compounding nature of the ISV model: the infrastructure investment Cartesia made flows downstream to every company in its ecosystem. Cartesia also supports voice cloning (instant clones in 10 seconds, or fine-tuned Pro Voice Clones for enterprises that need a consistent brand voice at scale). That’s the kind of capability that turns a voice API into a platform differentiator.

The compliance posture matters here too. Line is SOC 2 Type II, HIPAA, and PCI Level 1 compliant. It supports SSO and in-VPC deployment for enterprises with data residency requirements. The enterprise checklist is covered, which is increasingly a prerequisite for any ISV trying to sell into regulated industries.

Sub-90ms latency was the hard technical requirement that made real conversation possible. Google Cloud is where Cartesia achieved it. The rest of the product (the emotion, the multilingual coverage, the full-stack Line platform, the compliance certifications) is what turns that technical achievement into a business.

Want to go deeper?

Cartesia customer story (Google Cloud): How Cartesia built the world’s fastest voice AI on Google Cloud infrastructure.
Cartesia Sonic-3: The latest model with emotion, laughter, and 40+ language support.
Cartesia Line: The end-to-end voice agent development platform.