
Voice Agents That Don't Feel Slow: A 2026 Architecture

Sub-800ms end-to-end voice agents are achievable in 2026. STT, LLM routing, TTS, latency budgets, and the architectural moves that make voice feel natural.

By LLMDex Editorial

A voice agent feels broken if the user has to wait more than about a second for a response. Below 800ms, conversation feels natural; above 1500ms, users start narrating their wait ("hello? are you there?"). Hitting that latency target with an LLM doing real reasoning between speech-to-text and text-to-speech is genuinely hard, and was infeasible at production quality two years ago.

In 2026 it's achievable. Three things changed: realtime APIs (OpenAI Realtime, Gemini Live), cheap streaming TTS (Cartesia, ElevenLabs Turbo), and frontier models that respond at 100+ tokens/sec on small queries. Putting them together still requires real architecture work. This piece is a working playbook.

The latency budget

Total round-trip latency:

Mic capture → STT first token → LLM first token → TTS first audio → Speaker output

For "natural" conversation, you want the total to be under 800ms. Allocations that work:

  • STT first-token: 200ms
  • LLM first-token: 250ms
  • TTS first-audio: 150ms
  • Network + buffering overhead: 200ms
  • Total: 800ms

Each component has a hard floor. STT can't reliably deliver a first token under ~150ms because you need enough audio to recognize a phoneme. TTS can't deliver first audio under ~100ms because there's an irreducible audio-encoding cost. LLM first-token latency depends on the model and the inference stack; most production stacks hit 200-400ms.
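One way to keep the budget honest is to encode it and check measured numbers against it in dev. A minimal sketch in Python, using the allocations from the table above (the helper itself is illustrative, not from any SDK):

# Minimal latency-budget checker. Stage names and millisecond
# allocations mirror the budget table above.
BUDGET_MS = {
    "stt_first_token": 200,
    "llm_first_token": 250,
    "tts_first_audio": 150,
    "network_buffering": 200,
}

def check_budget(measured_ms: dict[str, float], total_budget_ms: int = 800) -> None:
    total = 0.0
    for stage, budget in BUDGET_MS.items():
        ms = measured_ms[stage]
        total += ms
        flag = "OK" if ms <= budget else "OVER"
        print(f"{stage:18s} {ms:6.0f}ms / {budget}ms  {flag}")
    flag = "OK" if total <= total_budget_ms else "OVER"
    print(f"{'total':18s} {total:6.0f}ms / {total_budget_ms}ms  {flag}")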

Three components, each with its own architectural choices:

STT (speech-to-text)

The best 2026 options for realtime STT:

  • Cartesia STT: sub-150ms first token, multilingual, cheapest at scale.
  • Deepgram Nova: competitive latency, very mature SDKs.
  • OpenAI Whisper Realtime: bundled with the Realtime API; lower-friction integration if you're already on OpenAI.
  • Google Gemini Live STT: bundled with Gemini's voice products.

For self-hosted: Whisper Large v3 with vLLM is a credible option for budget-constrained deployments, but it's slower than the commercial alternatives and you'll pay in engineering time to tune it.

The biggest STT-side latency lever is streaming partials. Instead of waiting for end-of-utterance, your STT emits partial transcripts as the user speaks. You can start LLM inference on the partial transcript, then revise if the partial changes.
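In sketch form, assuming a hypothetical STT stream that yields (text, is_final) pairs and a hypothetical start_llm coroutine:

import asyncio

# Start LLM inference on a partial transcript; cancel and restart if a
# later partial contradicts it. stt_stream and start_llm are stand-ins
# for your actual STT and LLM clients.
async def run_turn(stt_stream, start_llm):
    llm_task, last_partial = None, ""
    async for text, is_final in stt_stream:
        if text == last_partial and not is_final:
            continue                        # partial unchanged, nothing to do
        if llm_task and text != last_partial:
            llm_task.cancel()               # partial changed: discard the early run
            llm_task = None
        last_partial = text
        if llm_task is None:
            llm_task = asyncio.create_task(start_llm(text))
        if is_final:
            return await llm_task           # final matched the partial: keep the head start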

LLM routing

Three patterns:

Pattern 1: One-shot full pipeline. STT completes, LLM runs, TTS runs. Simplest. Works only if your LLM first token is fast and your STT/TTS are fast, because total latency is the sum of each stage's first-token latency.

Pattern 2: Streaming pipeline. STT streams partial transcripts. LLM starts on the partial. TTS streams audio as LLM tokens come in. Lower total latency but more complex; you have to handle the case where the partial transcript was wrong and you need to discard or revise the LLM output.

Pattern 3: Speculative LLM. Run a small model (e.g., GPT-5 nano) on the partial transcript to start a response speculatively. When the final transcript arrives, decide whether to use the speculative response or run a fresh inference with the full model. This is what OpenAI's Realtime API does internally for some workloads.

Most production deployments use Pattern 2. Pattern 3 is the lowest-latency option but adds significant complexity.
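Pattern 3 in sketch form, with the simplest possible accept rule: keep the speculative answer only if the final transcript matches what the small model saw. Here small_model, full_model, and the transcript future are hypothetical stand-ins:

import asyncio

async def speculative_turn(partial, final_transcript_fut, small_model, full_model):
    spec_task = asyncio.create_task(small_model(partial))   # start early on the partial
    final = await final_transcript_fut                      # wait for end of utterance
    if final.strip() == partial.strip():
        return await spec_task                              # speculation was valid
    spec_task.cancel()                                      # transcript changed:
    return await full_model(final)                          # pay for a fresh full-model run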

TTS (text-to-speech)

Best 2026 options:

  • Cartesia TTS: sub-100ms first audio, very natural voices, sub-cent-per-thousand-characters pricing. The clear leader for streaming voice agents.
  • ElevenLabs Turbo / Flash: top-tier voice quality, slightly slower first audio (~150ms), more expensive at scale.
  • OpenAI Realtime TTS: bundled with the Realtime API.

Don't use non-streaming TTS for voice agents. The latency math doesn't work.
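Streaming TTS pays off because you can hand it clause-sized chunks as the LLM emits tokens, so first audio starts while the reply is still generating. A sketch, assuming an async iterator of text deltas and a hypothetical speak() call on a streaming TTS client:

SENTENCE_ENDS = (".", "!", "?", ",", ";")

async def stream_to_tts(llm_tokens, speak):
    buffer = ""
    async for token in llm_tokens:
        buffer += token
        # Flush at clause boundaries, with a minimum length so the TTS
        # gets enough text for natural prosody.
        if buffer.rstrip().endswith(SENTENCE_ENDS) and len(buffer) > 20:
            await speak(buffer)
            buffer = ""
    if buffer.strip():
        await speak(buffer)     # flush whatever is left at end of reply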

Choosing the LLM

The hardest decision. Three constraints push you toward smaller, faster models:

First-token latency. Frontier models (Claude Opus 4.7, GPT-5.5) typically have 400-800ms first-token. Mini-tier models (GPT-5 mini, Claude Haiku 4) hit 150-300ms. For voice, mini-tier is usually the right pick.

Cost. A voice agent at scale runs ~50K queries per user per month. At Claude Opus pricing ($75 / 1M output tokens), that's expensive. At GPT-5 mini pricing ($2 / 1M output), it's reasonable.
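To put numbers on "expensive", assuming ~250 output tokens per query (voice replies are short; the figure is an assumption, not a benchmark) and the prices above:

QUERIES_PER_USER_MONTH = 50_000
OUT_TOKENS_PER_QUERY = 250          # assumption: voice replies are short

def monthly_cost(price_per_m_output: float) -> float:
    tokens = QUERIES_PER_USER_MONTH * OUT_TOKENS_PER_QUERY
    return tokens / 1_000_000 * price_per_m_output

print(monthly_cost(75))   # Claude Opus: $937.50 per user per month
print(monthly_cost(2))    # GPT-5 mini:  $25.00 per user per month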

Quality. Voice agents typically deal with shorter, more focused queries than chat. Mini-tier quality is usually sufficient. Reach for flagship only if your domain genuinely needs frontier reasoning.

Recommendation:

  • Default: GPT-5 mini. Good first-token latency, reasonable cost, OpenAI Realtime integration is the smoothest if you're using it for STT/TTS too.
  • For Anthropic-friendly stacks: Claude Haiku 4. Slightly slower first-token, similar quality, easier integration with the rest of an Anthropic-based product.
  • For cost-bound deployments: Gemini 3 Flash. Cheapest mini-tier at competitive quality. Pair with Cartesia for STT/TTS.

Hitting the 800ms budget in practice

Concrete pipeline that delivers ~700ms end-to-end:

[Mic] → Cartesia STT (streaming) → [partial transcript at ~200ms]
     → GPT-5 mini (Realtime API) → [first token at ~450ms]
     → Cartesia TTS (streaming) → [first audio at ~600ms]
     → [Speaker] → [user hears response at ~700ms]
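Wired up, the pipeline is a chain of async streams, each stage consuming the previous one's output so the stages overlap instead of running back-to-back. All three stage functions here are hypothetical wrappers around the vendor SDKs:

async def voice_turn(mic_audio, stt, llm, tts, speaker):
    transcripts = stt(mic_audio)    # yields partial/final transcripts (~200ms to first partial)
    tokens = llm(transcripts)       # yields text deltas (~450ms to first token)
    audio = tts(tokens)             # yields audio frames (~600ms to first audio)
    async for frame in audio:
        await speaker.play(frame)   # user hears speech at ~700ms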

Three implementation details that matter:

Use the Realtime API end-to-end if possible. OpenAI's Realtime API integrates STT, LLM, and TTS in one streaming pipeline with optimized internal handoffs. Total latency on simple queries is ~500ms. The downside is you're locked into OpenAI's STT and TTS quality, both of which are good but not best-in-class.

Or build a custom pipeline with best-in-class components. Cartesia STT + Cartesia TTS + GPT-5 mini gets you slightly higher quality at the cost of more integration work. We use this in production.

Avoid synchronous tool calls during a turn. If your LLM needs to call a database, the database call adds latency to every turn. Pre-compute, cache, or run tool calls asynchronously and inject results between turns rather than mid-turn.
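One way to keep a tool call off the critical path, sketched with hypothetical run_tool and speak_reply helpers: fire it concurrently with the agent's spoken reply and inject the result into context before the next turn.

import asyncio

async def turn_with_tool(reply_text, tool_name, tool_args,
                         run_tool, speak_reply, context):
    tool_task = asyncio.create_task(run_tool(tool_name, tool_args))
    await speak_reply(reply_text)        # e.g. "Let me check that for you..."
    result = await tool_task             # usually finished before speech ends
    # The result lands in context for the *next* turn's inference,
    # so no turn ever blocks on the database.
    context.append({"role": "tool", "name": tool_name, "content": result})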

What still doesn't work

Three failure modes:

Long, branching conversations. Voice agents that need to maintain state across many turns (e.g., a complex customer service flow with branching paths) start losing the thread. You need explicit dialog state management (a state machine, a per-conversation memory) on top of the LLM. The LLM alone is not reliable enough at multi-turn state.
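A minimal shape for that state layer, with invented states for a support flow; the point is that code, not the model, decides where the conversation is:

from enum import Enum, auto

class State(Enum):
    GREETING = auto()
    COLLECT_ACCOUNT = auto()
    VERIFY_IDENTITY = auto()
    RESOLVE_ISSUE = auto()
    DONE = auto()

TRANSITIONS = {
    State.GREETING:        {"account_given": State.COLLECT_ACCOUNT},
    State.COLLECT_ACCOUNT: {"account_valid": State.VERIFY_IDENTITY},
    State.VERIFY_IDENTITY: {"verified": State.RESOLVE_ISSUE,
                            "failed": State.DONE},
    State.RESOLVE_ISSUE:   {"resolved": State.DONE},
}

def advance(state: State, event: str) -> State:
    # The LLM classifies the user's turn into an event string;
    # the table, not the LLM, picks the next state.
    return TRANSITIONS.get(state, {}).get(event, state)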

Non-standard accents and code-switching. STT quality on non-standard accents and on code-switching (mixing English and another language mid-sentence) is meaningfully worse than on clean English. For consumer voice products serving global markets, this is a real product limitation.

Background noise. Even good STT degrades with background noise. Voice agents in noisy environments (cars, public spaces) have a higher word error rate (WER) than in quiet ones, and that gap shows up as user-visible misunderstandings. Mitigation: use STT with explicit noise-cancellation models (Deepgram and Cartesia both have noise-robust variants) and read critical inputs back to the user for confirmation.
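The read-back can be as simple as gating on STT confidence. The confidence field and the 0.85 threshold below are assumptions; most streaming STT APIs report something comparable:

def maybe_confirm(transcript: str, confidence: float, is_critical: bool):
    # Read critical, low-confidence inputs back to the user instead of
    # acting on a possible misrecognition.
    if is_critical and confidence < 0.85:
        return f"Just to confirm, you said: {transcript}?"
    return None     # confident enough: skip the confirmation turn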

Cost at scale

For a voice agent serving 100K turns/day:

  • STT (Cartesia, ~10s of audio per turn): ~$0.001/turn → $100/day
  • LLM (GPT-5 mini, ~500 tokens in/out per turn): ~$0.0005/turn → $50/day
  • TTS (Cartesia, ~50 chars per turn): ~$0.0005/turn → $50/day
  • Total: ~$200/day at 100K turns.

That's roughly $0.002 per turn, affordable for most consumer products and trivial for enterprise. The cost has dropped roughly 100x since 2023, which is part of why voice agents have become production-viable.

Concrete recommendation

If you're shipping a voice agent in 2026, start here:

  1. Pipeline: OpenAI Realtime API for the simplest path; Cartesia STT + GPT-5 mini + Cartesia TTS for higher quality.
  2. Latency budget: target 800ms end-to-end. Measure each component. If you're over budget, the LLM is usually the place to optimize first.
  3. Domain integration: state machine for multi-turn conversations, async tool calls, explicit memory.
  4. Iteration: run 50-turn end-to-end traces in dev. Listen to them. Latency issues hide easily. A minimal trace helper is sketched below.
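For those traces, even a crude per-stage timestamp log makes latency regressions visible. An illustrative helper:

import time

class TurnTrace:
    # Record wall-clock marks at each stage boundary of one turn.
    def __init__(self):
        self.t0 = time.monotonic()
        self.marks = []

    def mark(self, stage: str) -> None:
        self.marks.append((stage, (time.monotonic() - self.t0) * 1000))

    def report(self) -> str:
        return " | ".join(f"{s}: {ms:.0f}ms" for s, ms in self.marks)

# Usage: trace.mark("stt_first_partial"), trace.mark("llm_first_token"),
# trace.mark("tts_first_audio"), then print(trace.report()) per turn.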

Voice is the AI product category where the gap between "demoable" and "shippable" is largest. The architecture above is the gap. Build it once and the rest is content.
