What actually drives sub-800ms turn-taking in voice AI
Perceived latency is not the same as model latency
When we say sub-800ms turn-taking, we mean the perceived gap between when the caller stops speaking and when they start hearing the agent's reply. That number is what the human ear judges, and the human ear is unforgiving. A model that produces its first token in 200ms is necessary but nowhere near sufficient: the other 600ms is consumed before and after the model call, and most teams measure the model and ignore the rest.
We instrument every layer of the voice stack with timestamps: end-of-utterance detection, ASR final, first model token, first TTS audio frame, audio leaving our network, audio arriving at the carrier, audio reaching the caller's handset. The 800ms budget is divided across those segments and enforced per release.
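A minimal sketch of what that per-turn instrumentation can look like; the stage names and budget numbers here are illustrative placeholders, not a production schema:

```python
import time
from dataclasses import dataclass, field

# Stage budgets in ms; illustrative numbers, not a production schema.
BUDGET_MS = {
    "eou_detected": 200,
    "asr_final": 100,
    "first_model_token": 250,
    "first_tts_frame": 150,
    "network_transit": 100,
}

@dataclass
class TurnTrace:
    """Collects monotonic timestamps for one caller turn."""
    t0: float = field(default_factory=time.monotonic)
    marks: dict = field(default_factory=dict)

    def mark(self, stage: str) -> None:
        self.marks[stage] = (time.monotonic() - self.t0) * 1000  # ms since turn start

    def segments(self) -> dict:
        """Per-stage durations, in the order the stages fired."""
        ordered = sorted(self.marks.items(), key=lambda kv: kv[1])
        out, prev = {}, 0.0
        for stage, t in ordered:
            out[stage] = t - prev
            prev = t
        return out

    def over_budget(self) -> list:
        return [s for s, d in self.segments().items()
                if d > BUDGET_MS.get(s, float("inf"))]
```

Anything `over_budget` flags on the aggregate dashboard is a candidate release blocker.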
Voice activity detection is where most teams lose 200 milliseconds
Detecting that the caller has stopped speaking — without prematurely cutting them off mid-pause — is the silent killer in latency budgets. Naive VAD waits for 500–700ms of silence to be sure the caller is done. That alone exhausts most of the budget before the model has even been called.
The fix is a learned end-of-utterance detector that combines acoustic VAD with a lightweight prosodic model: rising intonation suggests the caller is mid-question, while declarative sentence-final intonation suggests they are done. We get end-of-utterance detection down to a reliable 180–220ms, with a hard backstop cutoff at 500ms for outliers.
| Stage | Target |
| --- | --- |
| End-of-utterance | ~200ms learned EOU detector |
| ASR final | < 100ms after EOU, streaming partials |
| First model token | < 250ms short-context warm path |
| First TTS audio | < 150ms streaming synthesis |
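A minimal sketch of the two-signal end-of-utterance decision described above; `vad_is_silent` and `prosody_eou_score` are assumed hooks standing in for an acoustic VAD and a lightweight prosody classifier, not real library APIs:

```python
import time

# Thresholds echo the numbers in the text: a fast learned path
# well under the 500ms hard backstop.
EOU_CONFIDENCE = 0.85
HARD_CUTOFF_S = 0.5
FAST_PATH_MIN_SILENCE_S = 0.15

class EndOfUtteranceDetector:
    """Fuses acoustic VAD with a prosodic end-of-utterance score."""

    def __init__(self, vad_is_silent, prosody_eou_score):
        self.vad_is_silent = vad_is_silent
        self.prosody_eou_score = prosody_eou_score
        self.silence_start = None
        self.frames = []

    def on_frame(self, frame) -> bool:
        """Returns True once the caller is judged done speaking."""
        self.frames.append(frame)
        if not self.vad_is_silent(frame):
            self.silence_start = None            # caller is still talking
            return False
        now = time.monotonic()
        if self.silence_start is None:
            self.silence_start = now
        silence = now - self.silence_start
        # Fast path: sentence-final prosody lets us commit early.
        if (silence >= FAST_PATH_MIN_SILENCE_S
                and self.prosody_eou_score(self.frames) >= EOU_CONFIDENCE):
            return True
        # Backstop: the hard cutoff catches prosodic outliers.
        return silence >= HARD_CUTOFF_S
```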
Streaming ASR with stable partials lets the model start thinking early
Treating the speech-to-text path as a request-response call that returns a final transcript at end of utterance is the second-largest source of avoidable latency. The right architecture streams partial transcripts continuously, so the reasoning model receives a stable partial roughly 200ms before the caller stops speaking. By the time end-of-utterance fires, the model has already been generating tokens against that partial.
There is a tradeoff here: starting the model on a partial that flips on the last word means burning some compute on a generation that gets discarded. We accept that cost; the wall-clock latency improvement is worth the marginal inference spend.
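A sketch of that speculative pattern, assuming a hypothetical async `generate` wrapper around the model call:

```python
import asyncio

class SpeculativeTurn:
    """Starts the model on a stable ASR partial and discards the
    speculative generation if the final transcript flips."""

    def __init__(self, generate):
        self.generate = generate   # assumed async model call: str -> reply
        self.task = None
        self.partial = None

    def on_stable_partial(self, text: str) -> None:
        # Begin generating against the partial before end-of-utterance fires.
        self.partial = text
        self.task = asyncio.create_task(self.generate(text))

    async def on_final(self, text: str):
        if self.task and text == self.partial:
            return await self.task           # speculation paid off
        if self.task:
            self.task.cancel()               # last word flipped: burn the tokens
        return await self.generate(text)     # regenerate against the final
```

The cancel-and-regenerate branch is the compute we agree to burn; it only fires when the stable partial flips at the last word.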
First-token latency on the model depends on warm path, context size, and prefill
Once the prompt reaches the model, first-token latency is dominated by prefill — the time to compute attention over every token in the context. A 30,000-token context produces a noticeably slower first token than a 4,000-token one, even on the same model and the same hardware. For voice, ruthless context discipline is the rule: short system prompt, retrieved chunks capped, conversation history truncated to recent turns.
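A sketch of what that context discipline can look like in practice; the token budgets and the `count_tokens` tokenizer hook are illustrative assumptions:

```python
# Illustrative context budget for a voice turn; numbers are assumptions.
MAX_SYSTEM_TOKENS = 400
MAX_CHUNK_TOKENS = 1200      # retrieved context, total
MAX_HISTORY_TURNS = 6        # most recent conversation turns only

def build_voice_prompt(system: str, chunks: list[str], history: list[dict],
                       count_tokens) -> list[dict]:
    """Caps every prompt component so prefill stays short."""
    assert count_tokens(system) <= MAX_SYSTEM_TOKENS, "system prompt too long"
    kept, used = [], 0
    for chunk in chunks:                  # keep retrieved chunks until the cap
        n = count_tokens(chunk)
        if used + n > MAX_CHUNK_TOKENS:
            break
        kept.append(chunk)
        used += n
    messages = [{"role": "system", "content": system}]
    messages += [{"role": "system", "content": c} for c in kept]
    messages += history[-MAX_HISTORY_TURNS:]   # drop older turns outright
    return messages
```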
Warm path matters. The first call after a cold model load is 1.5–3x slower than steady-state because of compilation, cache warmup, and KV-cache initialization. Production deployments keep instances warm with synthetic traffic when real traffic is light.
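A minimal keep-warm loop under the same assumptions, with a hypothetical async `infer` call and a traffic-age hook:

```python
import asyncio

async def keep_warm(infer, seconds_since_last_request, idle_s: float = 30.0):
    """Fires a one-token synthetic request whenever real traffic has gone
    quiet, so compilation and KV-cache state stay hot."""
    while True:
        await asyncio.sleep(idle_s)
        if seconds_since_last_request() >= idle_s:
            await infer("ping", max_tokens=1)   # throwaway generation
```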
TTS streaming changes the math because audio plays before synthesis finishes
Generating the entire TTS audio buffer before playing it is the easiest way to design a voice system. It is also the slowest. A two-second response synthesizes in roughly 600ms on a good model; the caller hears it 600ms after the model finishes generating, which adds enough latency to fail the budget.
Streaming TTS plays the first audio frame as soon as the first 80–150ms of speech is synthesized, while the rest of the synthesis runs in parallel with playback. The caller hears the agent start speaking before the rest of the sentence has finished synthesizing. The perceived turn-take improves by 400–600ms versus buffered synthesis.
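A sketch of the streaming playback loop, assuming a hypothetical `synthesize` async generator that yields roughly 20ms PCM frames and a `play_frame` playback sink:

```python
FIRST_FRAME_MS = 120   # start playing once ~80-150ms of speech exists

async def stream_tts_to_caller(synthesize, play_frame, text: str):
    """Plays audio while synthesis is still running."""
    buffered_ms, started = 0, False
    backlog = []
    async for frame in synthesize(text):
        if not started:
            backlog.append(frame)
            buffered_ms += 20
            if buffered_ms >= FIRST_FRAME_MS:   # enough audio to start cleanly
                started = True
                for f in backlog:
                    await play_frame(f)
                backlog.clear()
            continue
        await play_frame(frame)                 # playback overlaps synthesis
    for f in backlog:       # responses shorter than the buffer: flush them
        await play_frame(f)
```

The small initial buffer is the price of starting cleanly; playing the very first frame immediately risks audible underruns.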
Barge-in handling is part of the latency budget, not a separate concern
When the caller starts speaking while the agent is still talking, the agent has to stop within 100ms or the caller experiences the agent as a steamroller. Barge-in is a latency problem in disguise: detect the caller's voice, cancel the in-flight TTS playback, cancel the in-flight model generation, and resume listening — all under 100ms.
The mistake we see is treating barge-in as a UX feature added late. By then the architecture has assumed one-shot turn-taking and the cancellation paths are slow. Designing for barge-in from day one keeps the latency budget honest under conversational load, which is the load production traffic actually has.
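A sketch of a cancellation path designed for that budget, with `flush_playout` and `resume_listening` as assumed hooks into the audio pipeline:

```python
import asyncio

BARGE_IN_BUDGET_S = 0.1   # the agent must go quiet within 100ms

async def on_caller_speech(gen_task: asyncio.Task, tts_task: asyncio.Task,
                           flush_playout, resume_listening):
    """Cancels in-flight generation and synthesis the moment caller
    speech is detected."""
    async def cancel_everything():
        for task in (gen_task, tts_task):
            task.cancel()               # stop producing new text and audio
        await flush_playout()           # drop audio already queued to play
        await resume_listening()
    # Enforce the budget: a cancellation path slower than 100ms is a bug.
    await asyncio.wait_for(cancel_everything(), timeout=BARGE_IN_BUDGET_S)
```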
Network and carrier transit are the latency you cannot fix in code
Roughly 100–150ms of the perceived turn-take is consumed by audio crossing the network from your inference cluster to the telephony carrier and from the carrier to the caller's handset. This is not negotiable, which means the rest of the budget has to be tighter to compensate. Co-locating inference with the carrier's points of presence helps; running model inference on a different continent from your callers is fatal.
We measure carrier transit as part of the latency dashboard and route traffic to the inference region closest to the originating area code. The improvement is modest in absolute terms but reliably worth it on a tight budget.
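A deliberately simplified sketch of that routing decision; a real deployment would use the carrier's rate-center data rather than a hand-written table:

```python
# Illustrative area-code -> inference-region map; entries are assumptions.
REGION_BY_AREA_CODE = {
    "212": "us-east", "646": "us-east",
    "415": "us-west", "206": "us-west",
}
DEFAULT_REGION = "us-east"

def pick_inference_region(caller_number: str) -> str:
    """Routes a call to the region closest to its originating area code."""
    digits = "".join(ch for ch in caller_number if ch.isdigit())
    area_code = digits[-10:-7]   # NANP: last ten digits start with the area code
    return REGION_BY_AREA_CODE.get(area_code, DEFAULT_REGION)
```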
"We obsessed over model latency for two months and shaved 80ms. Then we audited end-of-utterance detection and found we were waiting 600ms of silence before doing anything. Fixed that in a week and the conversations finally felt like conversations."
— Voice infrastructure lead, AI Voice Agent deployment
Frequently asked
Why does turn-taking under 800ms matter for voice AI?
Below 800ms perceived turn-take, callers stop noticing the agent is synthetic and a conversation flows naturally. Above one second, callers start barging in, repeating themselves, and asking for a human. The threshold tracks human conversational pause distributions, which cluster between 200 and 600 milliseconds, with longer pauses signaling discomfort.
Where does most voice AI latency actually come from?
End-of-utterance detection, model first-token latency, and buffered TTS synthesis are the three biggest contributors. Naive VAD alone consumes 500–700ms before the model is even called. Buffered TTS adds 400–600ms after the model finishes. Streaming ASR, learned end-of-utterance detection, capped context, and streaming TTS each take a chunk out of the budget.
How is end-of-utterance detection different from voice activity detection?
Voice activity detection uses acoustic energy to decide if speech is happening. End-of-utterance detection uses prosody — intonation, sentence-final cadence, pause distribution — to decide if the speaker is done. EOU is faster and more accurate than waiting for sustained silence, which is the naive VAD approach. We get reliable EOU at 180–220ms with a 500ms backstop.
Does context size affect voice AI latency?
Yes, materially. First-token latency is dominated by prefill, which scales with context length. A 30,000-token context is meaningfully slower at first token than a 4,000-token one. Voice deployments cap conversation history, retrieved chunks, and system prompt aggressively. A long-context architecture works for batch tasks but not for sub-800ms conversations.
Can streaming TTS be added to an existing voice system?
Yes, but the architecture has to support it end-to-end. The text generation has to stream tokens to TTS, the TTS has to expose a streaming synthesis API, and the audio playback path has to handle progressive frames including barge-in cancellation. Retrofitting streaming TTS onto a buffered system often saves 300–500ms but requires rewriting the audio output pipeline.
How do you measure perceived latency in production?
Timestamps at every stage: end-of-utterance fired, ASR final returned, first model token emitted, first TTS audio frame produced, audio leaving the inference cluster, audio reaching the carrier, audio reaching the handset. Each segment is dashboarded with p50, p95, p99, and aggregate perceived turn-take. Regression in any segment is treated as a release blocker.