AI voice agents: When to build and when to skip

Key Takeaways

  • Production voice AI requires a streaming pipeline processing STT, LLM inference, and TTS simultaneously to achieve sub-500ms response times.

  • The STT-to-TTS pipeline has three latency bottlenecks: speech-to-text transcription, LLM reasoning, and text-to-speech synthesis - each must be optimized independently.

  • Users tolerate up to 800ms of silence before conversations feel unnatural; above that, completion rates drop significantly.

  • Handling interruptions (barge-in detection) and turn-taking naturally are harder engineering problems than the core speech pipeline.

AI voice agents handle phone calls, voice commands, and spoken interactions autonomously. Unlike text-based AI agents that process typed input, voice agents add two demanding constraints: sub-second latency and natural-sounding speech. Get either wrong and the conversation feels broken.

TL;DR

AI voice agents use a three-stage pipeline: speech-to-text (STT), LLM reasoning, and text-to-speech (TTS). The total round-trip must stay under 800ms to feel natural. Current top-performing systems achieve 400-600ms. The biggest technical challenges are reducing latency at each stage, handling interruptions (barge-in), and managing turn-taking. Voice AI is production-ready for structured conversations (appointment booking, order status) but still struggles with open-ended, emotionally complex calls.

Voice Agent Pipeline

Stage 1: Speech-to-Text (STT), 100-300ms

Audio is captured and converted to text using streaming providers like Deepgram (100-200ms), Google Speech-to-Text, or Whisper.

Stage 2: LLM Reasoning, 200-500ms

The transcribed text goes to the LLM, which generates a response based on conversation context, tools, and instructions. With streaming, text is sent to TTS as soon as the first tokens arrive.

Stage 3: Text-to-Speech (TTS), 100-200ms

The LLM response is converted to natural-sounding speech using ElevenLabs (highest quality), Cartesia (low latency), or PlayHT. Audio streams out as sentences are generated.

The voice pipeline

Every voice agent follows the same three-stage pipeline:

Stage 1: Speech-to-text (STT)

The user speaks. The audio is captured and converted to text. This stage takes 100-300ms depending on the provider and whether you use streaming.

Key providers: Deepgram (fastest, 100-200ms streaming), Google Speech-to-Text, Azure Speech Services, AssemblyAI, Whisper (open source, higher latency).

Optimization: Use streaming STT. Instead of waiting for the user to finish speaking, process audio in chunks as it arrives. Deepgram's streaming API starts returning partial transcripts within 100ms.
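The chunked approach can be sketched with a stub in place of a real client (the Deepgram websocket API is not shown here; `stt_partial` is a stand-in callable):

```python
def audio_chunks(pcm: bytes, chunk_ms: int = 100, sample_rate: int = 16000):
    """Yield 16-bit mono PCM in chunk_ms slices, as a streaming mic feed would."""
    step = sample_rate * 2 * chunk_ms // 1000
    for i in range(0, len(pcm), step):
        yield pcm[i:i + step]

def stream_transcribe(pcm: bytes, stt_partial):
    """Push chunks to a streaming STT callback and collect partial transcripts.

    stt_partial stands in for a real streaming client (e.g. a Deepgram
    websocket); here it is any callable chunk -> partial text or None.
    """
    partials = []
    for chunk in audio_chunks(pcm):
        text = stt_partial(chunk)
        if text:
            partials.append(text)  # partial result usable before speech ends
    return partials

# Demo with a stub STT that emits one word per 100ms chunk.
_words = iter(["hello", "world"])
partials = stream_transcribe(b"\x00" * 6400, lambda chunk: next(_words, None))
print(partials)
```

The point is structural: downstream stages can start consuming `partials` while the caller is still speaking, instead of waiting for a final transcript.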

Stage 2: LLM reasoning

The transcribed text goes to the LLM. The LLM generates a response based on the conversation context, tools, and instructions. This is the slowest stage - typically 200-500ms for the first tokens.

Optimization: Use streaming LLM responses. Start sending text to the TTS engine as soon as the first tokens arrive, not after the full response is generated. This overlaps Stage 2 and Stage 3.
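The overlap can be sketched as a sentence buffer between the token stream and TTS (`speak` is a stand-in for a streaming TTS call):

```python
import re

SENTENCE_END = re.compile(r"[.!?]\s*$")

def stream_to_tts(token_stream, speak):
    """Forward LLM output to TTS one sentence at a time.

    token_stream yields text fragments as the model produces them.
    Sending complete sentences (rather than the full reply) lets audio
    start playing while later sentences are still being generated.
    """
    buffer = ""
    for token in token_stream:
        buffer += token
        if SENTENCE_END.search(buffer):
            speak(buffer.strip())  # TTS starts on the first sentence
            buffer = ""
    if buffer.strip():             # flush any trailing partial sentence
        speak(buffer.strip())

spoken = []
tokens = ["Your ", "order ", "shipped.", " It ", "arrives ", "Friday."]
stream_to_tts(iter(tokens), spoken.append)
print(spoken)
```

Chunking at sentence boundaries is a common compromise: smaller chunks cut latency further but give the TTS engine less context for natural prosody.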

Stage 3: Text-to-speech (TTS)

The LLM's text response is converted to speech audio. Modern TTS engines produce natural-sounding speech with emotion and pacing.

Key providers: ElevenLabs (highest quality), Cartesia (low latency), PlayHT, Azure Neural Voices, Google WaveNet.

Optimization: Stream TTS output. Start playing audio from the first sentence while later sentences are still being generated.

Total pipeline latency

| Stage | Standard | Optimized (streaming) |
| --- | --- | --- |
| STT | 300-500ms | 100-200ms |
| LLM | 500-2,000ms | 200-400ms (first tokens) |
| TTS | 200-500ms | 100-200ms (first audio) |
| Total | 1,000-3,000ms | 400-800ms |

The optimized pipeline achieves latency comparable to human conversation pauses (300-700ms between turns).
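The arithmetic behind the table: non-streaming latency is the sum of complete-output times, while the streaming number that matters is time to first audio. A rough model with illustrative midpoint values:

```python
# Illustrative stage latencies in ms (midpoints of the ranges above).
stt_full, llm_full, tts_full = 400, 1000, 350        # non-streaming, complete output
stt_partial, llm_first, tts_first = 150, 300, 150    # streaming, first usable output

# Non-streaming: stages run back to back on complete outputs.
sequential = stt_full + llm_full + tts_full

# Streaming: each stage hands off as soon as it has its first usable
# output, so time-to-first-audio is the sum of the first-output delays.
time_to_first_audio = stt_partial + llm_first + tts_first

print(sequential, time_to_first_audio)  # prints 1750 600
```

The rest of the response keeps generating while the first audio plays, which is why users perceive the lower number.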



McKinsey's research on generative AI in customer service found that companies deploying AI voice tools saw a 14% improvement in issue resolution per hour and a 9% reduction in average handle time. Gains like these hold up only when pipeline latency stays low at every stage.

Handling interruptions (barge-in)

Humans interrupt each other constantly. A voice agent must handle interruptions gracefully:

  1. Detect the interruption: Monitor the audio input while the agent is speaking. When the user starts talking, the agent should stop.
  2. Stop playback: Immediately cease the current TTS output.
  3. Process the interruption: Send the new user speech through the STT pipeline.
  4. Discard unspoken text: If the LLM generated text that wasn't spoken yet, decide whether to discard it or save it for later.
  5. Respond to the new input: Generate a new response that acknowledges the context shift.
Barge-in detection is one of the hardest problems in voice AI. False positives (background noise triggering interruption) break the conversation flow. False negatives (missing a real interruption) make the agent seem unresponsive.
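The five steps above can be sketched as a small controller. The voice-activity detector (VAD), TTS player, and STT pipeline are injected as stand-in callables here; real systems wire these into the audio stack:

```python
class BargeInController:
    """Minimal barge-in handling: stop playback when the user speaks.

    stop_playback and on_user_audio are stand-ins for the real TTS player
    and STT pipeline. pending_text models LLM output generated but not
    yet spoken.
    """

    def __init__(self, stop_playback, on_user_audio):
        self.stop_playback = stop_playback
        self.on_user_audio = on_user_audio
        self.agent_speaking = False
        self.pending_text = ""

    def start_speaking(self, text):
        self.agent_speaking = True
        self.pending_text = text

    def user_audio(self, chunk, is_speech: bool):
        # Step 1: only VAD-confirmed speech while the agent is talking counts
        # as barge-in; this gate is what keeps noise from triggering it.
        if is_speech and self.agent_speaking:
            self.stop_playback()          # step 2: cut TTS immediately
            self.agent_speaking = False
            self.pending_text = ""        # step 4: discard unspoken text
        if is_speech:
            self.on_user_audio(chunk)     # step 3: route speech to STT

events = []
ctrl = BargeInController(lambda: events.append("stop"),
                         lambda chunk: events.append(("stt", chunk)))
ctrl.start_speaking("Your appointment is confirmed for Tuesday at three.")
ctrl.user_audio(b"\x01", is_speech=False)  # background noise: ignored
ctrl.user_audio(b"\x02", is_speech=True)   # real interruption
print(events)
```

Steps 3 and 5 (re-transcribing and responding to the new input) then run through the normal pipeline; the hard part, as noted above, is making the `is_speech` decision reliable.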

Turn-taking

Knowing when the user has finished speaking is surprisingly difficult. Silence alone isn't a reliable signal - people pause mid-sentence to think.

Approaches to end-of-turn detection:

  • Silence duration: Wait for 500-700ms of silence. Simple but causes awkward pauses.

  • Prosodic cues: Detect falling pitch and slowing tempo that signal sentence endings. More natural but harder to implement.

  • Semantic analysis: Use the STT transcript to predict whether the sentence is complete. Most accurate but adds latency.

  • Hybrid: Combine silence detection with semantic completeness checking. Best results in practice.

The ideal system adapts its turn-taking behavior to the conversation. During rapid exchanges, use shorter silence thresholds. During complex questions, wait longer.
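The hybrid approach can be sketched as follows; the completeness check here is a toy heuristic, where production systems typically use a small trained classifier on the partial transcript:

```python
def looks_complete(transcript: str) -> bool:
    """Toy semantic-completeness check on a partial transcript."""
    t = transcript.strip()
    if not t:
        return False
    if t.endswith((".", "!", "?")):
        return True
    # Trailing connectives and fillers suggest the speaker isn't done.
    return t.split()[-1].lower() not in {"and", "but", "so", "because", "um"}

def end_of_turn(transcript: str, silence_ms: int,
                short_threshold: int = 300, long_threshold: int = 800) -> bool:
    """Hybrid end-of-turn: commit quickly when the text looks complete,
    wait longer when it looks like a mid-sentence thinking pause."""
    if silence_ms >= long_threshold:
        return True  # long silence ends the turn regardless of content
    return silence_ms >= short_threshold and looks_complete(transcript)
```

Adapting to conversation rhythm, as described above, amounts to tuning `short_threshold` and `long_threshold` per use case rather than hard-coding one value.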

"Turn-taking is deceptively hard. In testing, we found that a 600ms silence threshold worked well for appointment booking but felt awkward in sales qualification calls where users pause to think. The system needs to learn the rhythm of each conversation type - there's no single threshold that works for all use cases." - RaftLabs Engineering Team

Voice quality and personality

"Latency is the make-or-break metric in voice AI. We've seen demos that looked impressive at 600ms fall apart in production at 1,200ms because nobody optimized the STT-to-LLM handoff. Get the pipeline right before you worry about voice selection or personality." - Ashit Vora, Captain at RaftLabs

The voice your agent uses defines its personality. Consider:

Voice selection: Choose a voice that matches your brand and use case. A healthcare appointment system needs a calm, reassuring voice. A sales agent needs an energetic, warm voice.

Pacing and prosody: Control speaking speed, pause duration, and emphasis. Slower for important information. Faster for routine confirmations. Pauses before key numbers.
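Most cloud TTS engines (Azure Neural Voices, Google) accept SSML for this kind of control; exact tag support varies by provider, so treat this as an illustrative payload rather than a universal one:

```python
def confirm_ssml(amount: str, rate: str = "medium") -> str:
    """Build an SSML confirmation that pauses before, and slows down for,
    a key number, per the pacing guidance above."""
    return (
        "<speak>"
        f'<prosody rate="{rate}">Your total comes to</prosody>'
        '<break time="300ms"/>'
        f'<prosody rate="slow">{amount}</prosody>.'
        "</speak>"
    )

print(confirm_ssml("$42.50"))
```

The `<break>` before the amount and the slower `<prosody>` rate on it are the two levers that matter most for numbers, addresses, and confirmation codes.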

Emotional range: Modern TTS engines support emotional modulation. The agent should sound empathetic during complaints, confident during problem resolution, and warm during greetings.

Consistency: Use the same voice across all interactions. Switching voices between calls breaks trust.

"Voice selection sounds like a minor detail until you hear the wrong voice on a 10,000-call-per-day system. We've seen clients lose trust in an AI that solved every problem but sounded robotic doing it. The voice is the product - it's what customers remember." - RaftLabs Engineering Team

Production considerations

Call recording and compliance

Record all calls with appropriate disclosure. Many jurisdictions require informing callers they're speaking with AI. Build the disclosure into the greeting: "Hi, this is an AI assistant from [Company]. How can I help you?"

Fallback to human

Always provide a path to a human agent. "Let me connect you with a team member" should be triggered by: low confidence in understanding, user frustration signals, complex requests beyond the agent's scope, or explicit user request.
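Those triggers collapse into a single check; the threshold values below are illustrative placeholders to be tuned against real call outcomes:

```python
def should_escalate(stt_confidence: float,
                    frustration_score: float,
                    in_scope: bool,
                    asked_for_human: bool) -> bool:
    """Return True when the call should hand off to a human agent."""
    if asked_for_human:                 # explicit user request always wins
        return True
    if not in_scope:                    # request beyond the agent's scope
        return True
    if stt_confidence < 0.6:            # low confidence in understanding
        return True
    return frustration_score > 0.7      # frustration signals (tone, repeats)
```

Run the check on every turn, not just at call start, since frustration and scope problems usually surface mid-conversation.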

Telephony integration

Voice agents connect to phone systems via SIP trunking or WebRTC. Providers like Twilio, Vonage, and Retell handle the telephony layer. Your AI agent handles the conversation logic while the telephony provider handles the transport.
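With Twilio, for example, the handoff is a TwiML response that forks call audio to your agent over a websocket. The URL below is a placeholder; check Twilio's Media Streams documentation for the exact message contract on that socket:

```python
from xml.etree.ElementTree import Element, SubElement, tostring

def media_stream_twiml(ws_url: str) -> str:
    """Build a TwiML response that connects call audio to an agent
    endpoint over a websocket (Twilio Media Streams style)."""
    response = Element("Response")
    connect = SubElement(response, "Connect")
    SubElement(connect, "Stream", url=ws_url)
    return tostring(response, encoding="unicode")

print(media_stream_twiml("wss://agent.example.com/audio"))
```

Your webhook returns this XML when a call arrives; from then on the telephony provider pushes audio frames to the websocket and your pipeline owns the conversation.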

Cost per call

  • STT: $0.005-0.02 per minute

  • LLM: $0.01-0.10 per call (depending on conversation length and model)

  • TTS: $0.01-0.05 per minute

  • Telephony: $0.01-0.05 per minute

  • Total: $0.05-0.25 per minute, or $0.25-1.50 for a typical 5-minute call
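Plugging the component rates above into a quick calculator shows how the per-call cost compounds (note that LLM cost here is billed per call while the rest scale per minute):

```python
def call_cost(minutes: float, stt_per_min: float, llm_per_call: float,
              tts_per_min: float, tel_per_min: float) -> float:
    """Per-call cost: per-minute rates scale with duration; LLM is per call."""
    return minutes * (stt_per_min + tts_per_min + tel_per_min) + llm_per_call

# Low and high ends of the component ranges above, for a 5-minute call.
low = call_cost(5, 0.005, 0.01, 0.01, 0.01)
high = call_cost(5, 0.02, 0.10, 0.05, 0.05)
print(round(low, 3), round(high, 2))
```

The component rates alone land well under a dollar for a 5-minute call; real deployments add overhead (retries, logging, orchestration compute) on top of these raw API rates.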

Compare to human agent cost of $1-3 per minute (fully loaded), and the economics are compelling for high-volume use cases. Gartner predicts conversational AI will reduce contact center agent labor costs by $80 billion by 2026 - and voice is the channel where the bulk of that shift happens. This is why AI customer service agents using voice are growing faster than text-only alternatives.


Where voice AI works today

A December 2024 Gartner survey found 85% of customer service leaders plan to explore or pilot customer-facing conversational GenAI in 2025, with 44% specifically exploring voicebots. Adoption is moving from experimentation to production deployment.

High-confidence use cases:

  • Appointment scheduling and confirmation

  • Order status and tracking

  • Restaurant reservations

  • Payment reminders

  • Survey collection

  • After-hours call handling

Emerging use cases:

  • Technical support with troubleshooting flows

  • Sales qualification calls

  • Insurance claims intake

  • Healthcare symptom triage

Not ready yet:

  • Emotionally sensitive calls (collections, bad news delivery)

  • Complex negotiations

  • Unstructured conversations with no clear goal

Gartner predicts agentic AI will autonomously resolve 80% of common customer service issues without human intervention by 2029 - including voice interactions. The "not ready yet" list will shrink significantly over the next 24 months.

Voice AI is advancing fast. The latency and quality gaps are closing each quarter. For structured, high-volume conversations, it is production-ready today.

At RaftLabs, we have built production voice agents handling thousands of daily calls across hospitality and fintech. The pattern that works: start with a structured use case (appointment booking, order status), nail sub-500ms latency, then expand scope. Our AI agent development team ships voice-capable agents in 12-week sprints, with latency optimization as a core engineering focus from day one.

Frequently Asked Questions

Who builds production AI voice agents?

RaftLabs has built production voice agents handling thousands of daily calls with sub-500ms latency across fintech and hospitality. We handle the full pipeline: STT selection, LLM optimization, TTS integration, barge-in detection, and telephony. 100+ AI products shipped in 12-week sprints.

How do AI voice agents work?

AI voice agents use a three-stage pipeline: speech-to-text (STT) converts audio to text, an LLM processes the text and generates a response, and text-to-speech (TTS) converts the response back to audio. Production systems stream all three stages simultaneously to achieve sub-500ms response times.

How fast does an AI voice agent need to respond?

Production voice agents need sub-500ms end-to-end response time for natural conversation flow. Anything above 800ms causes users to talk over the AI, derailing conversations and dropping completion rates. Achieving this requires a streaming pipeline and careful turn-taking detection.

What are the hardest problems in building voice agents?

The three biggest challenges are latency optimization across the STT-LLM-TTS pipeline, barge-in detection (handling when users interrupt the AI mid-sentence), and natural turn-taking management. Users speak more directly and less patiently to AI, requiring thorough edge case handling.

How much does an AI voice agent cost per call?

Total cost is $0.05-0.25 per minute, or $0.25-1.50 for a typical 5-minute call. This includes STT ($0.005-0.02/min), LLM inference ($0.01-0.10/call), TTS ($0.01-0.05/min), and telephony ($0.01-0.05/min). Compare to $1-3 per minute for human agents.
