Best solution for building a real-time voice-to-voice AI agent for phone calls?
Posted by SignatureHuman8057@reddit | LocalLLaMA | 13 comments
Hi everyone,
I’m working with a customer who wants to deploy an AI agent that can handle real phone calls (inbound and outbound), talk naturally with users, ask follow-up questions, detect urgent cases, and transfer to a human when needed.
Key requirements:
- Real-time voice-to-voice (low latency, barge-in)
- Natural multi-turn conversations (not IVR-style)
- Ability to ask the right questions before answering
- Support for complex flows (qualification, routing, escalation)
- Ability to call custom tools or connect to an MCP client (to query internal systems, schedules, databases, etc.)
- Works at scale (thousands of minutes/month)
- Suitable for regulated industries (e.g. healthcare)
- Cost efficiency matters at scale
For those who’ve built or deployed something similar:
What’s the best approach or platform you’d recommend today, and why?
Would you go with an all-in-one solution or a more custom, composable stack?
Thanks in advance for your insights!
ExtremeAd9038@reddit
Noted
Shoddy-Individual-16@reddit
I've been looking into this lately, and for anyone trying to build a real-time voice agent without the lag, low latency is everything. If you're building for a production environment where you need to handle real business outcomes like lead qualification or support, it's worth checking out Nuplay by Nurix.ai.
They have a proprietary engine specifically for natural, real-time conversations that feels way smoother than a lot of the standard wrappers out there. It integrates directly with your existing knowledge base and CRM, which saves a ton of time on the backend orchestration.
Definitely a solid option if you're moving past the demo phase and need something enterprise-grade.
Missbutterscotchh@reddit
For real-time calls, the tough part is not selecting the model - it's making the conversation feel smooth, handling interruptions, and making sure the actions will actually work. You also want the flexibility to switch speech tools without rebuilding everything. We open-sourced Unpod as a modular voice AI orchestration layer that might help as a starting point: https://github.com/parvbhullar/unpod
Parker2010SEO@reddit
I recommend - https://console.neyox.ai/ (FREE 30 Credits to test)
The best price: only $0.10/min with your own telephony (Telnyx/Twilio), so you are always in complete control.
To check voice, latency and response quality - https://www.youtube.com/@NeyoxAI/shorts
Good luck.
Neyox AI
thought_provoking27@reddit
For healthcare specifically, stability is more important than just raw speed, though you need both. I've built similar flows and found that stitching together Deepgram and Twilio manually gets messy when you have to handle 'barge-ins' (users interrupting the bot). Retell AI has been the most reliable for us in production because their interruption logic is way more aggressive than standard VADs. It stops speaking the instant the user starts, which prevents that awkward 'two robots talking over each other' issue.
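The aggressive interruption behavior described above can be sketched as a small state machine, independent of any specific platform: the instant voice activity is detected while the bot is speaking, TTS playback is cancelled rather than waiting for end-of-utterance. This is a minimal illustration; the class and method names (`VoiceSession`, `on_vad_speech_start`) are hypothetical, not from any of the products mentioned.

```python
class VoiceSession:
    """Tracks whether the bot is speaking and cancels TTS on barge-in."""

    def __init__(self):
        self.bot_speaking = False
        self.cancelled_utterances = 0

    def start_bot_utterance(self):
        self.bot_speaking = True

    def finish_bot_utterance(self):
        self.bot_speaking = False

    def on_vad_speech_start(self):
        """Called the instant VAD detects user speech."""
        if self.bot_speaking:
            # Stop TTS playback immediately instead of finishing the
            # sentence -- this is what prevents the "two robots
            # talking over each other" effect.
            self.bot_speaking = False
            self.cancelled_utterances += 1
            return "cancel_tts"
        return "listen"
```

In a real pipeline, `on_vad_speech_start` would be wired to the VAD callback of whatever speech stack you use, and `"cancel_tts"` would flush the audio output buffer, not just flip a flag.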
Puzzleheaded-Rip2411@reddit
If you’re targeting regulated workflows, I’d avoid all-in-one “black box” agents and build a composable stack you can instrument end to end. Twilio for PSTN, Pipecat as the router, Deepgram for low-latency STT/TTS, then keep the LLM on a tight action contract so it can’t freestyle scheduling, triage, or policy answers. The voice is the easy part; the hard part is auditability, retries, and safe handoff when confidence drops. What’s the single highest-risk action you need this agent to do: scheduling, triage, or accessing internal records?
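A "tight action contract" like the one described above can be as simple as an explicit allowlist of tool calls, each validated before execution, with everything else escalating to a human. This is a hedged sketch under assumed tool names (`schedule_callback`, `transfer_to_human`, `lookup_order` are examples, not from any product mentioned in this thread):

```python
# Only actions in this allowlist may execute; each declares the
# arguments it requires. Anything else falls through to a human.
ALLOWED_ACTIONS = {
    "schedule_callback": {"required": {"phone", "time_slot"}},
    "transfer_to_human": {"required": {"reason"}},
    "lookup_order": {"required": {"order_id"}},
}

def validate_action(action: dict) -> tuple[bool, str]:
    """Return (ok, reason). Reject any call outside the contract."""
    name = action.get("name")
    spec = ALLOWED_ACTIONS.get(name)
    if spec is None:
        return False, f"unknown action '{name}', escalating to human"
    missing = spec["required"] - set(action.get("args", {}))
    if missing:
        return False, f"missing args {sorted(missing)}, escalating to human"
    return True, "ok"
```

The point of the contract is auditability: every rejected call is a logged, explainable event rather than the model silently improvising.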
Ok-Register3798@reddit
If the goal is production-grade voice AI with minimal ops, I’d skip frameworks that are heavy on DIY setup.
Agora’s Conversational AI Engine checks every box you listed and is much easier to deploy than Pipecat.
I saw some responses mentioning Pipecat; while it is flexible, you pay for that flexibility with infra, tuning, and ongoing maintenance. Agora gives you better conversational control without self-hosting and tuning your own voice infra.
If speed to production and reliability matter, go all-in-one here.
YakEnvironmental792@reddit
This matches what a lot of teams run into once they go beyond demos.
Real-time barge-in, tool calls, and handoffs tend to break all-in-one platforms, and pipeline-level control matters more than isolated STT or latency metrics.
Some teams are moving toward more composable setups for predictable behavior (e.g. TEN): https://github.com/ten-framework/ten-framework
JackfruitElegant257@reddit
this is a surprisingly tough stack to get right in practice, way more than just stitching whisper + tts + an llm. latency alone killed my first few attempts.
we ended up building something similar last year for appointment reminders in a clinic, and the real trick was getting barge-in and natural flow without that awkward "ai pause" that makes callers hang up. using a local model helped a bit with privacy but introduced its own pipeline headaches, especially around tool calling to their emr system via mcp.
honestly, after months of tuning, i kinda gave up on the fully custom route and moved to something more managed. btw i stumbled on coordinatehq recently, they handle the voice agent side with surprisingly low latency, and it plugs into their project system for context. it's not a pure dev platform, but for client-facing call workflows it took the infra burden off.
if you're deep in regulated data though, double-check their compliance. for full control you might still need to assemble your own stack with something like vocode or twilio + a local llm, but be ready for a latency/scale slog.
Over-Air-17@reddit
superu has provisions for phone numbers and different voices and accents, with really low latency, and it's very affordable.
hackyroot@reddit
We recently hosted a webinar on how to build a voice agent using Pipecat and Simplismart.ai (full disclosure: I work here). Happy to share the webinar recording if you're interested.
We were able to get ~400ms latency, and Pipecat + Simplismart fulfills all of the requirements you shared above.
I prefer a composable stack as it gives me the freedom to choose the best model for each modality. Qwen Omni is also a compelling model, as it supports a voice-to-voice pipeline, though inference for it is not widely supported.
PermanentLiminality@reddit
Twilio for the phone, and look into Pipecat for the rest. Pipecat can use many speech-to-text and text-to-speech providers, and just about any LLM API. I'm using Deepgram for speech to text and text to speech. It can use ElevenLabs too. I've used OpenAI, Google, and local LLMs.
I've even used the voice enabled models from Google and OpenAI.
You do need to write some python to glue it into your systems.
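The glue code mentioned above is usually a thin dispatch layer: a registry mapping tool-call names emitted by the LLM to Python functions that hit your internal systems. A framework-agnostic sketch follows; the tool name `check_schedule` and its canned response are hypothetical stand-ins for a real internal API.

```python
import json

TOOL_REGISTRY = {}

def tool(name):
    """Decorator registering a function as a callable tool."""
    def wrap(fn):
        TOOL_REGISTRY[name] = fn
        return fn
    return wrap

@tool("check_schedule")
def check_schedule(date: str) -> dict:
    # In production this would query your internal scheduling system;
    # here it just returns a canned response for illustration.
    return {"date": date, "slots": ["09:00", "14:30"]}

def dispatch(tool_call_json: str) -> dict:
    """Route a serialized tool call from the LLM to its handler."""
    call = json.loads(tool_call_json)
    fn = TOOL_REGISTRY.get(call["name"])
    if fn is None:
        return {"error": f"no such tool: {call['name']}"}
    return fn(**call.get("args", {}))
```

Keeping the registry explicit also gives you one obvious place to add logging and retries, which matters more than the framework choice in regulated deployments.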
ArticleSignal680@reddit
Seconding Pipecat, been using it for a similar project and it's solid. The modular approach is clutch when you need to swap out STT/TTS providers or tune for specific use cases. Deepgram's latency is pretty unbeatable for the price point too