Best real-time speech-to-speech model?
Posted by ffinzy@reddit | LocalLLaMA | 29 comments
We've been using Unmute, and it's the best open-source real-time STT -> LLM -> TTS system that I know of so far.
Now we're looking for a more accurate STT while maintaining real-time speed and high throughput. Ideally the model would be speech-to-speech directly, so the AI can give feedback on the input voice itself and not just on the transcription.
We want to try Qwen3-Omni, but AFAIK there's no speech-to-speech support in vLLM yet. There's a hosted model, but we want to use the open-source version if possible.
We're building a free real-time AI app for people to practice their English speaking skills.
Rough_Lifeguard_3048@reddit
Running Gemini Live native audio in production for ~6 months (voice-first founder discovery, not English practice but same S2S constraint). Three reasons hybrid STT→LLM→TTS will hurt you specifically:
1. Pronunciation feedback is impossible with hybrid. STT gives you text — all phonetic info is gone before LLM sees it. S2S models hear the actual phonemes. For English practice this is the whole point.
2. Latency stacks. Hybrid = 3 models in series, ~800-1200ms even tuned. Native S2S = 300-500ms. Below 600ms users feel "conversation"; above 800ms they feel "command." (See the rough budget sketch after this list.)
3. Backchannels + barge-in. Native S2S handles "uh-huh", interruptions, hesitation natively. Hybrid needs custom VAD + orchestration and still feels robotic.
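A back-of-the-envelope way to see the stacking, treating the stage numbers above as illustrative placeholders rather than benchmarks:

```python
# Per-turn latency: hybrid stages run in series, so their budgets add up.
hybrid_ms = {
    "stt_final_transcript": 300,  # endpointing + final STT hypothesis
    "llm_first_token": 350,       # LLM time-to-first-token
    "tts_first_audio": 250,       # TTS time-to-first-audio-chunk
}
native_s2s_ms = 400               # one model, audio in -> audio out

print(f"hybrid: {sum(hybrid_ms.values())} ms")  # ~900 ms, inside the 800-1200 range
print(f"native: {native_s2s_ms} ms")            # inside the 300-500 range
```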
Open-source S2S worth checking: Moshi (Kyutai) — fully open, ~200ms, decent English. GLM-4-Voice also solid. Qwen3-Omni S2S support is coming to vLLM but not stable yet.
Heads up if you do try Gemini Live:
the proactive_audio flag silently broke text-triggered greetings in early March — costs nothing to disable, saves a week of debugging
International-Poem64@reddit
At the moment we use AWS nova-2-sonic. It's paid, but it's the only one that offers speech-to-speech conversation, and it's very good.
ComprehensiveLog7437@reddit
Hello, what is the average response time you have achieved with unmute? Thanks in advance.
sistsalcedo@reddit
2 to 3 seconds
phhusson@reddit
I'm in the same boat. I did cool (imo) demos with unmute with function calling and fillers, but the STT is really not great (one reason being that it doesn't have any world knowledge — it probably doesn't know what Minecraft is).
I've started hooking a good old Whisper into unmute (basically it keeps Kyutai's STT as a semantic VAD and for pre-warming the LLM's KV cache, but the actual transcript comes from Whisper). I haven't finished, but it looks promising.
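A minimal sketch of that arrangement, assuming you can get the end-of-turn event and the turn's audio buffer out of unmute (the `on_turn_end` hook here is hypothetical, not an unmute API); the transcription call follows the faster-whisper package:

```python
import numpy as np
from faster_whisper import WhisperModel

# Whisper produces the "real" transcript; Kyutai's STT stays as semantic VAD.
whisper_model = WhisperModel("large-v3", device="cuda", compute_type="float16")

def on_turn_end(audio: np.ndarray) -> str:
    """Re-transcribe a finished turn (16 kHz mono float32) with Whisper."""
    segments, _info = whisper_model.transcribe(audio, language="en", beam_size=5)
    return " ".join(seg.text.strip() for seg in segments)
```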
I'm rather optimistic about Qwen3-Omni, though yeah, it requires writing a lot of code: first the whole interaction/rendering layer on top of the model, and it looks like it even requires fixing the model's code in Hugging Face's transformers (it doesn't support streaming, and it's slow in the Python sections) -- and I would much rather have someone other than me do that.
Double-Lavishness870@reddit
Interesting ideas. How was this going in the meantime?
I am looking for STT alternatives because of the missing world knowledge and language support. I am hacking stuff together with Conformer and smart-Turn-V3, but I am not happy yet.
ffinzy@reddit (OP)
Sorry for the late reply, see my comment here.
ffinzy@reddit (OP)
Your approach is interesting. If you’re open to it, please keep me posted on your progress.
Instead of replacing the default Unmute STT, I've been considering running a second pass on the audio input with a good old Whisper.
The idea is that the default STT would mainly be used for real-time interactivity and instant responses, while in the background Whisper feeds transcription corrections to the LLM.
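A rough sketch of that two-pass idea, assuming the fast transcript and the LLM chat history are available to you; the Whisper call follows the openai-whisper package, everything else is a placeholder:

```python
import asyncio
import whisper

slow_model = whisper.load_model("small")  # slower but more accurate second pass

async def correction_pass(audio_path: str, fast_transcript: str, llm_history: list):
    # Run the blocking Whisper inference off the event loop.
    result = await asyncio.to_thread(slow_model.transcribe, audio_path)
    corrected = result["text"].strip()
    if corrected and corrected != fast_transcript:
        # The LLM sees the correction on its next turn.
        llm_history.append({
            "role": "system",
            "content": f"Corrected transcript of the user's last turn: {corrected}",
        })
```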
Double-Lavishness870@reddit
Same here, as in my comment on the initial post. How is the approach going?
GrabbenD@reddit
u/ffinzy Any updates?
ffinzy@reddit (OP)
No update. We had a very weird, intermittent problem with Unmute where the AI wouldn't recognize any speech for a couple of minutes. We spent almost a month of our free time diagnosing the issue and haven't found the root cause. We're not sure whether the problem is in the codebase or in our hardware/network.
Since then, we've started to invest in other frameworks, particularly self-hosting LiveKit. The nice thing about that is they have SDKs for React Native/Swift/Kotlin, so we're very tempted to just switch platforms altogether.
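For reference, a minimal voice agent in the livekit-agents Python framework looks roughly like this (adapted from their published examples; plugin choices and exact names vary by version, and the OpenAI-compatible plugins can be pointed at self-hosted endpoints):

```python
from livekit.agents import JobContext, WorkerOptions, cli
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import openai, silero

async def entrypoint(ctx: JobContext):
    await ctx.connect()
    # Any stage can be swapped for a self-hosted model behind a compatible API.
    agent = VoicePipelineAgent(
        vad=silero.VAD.load(),
        stt=openai.STT(),
        llm=openai.LLM(),
        tts=openai.TTS(),
    )
    agent.start(ctx.room)

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```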
Vegetable-Media-5999@reddit
We have been experimenting with real-time speech-to-speech setups where turn detection and latency matter more than the individual models.
An open-source framework we use focuses on streaming audio, explicit turn handling, and session memory, so you can swap the STT, LLM, and TTS as needed.
Might be useful for your use case
https://github.com/ten-framework/ten-framework
Miserable-Dare5090@reddit
The guy behind MLX-Audio recently released a small, fast TTS model that might serve your needs. I am personally waiting for an STT or SALM/ALM that recognizes speakers. The open-source pyannote is an unsupported pain.
fullouterjoin@reddit
https://github.com/Blaizzy/mlx-audio
Miserable-Dare5090@reddit
no speaker diarization for STT
fullouterjoin@reddit
https://github.com/jfgonsalves/parakeet-diarized (uses pyannote)
https://github.com/pyannote/pyannote-audio only 22 issues and 18 pull requests, doesn't look toooooo horrible? (Minimal usage sketch below.)
Oh ... I see they have a paid thing https://www.pyannote.ai/ so they aren't going to want the OSS pyannote to get good. Lame.
https://github.com/FluidInference/FluidAudio
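For what it's worth, the open-source pyannote.audio pipeline is only a few lines to try (3.x API; the diarization pipeline is gated, so it needs a Hugging Face token):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # HF token with access to the gated model
)
diarization = pipeline("audio.wav")

# Speaker-labeled segments: who spoke when.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
```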
Miserable-Dare5090@reddit
Yes, as I noted, pyannote audio went private with Argmax, FluidAudio's implementations are not yet diarizing well, and the diarized Parakeet Python program by jfgonsalves is not compiling for me. Paid options from Argmax are the best solution right now, but they're not open source.
YessikaOhio@reddit
I know this isn't what you're looking for, but I'm sure people will find this post just wanting STT to LLM to TTS. I set up a Whisper → local LLM → Kokoro chain for simple speech-to-speech (rough sketch after the link below). It's not what you're asking for, but nothing I found was easy to use or set up, so I made something I could use.
I wish there was a simple TTS that could understand how you are talking, not just the words you are saying. That would be awesome.
https://www.reddit.com/r/LocalLLaMA/comments/1numy9a/im_sharing_my_first_github_project_real_ish_time/
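That chain is only a few calls end to end; a sketch assuming openai-whisper and the kokoro package, with a local OpenAI-compatible LLM server (the URL and model name are placeholders):

```python
import whisper
import soundfile as sf
from openai import OpenAI
from kokoro import KPipeline

stt = whisper.load_model("base")
llm = OpenAI(base_url="http://localhost:8080/v1", api_key="none")  # local server
tts = KPipeline(lang_code="a")  # 'a' = American English

text = stt.transcribe("user.wav")["text"]
reply = llm.chat.completions.create(
    model="local-model",  # placeholder for whatever the server hosts
    messages=[{"role": "user", "content": text}],
).choices[0].message.content

# Kokoro yields (graphemes, phonemes, audio) chunks at 24 kHz.
for i, (_, _, audio) in enumerate(tts(reply, voice="af_heart")):
    sf.write(f"reply_{i}.wav", audio, 24000)
```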
SOCSChamp@reddit
Has nobody gotten Qwen3-Omni working for this yet? I feel like this is the main use case I was waiting for, but I haven't seen live speech-to-speech demonstrated.
ffinzy@reddit (OP)
This is what I’m waiting for as well. It’s why I started this thread.
Normal-Ad-7114@reddit
We have yet to see this kind of sorcery
nickless07@reddit
Qwen2.5/3 Omni?
ffinzy@reddit (OP)
Well, I said it because it’s impossible to do that with an STT → LLM → TTS system.
dinerburgeryum@reddit
Yeah not in the open source space, which really stinks. Wish I had the time to put one together tbh.
favonius_@reddit
“Moshi,” by the same authors as unmute, is the only one I’m aware of. It’s impressive that its novel design works at all, but it’s a year old now, and I don’t think it ever matched the intelligence of simply running a fast LLM in the STT/TTS setup you described.
ffinzy@reddit (OP)
I’m curious about the Qwen3-Omni, but I’m not sure about the throughput and the real-time aspect for speech-to-speech.
Good to know that Moshi/Unmute is the best OSS solution that we have right now.
AmIDumbOrSmart@reddit
No good ones. Conversational TTS like Sesame 1.5b and Orpheus are the last ones I remember, but they're pretty heavy and far from real time, despite their jank.
If you just want fast and quality, probably Kokoro is your best bet. It's not smart but at least it sounds nice and is fast on any decent gpu.
ffinzy@reddit (OP)
Thanks. I still remember the pain of being jebaited by Sesame.
AmIDumbOrSmart@reddit
Even if they released it, it would be a massive 7B model and would need several H100s on a high-speed link to run in real time. We also have VibeVoice Large now, which rivals it somewhat in quality, but again it requires 24GB of VRAM and takes like 20-40 seconds to render. Though it does have streaming, and with SageAttention you can get the wait down to like ~7-10 seconds or so.