Is real-time voice-to-voice still science fiction?
Posted by junior600@reddit | LocalLLaMA | 21 comments
Hi everyone, as the title says: is it possible to have real-time voice-to-voice interaction running locally, or are we still not there yet?
I'd like to improve my speaking skills (including pronunciation) in English and Japanese, and I thought it would be great to have conversations with a local LLM.
It would also be nice to have something similar in Italian (my native language) for daily chats, but I assume it's not a very "popular" language to train on. lol
Double_Cause4609@reddit
"Science fiction" is a bit harsh.
It's also not a binary [yes/no] question; it's more of a spectrum.
For instance, does it count if you can do real-time voice-to-voice with 8xH100? That can be "local": you can download the model... it's just really expensive.
Similarly, what about quality? You might get a model running in real time, but with occasional hallucinations or artifacts, and those aren't things you want to pick up unintentionally while practicing pronunciation.
I'd say we're probably 60-70% of the way to real-time, accessible speech-to-speech models for casual conversation, and probably about 20-40% of the way to models of such quality and meta-cognition (with the ability to reflect on their own outputs for educational purposes, be aware of their inflections, etc.) that you would want to use them extensively for language learning.
It'll take a few more advancements, but we already know the way there; we just have to implement it.
Notably, as soon as someone trains a speculative decoding head for any of the existing speech models, that's probably what we need to make this mostly viable, though a diffusion speech-to-speech model would probably be ideal.
I'd say we're maybe about a year out (at most) from real time speech to speech (with possibly some need to customize the pipeline to your needs and available hardware).
So, not quite 100% of the way there, but calling it science fiction isn't quite fair when all the tools are already there and just need to be put together in the right order.
GrungeWerX@reddit
While I agree with your second percentage, your first is ridiculous. I speak to AI STS every day and it's pretty darn near talking to a real person. The only problem is persistent memory and personality consistency, which has led me to try building my own agentic AI system to address these issues. But as far as voice itself, we're more like 90% there, maybe slightly higher, if you're using ElevenLabs in a local pipeline.
Now, if you're only rating open-source voice models, then I would agree with you. I only referenced ElevenLabs because you CAN use it in a local pipeline with your own LLMs.
rainbowColoredBalls@reddit
Unrelated, but what's the SOTA on tokenizing voice without going through the STT route?
guigouz@reddit
The speech-to-text part works in Open WebUI (not sure which lib they use), but you can try Whisper for the transcription and Coqui TTS for the responses.
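A minimal sketch of that pipeline glued together by hand, assuming the openai-whisper and Coqui TTS packages; the model names and the reply_fn hook are placeholders for whatever you run locally, not anything Open WebUI itself uses:

```python
# Sketch: Whisper for STT, your local LLM in the middle, Coqui TTS
# for the spoken reply. Model names and reply_fn are placeholders.
import whisper
from TTS.api import TTS

stt = whisper.load_model("base")                   # Whisper STT
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")  # Coqui TTS (English voice)

def voice_turn(audio_path, reply_fn):
    text = stt.transcribe(audio_path)["text"]           # speech -> text
    reply = reply_fn(text)                              # text -> text via your LLM
    tts.tts_to_file(text=reply, file_path="reply.wav")  # text -> speech
    return "reply.wav"
```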
Although it's not local, the ChatGPT app can do what you want even on the free plan, and it does speak Japanese and Italian.
_moria_@reddit
Whisper is a bomb. Using the various variants (WhisperX, faster-whisper, etc.) you can do 3-4x real time (with only a 2080). The TTS part is terrible for everything that is not English or Chinese.
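For reference, a minimal faster-whisper sketch; the model size, device, and compute type are assumptions to adjust for your VRAM:

```python
# Sketch: faster-whisper transcription on a consumer GPU.
# "large-v3" and float16 are assumptions; smaller models run faster.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe("speech.wav", language="ja")
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```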
AnotherAvery@reddit
OuteTTS supports many languages and works well, but it's slow...
junior600@reddit (OP)
Oh, thanks! I'll take a look at it. Yeah, I know the ChatGPT app can do that and it's amazing… but it's time-limited, and I'd still prefer having something similar locally, haha.
FullOf_Bad_Ideas@reddit
Unmute is nice, but it's only English
AnotherAvery@reddit
And French I think?
Traditional_Tap1708@reddit
Here’s how I built it. All local models and pretty much realtime
https://github.com/taresh18/conversify
bio_risk@reddit
Even if the model is local, the system is not local if you have to use LiveKit Cloud.
radianart@reddit
As someone who is building a project with an LLM and voice input/output, I'd say it's very possible. It depends on how you define real time. With a strong GPU and enough VRAM, Whisper (probably the best STT) and an LLM can be very fast. I can't really say for sure since I only have 8 GB of VRAM, but a second or two from your phrase to the answer is reachable, I think.
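If you want to check that latency budget on your own hardware, a rough per-stage timer is enough; stt, llm, and tts below are placeholders for whatever components you actually run:

```python
# Sketch: time each stage of an STT -> LLM -> TTS turn to see where
# the latency goes. stt(), llm(), and tts() are placeholder callables.
import time

def timed(label, fn, *args, **kwargs):
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - t0:.2f}s")
    return out

def one_turn(audio_path, stt, llm, tts):
    text  = timed("STT", stt, audio_path)  # speech -> text
    reply = timed("LLM", llm, text)        # text -> text
    wav   = timed("TTS", tts, reply)       # text -> speech
    return wav
```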
BusRevolutionary9893@reddit
That's not voice to voice. That's voice to STT to LLM to TTS to voice.
urekmazino_0@reddit
It's very much possible. I have several systems running real-time voice chat with live avatars, if you know what that's for.
junior600@reddit (OP)
Can you also use anime characters as live avatars?
urekmazino_0@reddit
Yes
teleprint-me@reddit
🤣😅 I love this response.
Not_your_guy_buddy42@reddit
Try this one https://github.com/Lex-au/Vocalis
mbanana@reddit
This is basically a toy system, but it works quite well.