Is real-time voice-to-voice still science fiction?
Posted by junior600@reddit | LocalLLaMA | 21 comments
Hi everyone, as the title says: is it possible to have real-time voice-to-voice interaction running locally, or are we still not there yet?
I'd like to improve my speaking skills (including pronunciation) in English and Japanese, and I thought it would be great to have conversations with a local LLM.
It would also be nice to have something similar in Italian (my native language) for daily chats, but I assume it's not a very "popular" language to train on. lol
Double_Cause4609@reddit
"Science fiction" is a bit harsh.
It's also not a binary [yes/no] question; it's more of a spectrum.
For instance, does it count if you can do real-time voice-to-voice with 8xH100? That can be "local": you can download the model... it's just really expensive.
Similarly, what about quality? You might get a model running in real time, but with occasional hallucinations or artifacts, and those aren't things you want to pick up unintentionally while practicing pronunciation.
I'd say we're probably 60-70% of the way to real-time, accessible speech-to-speech models for casual conversation, and probably about 20-40% of the way to models of such quality and meta-cognition (with the ability to reflect on their own outputs for educational purposes, be aware of their inflections, etc.) that you would want to use them extensively for language learning.
It'll take a few more advancements, but we already know the way there; we just have to implement it.
Notably, as soon as someone trains a speculative decoding head for any of the existing speech models, that's probably what we need to make this mostly viable, though a diffusion speech-to-speech model would probably be ideal.
I'd say we're maybe about a year out (at most) from real time speech to speech (with possibly some need to customize the pipeline to your needs and available hardware).
So, not quite 100% of the way there, but calling it science fiction isn't quite fair when all the tools are already there and just need to be put together in the right order.
GrungeWerX@reddit
While I agree with your second percentage, your first is ridiculous. I speak to AI STS every day and it's pretty darn near talking to a real person. The only problem is persistent memory and personality consistency, which has led me to try building my own agentic AI system to address these issues. But as far as voice itself, we're more like 90% there, maybe slightly higher, if you're using ElevenLabs in a local pipeline.
Now, if you're only rating open-source voice models, then I would agree with you. I only referenced ElevenLabs because you CAN use it in a local pipeline with your own LLMs.
rainbowColoredBalls@reddit
Unrelated, but what's the SOTA on tokenizing voice without going through the STT route?
guigouz@reddit
The speech-to-text part works in Open WebUI (not sure which lib they use), but you can try Whisper for the transcription and Coqui TTS for the responses.
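A minimal sketch of that pipeline glued together by hand, assuming the openai-whisper and Coqui TTS packages; the model names and the reply_fn hook are placeholders for whatever you run locally, not anything Open WebUI itself uses:

```python
# Sketch: Whisper for STT, your local LLM in the middle, Coqui TTS
# for the spoken reply. Model names and reply_fn are placeholders.
import whisper
from TTS.api import TTS

stt = whisper.load_model("base")                   # Whisper STT
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")  # Coqui TTS (English voice)

def voice_turn(audio_path, reply_fn):
    text = stt.transcribe(audio_path)["text"]           # speech -> text
    reply = reply_fn(text)                              # text -> text via your LLM
    tts.tts_to_file(text=reply, file_path="reply.wav")  # text -> speech
    return "reply.wav"
```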
Although it's not local, the ChatGPT app can do what you want even on the free plan, and it does speak Japanese and Italian.
_moria_@reddit
Whisper is a bomb. Using the various variants (WhisperX, faster-whisper, etc.) you can do 3-4x real time (with only a 2080). The TTS part is terrible for everything that is not English or Chinese.
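For reference, a minimal faster-whisper sketch; the model size, device, and compute type are assumptions to adjust for your VRAM:

```python
# Sketch: faster-whisper transcription on a consumer GPU.
# "large-v3" and float16 are assumptions; smaller models run faster.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe("speech.wav", language="ja")
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```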
AnotherAvery@reddit
OuteTTS supports many languages and works well, but it's slow...
junior600@reddit (OP)
Oh, thanks! I'll take a look at it. Yeah, I know the ChatGPT app can do that and it's amazing… but it's time-limited, and I'd still prefer having something similar locally, haha.
FullOf_Bad_Ideas@reddit
Unmute is nice, but it's only English
AnotherAvery@reddit
And French I think?
Traditional_Tap1708@reddit
Here’s how I built it. All local models and pretty much realtime
https://github.com/taresh18/conversify
bio_risk@reddit
Even if the model is local, the system is not local if you have to use LiveKit Cloud.
radianart@reddit
As someone who is building a project with an LLM and voice input/output, I'd say it's very possible. It depends on how you define real time. With a strong GPU and enough VRAM, Whisper (probably the best STT) and an LLM can be very fast. I can't really say for sure since I only have 8 GB of VRAM, but a second or two from your phrase to the answer is reachable, I think.
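If you want to check that latency budget on your own hardware, a rough per-stage timer is enough; stt, llm, and tts below are placeholders for whatever components you actually run:

```python
# Sketch: time each stage of an STT -> LLM -> TTS turn to see where
# the latency goes. stt(), llm(), and tts() are placeholder callables.
import time

def timed(label, fn, *args, **kwargs):
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - t0:.2f}s")
    return out

def one_turn(audio_path, stt, llm, tts):
    text  = timed("STT", stt, audio_path)  # speech -> text
    reply = timed("LLM", llm, text)        # text -> text
    wav   = timed("TTS", tts, reply)       # text -> speech
    return wav
```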
BusRevolutionary9893@reddit
That's not voice to voice. That's voice to STT to LLM to TTS to voice.
urekmazino_0@reddit
It's very much possible. I have several systems running real-time voice chat with live avatars, if you know what that's for.
junior600@reddit (OP)
Can you also use anime characters as live avatars?
urekmazino_0@reddit
Yes
teleprint-me@reddit
🤣😅 I love this response.
Not_your_guy_buddy42@reddit
Try this one https://github.com/Lex-au/Vocalis
mbanana@reddit
This is basically a toy system, but it works quite well.