Real-time conversation with a character on your local machine
Posted by ResolveAmbitious9572@reddit | LocalLLaMA | View on Reddit | 33 comments
It also has a voice split function.
Sorry for my English =)
Shockbum@reddit
You are a hero!
delobre@reddit
Unfortunately, these TTS systems, such as Kokoro TTS, don’t support emotions yet, which makes the characters sound less authentic. I genuinely hope we’ll be able to stream something similar to Sesame in real time.
sophosympatheia@reddit
Chatterbox is getting close. Its voice cloning fidelity is great, and it can do emotional intonation surprisingly well. However, it doesn't support tags to help guide the emotion, so frequently you end up with outputs that don't fit the tone of the scene. But it's getting there. I wouldn't be surprised if within a year we have something that is roughly equivalent to Elevenlabs V3 that they just released.
ShengrenR@reddit
Yeah, Chatterbox is pretty nice, especially for the size. Zonos is the best to date, in my eyes, for steerable emotions, but it needs a lot of hand-holding to get 'that one good one'. I'd likely make a set of emotions via Zonos and use them as references for Chatterbox... once the streaming is cleaned up.
EuphoricPenguin22@reddit
Dia TTS is another one that has pretty decent expressive capabilities as well.
MrDevGuyMcCoder@reddit
Is this the one that only released a pickle, not safetensors?
EuphoricPenguin22@reddit
It has a safetensors weight available.
Turkino@reddit
Hopefully we can get an open source version of something like this in the coming months.
https://www.youtube.com/watch?v=zv_IoWIO5Ek
iwalg@reddit
Oh yeah, something like that would be totally wild. Damn, that V3 sounds good, real good!!
LordNikon2600@reddit
Go seek emotion from real people
YT_Brian@reddit
But then you have to deal with their circus and monkeys, when we could have a designer circus and monkeys instead?
VrFrog@reddit
Very polished. Great job!
Knopty@reddit
If the goal is to make it more realistic, the user should be able to interrupt the character, like in a real dialogue. The remaining unspoken context should then be deleted, or optionally converted to a short text summary of what the character intended to say.
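The interruption idea above can be sketched in a few lines: when the user barges in, the unspoken sentences are dropped from the playback queue and folded into a brief summary that stays in the chat context. This is a minimal illustration of the commenter's suggestion, not code from MousyHub; `summarize` is a hypothetical stand-in for an LLM summarization call.

```python
def summarize(sentences):
    # Hypothetical stand-in for a real LLM summarization call.
    return "(was about to say: " + " ".join(sentences) + ")"

class SpeechTurn:
    """One character turn, split into sentences queued for TTS playback."""

    def __init__(self, sentences):
        self.pending = list(sentences)  # not yet spoken aloud
        self.spoken = []

    def speak_next(self):
        # Move one sentence from the queue to the spoken transcript.
        if self.pending:
            self.spoken.append(self.pending.pop(0))

    def interrupt(self):
        """User barged in: drop unspoken text, keep a vague summary in context."""
        unspoken, self.pending = self.pending, []
        context = " ".join(self.spoken)
        if unspoken:
            context += " " + summarize(unspoken)
        return context

turn = SpeechTurn(["I think the plan is solid.", "First, we gather supplies."])
turn.speak_next()          # character gets one sentence out
print(turn.interrupt())    # spoken text plus a summary of the rest
```

The key design point is that the summary is text, so it drops straight back into the LLM's chat history without any special handling.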
Maleficent_Age1577@reddit
That's not a character, that's just a picture.
Witty-Forever-6985@reddit
Link when
jeffwadsworth@reddit
He already posted it.
the_general1@reddit
Any chance of sharing the github repo?
ResolveAmbitious9572@reddit (OP)
https://github.com/PioneerMNDR/MousyHub
Life_Machine_9694@reddit
Very nice - need a hero to replicate this for Mac and show us novices how to do it
ResolveAmbitious9572@reddit (OP)
MousyHub can be compiled on macOS, but you still need a hero to test it :)
kkb294@reddit
Waiting for the same 🤞
LocoMod@reddit
Very cool. Why do they talk so fast?
ResolveAmbitious9572@reddit (OP)
In the settings, I sped up the playback speed so that the video was not too long.
LocoMod@reddit
My patience thanks you for that. I have a webGPU implementation here that greatly simplifies deploying Kokoro. It allows for virtually unlimited and almost seamless generation. It might be helpful or it might not. :)
https://github.com/intelligencedev/manifold/blob/master/frontend/src/composables/useTtsNode.js
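A common way to get the "virtually unlimited and almost seamless" generation described above is to split the text into sentences and synthesize the next chunk while the current one plays. This is a language-agnostic sketch of that producer/consumer pattern (the linked implementation is JavaScript/WebGPU); `synthesize` is a hypothetical stand-in for the actual TTS call, e.g. Kokoro.

```python
import re
from queue import Queue
from threading import Thread

def split_sentences(text):
    # Naive splitter on sentence-ending punctuation; real pipelines
    # usually use smarter segmentation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def synthesize(sentence):
    # Hypothetical stand-in for the real TTS model call.
    return f"<audio:{sentence}>"

def stream_tts(text, play):
    """Synthesize chunk N+1 in the background while chunk N is playing."""
    chunks = Queue()

    def producer():
        for sentence in split_sentences(text):
            chunks.put(synthesize(sentence))
        chunks.put(None)  # end-of-stream marker

    Thread(target=producer, daemon=True).start()
    while (audio := chunks.get()) is not None:
        play(audio)

played = []
stream_tts("Hello there. How are you?", played.append)
print(played)
```

Because generation overlaps playback, the perceived latency is just the time to synthesize the first sentence, not the whole reply.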
Own-Potential-2308@reddit
Niceee brooo!
Chromix_@reddit
This reminds me of the voice chat in the browser that was posted a day before. The response latency seems even better there - maybe due to a different model size, or slightly different approach?
For those using Kokoro (like here) it might be of interest that there's somewhat working voice cloning functionality by now.
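Related to new Kokoro voices: since Kokoro ships its speakers as fixed style-embedding vectors, one community technique for creating a new speaker is simply a weighted blend of existing voice vectors (this is voice *mixing*, a simpler trick than the cloning work linked above, which fits an embedding to reference audio). A minimal sketch in pure Python, with tiny toy vectors standing in for the real embeddings:

```python
def blend_voices(voices, weights):
    """Weighted average of same-length style vectors; weights are normalized."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not all be zero")
    norm = [w / total for w in weights]
    dim = len(voices[0])
    if any(len(v) != dim for v in voices):
        raise ValueError("all voice vectors must have the same dimension")
    return [sum(w * v[i] for w, v in zip(norm, voices)) for i in range(dim)]

# Toy 4-dim stand-ins; real Kokoro style vectors are much larger.
voice_a = [0.2, 0.4, 0.1, 0.9]
voice_b = [0.6, 0.0, 0.5, 0.1]
mixed = blend_voices([voice_a, voice_b], [0.7, 0.3])
print([round(x, 2) for x in mixed])  # [0.32, 0.28, 0.22, 0.66]
```

The blended vector is then passed to the model exactly like a built-in voice, which is what makes this approach cheap compared to true cloning.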
ResolveAmbitious9572@reddit (OP)
The delay here is because I did not add a separate STT model for recognition, but used the speech recognition built into the browser (it turns out the browser is not bad at this) — a user with 8 GB of VRAM would not be able to run that many models on their machine otherwise. By the way, Kokoro runs on CPU only here. Kokoro developer, you are cool =)
Chromix_@reddit
Ah, nice that it runs with lower-end hardware then - this also means there's optimization potential for those with a high-end GPU.
Cool-Chemical-5629@reddit
I knew it was worth waiting for someone crazy enough to do this from scratch using these modern technologies. I mean it in a good way, good job! 😉
Expensive-Paint-9490@reddit
Will try it out!
ResolveAmbitious9572@reddit (OP)
MousyHub supports local models via the llama.cpp library (LLamaSharp).
Asleep-Ratio7535@reddit
Looks great, and you have different voices for different characters.
ResolveAmbitious9572@reddit (OP)
https://github.com/PioneerMNDR/MousyHub
This lightweight and functional app is an alternative to SillyTavern.