I know this isn’t technically an LLM but OmniVoice is FUCKING AMAZING.
Posted by Borkato@reddit | LocalLLaMA | 55 comments
Literally one shot voice cloning and it’s literally so easy. What the FUCK. It’s everything I’ve ever dreamed of.
zaypen@reddit
Currently using Qwen 3 TTS. Have you tried this one, and do you happen to have a comparison?
biogoly@reddit
It’s better than qwen3 tts. Super fast too.
meganoob1337@reddit
How is the VRAM footprint compared to Qwen3 TTS 0.6B? It's currently my Home Assistant TTS driver since it fits alongside an LLM in my VRAM.
Hans-Wermhatt@reddit
I run on Windows... I used to use Qwen 3 TTS, then switched to CosyVoice 3, and now just switched to OmniVoice; it's the only model I've gotten decent TTFT with. I'm getting 300-450 ms depending on sentence length, and the quality is much better than Kokoro. I actually think it's better than Qwen 3 and CosyVoice for English speakers. I assume those models are better for Chinese, but I was only happy with the English voices I tried.
Borkato@reddit (OP)
I have not unfortunately :(
basil232@reddit
Yeah, it's a great model. Too bad there isn't an implementation that runs well on CPU. They apparently have no plans to add that.
noposts4010@reddit
Wow, just gave this a try and I'm blown away by how easy it is. Runs flawlessly on my MBP.
ConsciousDissonance@reddit
Does the voice cloning work on samples longer than the 3-10 s range? A lot of the character in a voice comes from how specific words are pronounced, which may not be reflected in such short clips. It would be great if it could scale to longer samples, at least the 1-5 minute range or more. I'm thinking of the equivalent of ElevenLabs' instant and professional voice cloning.
biogoly@reddit
No, the zero-shot cloning will actually be worse if the sample is too long. It works best with about 10-15 seconds. Frankly, it's amazing how much nuance it can pick up from just a few seconds of audio. If you want the best and most consistent clone, you need to fine-tune the model on a substantial amount of audio (45+ minutes), ideally with a good variety of vocal prosody and emotional range.
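Since zero-shot cloning reportedly degrades on long references, one practical step is trimming a reference clip down to that 10-15 second sweet spot before handing it to the model. A minimal stdlib sketch (the `trim_wav` helper is hypothetical, not part of OmniVoice):

```python
import wave

def trim_wav(src: str, dst: str, max_seconds: float = 15.0) -> float:
    """Copy at most max_seconds of audio from src to dst.

    Returns the duration (in seconds) actually kept. Keeps the original
    sample rate, sample width, and channel count unchanged.
    """
    with wave.open(src, "rb") as w:
        params = w.getparams()
        rate = w.getframerate()
        keep = min(w.getnframes(), int(max_seconds * rate))
        frames = w.readframes(keep)
    with wave.open(dst, "wb") as w:
        # setparams copies the old frame count; the header is corrected
        # automatically when the file is closed after writeframes().
        w.setparams(params)
        w.writeframes(frames)
    return keep / rate
```

For MP3 or other compressed references you'd need something like ffmpeg or pydub instead, but WAV covers the common case.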
Borkato@reddit (OP)
From what I understand I don’t think so, but honestly it worked so well with the 3-10 second range I’m super surprised. I do wish I could do multiple clips though for multiple emotions unless I just haven’t tried yet?
Paradigmind@reddit
Can it alter cloned voices so that they just sound "alike"?
Borkato@reddit (OP)
I don’t think so; once they’re cloned they’re stuck I believe. But you can just make a new one, it takes a few seconds!
Paradigmind@reddit
I mean during the cloning process. Would be cool if there was a setting to control some degree of deviation. I'll need to install and check it out. Thanks for the reply.
caetydid@reddit
which languages are supported well?
nmfisher@reddit
Very impressive, most voice cloning fails for my accent (Australian) but this actually nailed it.
Stitch10925@reddit
Can you use it to make your models speak? If so, how?
jfufufj@reddit
Does it support like producing 10-20 mins of audio? I'm thinking of dubbing some videos
ShengrenR@reddit
for that use case, you likely want index-tts 2 specifically.
roculus@reddit
I've generated 40 minutes of audio in one shot. It didn't seem to have a problem doing it.
jfufufj@reddit
Can you control how long the audio file is? If it can, that'd be wild; we could definitely use it to dub videos then.
Borkato@reddit (OP)
Wait so you can just throw in an entire book and it just reads it??
Can you also do longer cloning? Like give it multiple 10-second snippets? Since I don’t think it can go over 10
roculus@reddit
I haven't tried using a sample voice more than like 15 seconds. 40 minutes just happened to be the longest story I had created in an LLM that I fed it. That 40 minutes was a slower speaking sample. I used an ASMR type whisper voice so I could listen to the story while going to sleep.
Borkato@reddit (OP)
That’s very interesting. I’m going to try for sure! Thank you so much
roculus@reddit
I just finished generating a 55-minute audio file using OmniVoice. It took 6 minutes, 45 seconds with an average-speed speaking voice (using an RTX 6000 PRO, similar speed to a 5090). It used less than 10GB VRAM.
That was with 32 inference steps (the default) with de-noise checked. I guess double that time if you wanted to use 64 steps.
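For context, those figures work out to roughly 8x faster than real time. A quick sanity check of the arithmetic (the 64-step estimate assumes, as the comment does, that wall time scales linearly with inference steps):

```python
# Real-time factor from the figures above: 55 min of audio in 6 min 45 s.
audio_s = 55 * 60           # 3300 s of generated speech
wall_s = 6 * 60 + 45        # 405 s of generation time
rtf = audio_s / wall_s      # ~8.15x faster than real time
print(f"real-time factor: {rtf:.2f}x")

# Doubling inference steps (32 -> 64) roughly doubles wall time:
wall_64 = wall_s * 2        # 810 s, i.e. ~13.5 min
print(f"estimated 64-step time: {wall_64 / 60:.1f} min")
```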
Borkato@reddit (OP)
Hmmm I don’t know, I don’t think so? I think you’d have to stitch them all together, but I haven’t really tried, hmm
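If you do end up stitching clips together manually, concatenating same-format WAV files is easy with the stdlib. A sketch (the `concat_wavs` helper is hypothetical and assumes every clip shares the same sample rate, sample width, and channel count):

```python
import wave

def concat_wavs(paths: list[str], dst: str) -> None:
    """Concatenate WAV files in order into dst.

    All inputs must share the same sample rate, sample width, and
    channel count; mismatched clips would need resampling first.
    """
    with wave.open(paths[0], "rb") as first:
        params = first.getparams()
    with wave.open(dst, "wb") as out:
        out.setparams(params)
        for p in paths:
            with wave.open(p, "rb") as w:
                fmt = (w.getframerate(), w.getsampwidth(), w.getnchannels())
                expected = (params.framerate, params.sampwidth, params.nchannels)
                assert fmt == expected, f"format mismatch in {p}"
                out.writeframes(w.readframes(w.getnframes()))
```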
Available_Hornet3538@reddit
What is this? Link?
Borkato@reddit (OP)
https://github.com/k2-fsa/OmniVoice
It's so fucking cool. The cloning is so quick and literally perfect for my use case, where I don't care too much about perfect quality.
ptear@reddit
Just before I go try myself.. is this fucking cool decent quality local voice cloning?
Borkato@reddit (OP)
Hahaha was it that obvious? :p
ptear@reddit
Ok.. I tried it. It's super late here. Worth it.
optimisticalish@reddit
I'm always interested in speedy quality. How quick is "quick", and what's the graphics-card VRAM being used?
Borkato@reddit (OP)
Let me check!
According to my process manager it’s using like 2.3GB VRAM, and it’s like, within seconds. So like, this whole thing, I’ll count the seconds: got it, 5 seconds! And that’s with a super slow voice. I can try on my 3090 if you’re curious
I’m on an RTX 3090 and RTX 3070, but I’m using it on the 3070
optimisticalish@reddit
That sounds good, thanks. I might try it.
Borkato@reddit (OP)
Apparently you can also stream it, which I don't think I'm doing? I had AI vibe-code it for me, so I didn't look too closely at the implementation; I'm not great at async coding even in the language I'm more familiar with.
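For anyone curious what the streaming pattern looks like in async code: the usual shape is an async generator of audio chunks that you consume as they arrive, which is also where TTFT numbers like the 300-450 ms quoted above come from. A minimal sketch with a stub generator (OmniVoice's actual streaming API may look different; `fake_tts_stream` is a stand-in, not the real interface):

```python
import asyncio
import time

async def fake_tts_stream(text: str):
    """Stub standing in for a real streaming TTS call.

    Yields raw audio chunks as bytes; each sleep simulates synthesis time.
    """
    for _ in range(5):
        await asyncio.sleep(0.01)
        yield b"\x00" * 3200  # 100 ms of 16 kHz 16-bit mono silence

async def play_stream(text: str) -> float:
    """Consume chunks as they arrive; return time-to-first-chunk (TTFT)."""
    start = time.perf_counter()
    ttft = None
    async for chunk in fake_tts_stream(text):
        if ttft is None:
            ttft = time.perf_counter() - start
        # A real client would hand `chunk` to an audio sink here
        # instead of discarding it.
    return ttft

ttft = asyncio.run(play_stream("hello"))
print(f"TTFT: {ttft * 1000:.0f} ms")
```

The point of streaming is exactly this: playback can start after the first chunk instead of waiting for the whole file.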
tazztone@reddit
A text-to-speech model.
tiger_ace@reddit
https://github.com/k2-fsa/OmniVoice
Accomplished_Bet_127@reddit
I think it should be fine to drop some new things here too, until they gain enough weight to get a category of their own. After all, we're long past discussing only LLaMA.
stonstad@reddit
How does it compare to ElevenLabs TTS?
SM8085@reddit
Do you have a clone example of someone public you can post to https://vocaroo.com/ ?
Have you messed with Qwen3-TTS? If so, how does it compare?
Accomplished_Bet_127@reddit
There's an online demo. Upload a voice example and you'll have what you want.
Borkato@reddit (OP)
Lol other than gay porn, not really! 😂 and no unfortunately I haven’t
-BananaStand-@reddit
I just got it running on my Mac!!!
Made Tobias Fünke read a rap about kittens and ice cream cones. The quality is great!
I just started teaching myself how to use local LLMs last week. I had never used LM Studio, Homebrew, Python, or even the terminal before. Learned a little about using Audacity tonight.
roculus@reddit
It also supports 600 languages, which seems to be important for the many people who immediately ask whether a TTS can speak their language.
https://github.com/k2-fsa/OmniVoice/blob/master/docs/languages.md
fredandlunchbox@reddit
Anyone know of a model that can do extension? Maybe this is just a code problem, but I'd like to be able to extend an existing clip and have it sound natural without changing prosody.
urarthur@reddit
TTS quality is basic.
Borkato@reddit (OP)
Really? Not for me! For me it captures the voice well enough for anything.
o0genesis0o@reddit
What would be the use case of voice cloning? Is it like to make voice over without actually having to record voice over?
Borkato@reddit (OP)
Yeah, or just having fun with your favorite creators’ voice lol
o0genesis0o@reddit
Oh, so those short videos that badly summarize a movie using the voice of the main actor are created like this? Learned something today.
Borkato@reddit (OP)
Oh probably! I haven’t seen those though haha
nickludlam@reddit
You're right, it's actually really good. At least on par with Voxtral
StardockEngineer@reddit
Omnivoice is crazy good.
Stepfunction@reddit
Actually, OmniVoice technically is an LLM. It uses Qwen 3 under the hood.
Borkato@reddit (OP)
Oh wow!