I know this isn’t technically an LLM but OmniVoice is FUCKING AMAZING.
Posted by Borkato@reddit | LocalLLaMA | 55 comments
Literally one shot voice cloning and it’s literally so easy. What the FUCK. It’s everything I’ve ever dreamed of.
zaypen@reddit
Currently using Qwen 3 TTS. Have you tried this one, and do you happen to have a comparison?
biogoly@reddit
It’s better than qwen3 tts. Super fast too.
meganoob1337@reddit
How is the VRAM footprint compared to Qwen3 TTS 0.6B? It's currently my Home Assistant TTS driver since it fits alongside an LLM in my VRAM.
Hans-Wermhatt@reddit
I run on Windows... I used to use Qwen 3 TTS, then switched to CosyVoice 3, and now just switched to OmniVoice; it's the only model I've gotten decent TTFT with. I'm getting 300-450 ms depending on sentence length, and the quality is much better than Kokoro. I actually think it's better than Qwen 3 and CosyVoice for English speakers. I assume those models are better for Chinese, but I was only happy with the English voices I tried.
Borkato@reddit (OP)
I have not unfortunately :(
basil232@reddit
Yeah, it's a great model. Too bad there isn't an implementation that runs well on CPU. They apparently have no plans to add that.
noposts4010@reddit
Wow, just gave this a try and I'm blown away by how easy it is. Runs flawlessly on my MBP.
ConsciousDissonance@reddit
Does the voice cloning work on samples longer than the 3-10 s range? A lot of the character in a voice comes from how specific words are pronounced, which may not be reflected in such short clips. It would be great if it could scale to longer samples, at least the 1-5 minute range or more. I'm thinking of the equivalent of ElevenLabs' instant and professional voice cloning.
biogoly@reddit
No, the zero-shot cloning will actually be worse if the sample is too long. It works best with about 10-15 seconds. Frankly, it's amazing how much nuance it can pick up from just a few seconds of audio. If you want the best and most consistent clone, you need to fine-tune the model on a substantial amount of audio (45+ minutes), ideally with a good variety of vocal prosody and emotional range.
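Since zero-shot cloning reportedly degrades on long references, one practical step is trimming a reference clip down to that 10-15 second sweet spot before handing it to the model. A minimal stdlib sketch (the `trim_wav` helper is hypothetical, not part of OmniVoice):

```python
import wave

def trim_wav(src: str, dst: str, max_seconds: float = 15.0) -> float:
    """Copy at most max_seconds of audio from src to dst.

    Returns the duration (in seconds) actually kept. Keeps the original
    sample rate, sample width, and channel count unchanged.
    """
    with wave.open(src, "rb") as w:
        params = w.getparams()
        rate = w.getframerate()
        keep = min(w.getnframes(), int(max_seconds * rate))
        frames = w.readframes(keep)
    with wave.open(dst, "wb") as w:
        # setparams copies the old frame count; the header is corrected
        # automatically when the file is closed after writeframes().
        w.setparams(params)
        w.writeframes(frames)
    return keep / rate
```

For MP3 or other compressed references you'd need something like ffmpeg or pydub instead, but WAV covers the common case.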
Borkato@reddit (OP)
From what I understand I don’t think so, but honestly it worked so well with the 3-10 second range I’m super surprised. I do wish I could do multiple clips though for multiple emotions unless I just haven’t tried yet?
Paradigmind@reddit
Can it alter cloned voices so that they just sound "alike"?
Borkato@reddit (OP)
I don’t think so; once they’re cloned they’re stuck I believe. But you can just make a new one, it takes a few seconds!
Paradigmind@reddit
I mean during the cloning process. Would be cool if there was a setting to control some degree of deviation. I'll need to install and check it out. Thanks for the reply.
caetydid@reddit
which languages are supported well?
nmfisher@reddit
Very impressive, most voice cloning fails for my accent (Australian) but this actually nailed it.
Stitch10925@reddit
Can you use it to make your models speak? If so, how?
jfufufj@reddit
Does it support like producing 10-20 mins of audio? I'm thinking of dubbing some videos
ShengrenR@reddit
for that use case, you likely want index-tts 2 specifically.
roculus@reddit
I've generated 40 minutes of audio in one shot. It didn't seem to have a problem doing it.
jfufufj@reddit
Can you control how long the audio file is? If it can, that'd be wild; we could definitely use it to dub videos then.
Borkato@reddit (OP)
Wait so you can just throw in an entire book and it just reads it??
Can you also do longer cloning? Like give it multiple 10-second snippets? Since I don’t think it can go over 10
roculus@reddit
I haven't tried using a sample voice more than like 15 seconds. 40 minutes just happened to be the longest story I had created in an LLM that I fed it. That 40 minutes was a slower speaking sample. I used an ASMR type whisper voice so I could listen to the story while going to sleep.
Borkato@reddit (OP)
That’s very interesting. I’m going to try for sure! Thank you so much
roculus@reddit
I just finished generating a 55-minute audio file using OmniVoice. It took 6 minutes, 45 seconds with an average-speed speaking voice (using an RTX 6000 PRO, similar speed to a 5090). It used less than 10GB VRAM.
That was with 32 inference steps (the default) with de-noise checked. I guess double that time if you wanted to use 64 steps.
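For context, those figures work out to roughly 8x faster than real time. A quick sanity check of the arithmetic (the 64-step estimate assumes, as the comment does, that wall time scales linearly with inference steps):

```python
# Real-time factor from the figures above: 55 min of audio in 6 min 45 s.
audio_s = 55 * 60           # 3300 s of generated speech
wall_s = 6 * 60 + 45        # 405 s of generation time
rtf = audio_s / wall_s      # ~8.15x faster than real time
print(f"real-time factor: {rtf:.2f}x")

# Doubling inference steps (32 -> 64) roughly doubles wall time:
wall_64 = wall_s * 2        # 810 s, i.e. ~13.5 min
print(f"estimated 64-step time: {wall_64 / 60:.1f} min")
```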
Borkato@reddit (OP)
Hmmm I don’t know, I don’t think so? I think you’d have to stitch them all together, but I haven’t really tried, hmm
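If you do end up stitching clips together manually, concatenating same-format WAV files is easy with the stdlib. A sketch (the `concat_wavs` helper is hypothetical and assumes every clip shares the same sample rate, sample width, and channel count):

```python
import wave

def concat_wavs(paths: list[str], dst: str) -> None:
    """Concatenate WAV files in order into dst.

    All inputs must share the same sample rate, sample width, and
    channel count; mismatched clips would need resampling first.
    """
    with wave.open(paths[0], "rb") as first:
        params = first.getparams()
    with wave.open(dst, "wb") as out:
        out.setparams(params)
        for p in paths:
            with wave.open(p, "rb") as w:
                fmt = (w.getframerate(), w.getsampwidth(), w.getnchannels())
                expected = (params.framerate, params.sampwidth, params.nchannels)
                assert fmt == expected, f"format mismatch in {p}"
                out.writeframes(w.readframes(w.getnframes()))
```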
Available_Hornet3538@reddit
What is this? Link?
Borkato@reddit (OP)
https://github.com/k2-fsa/OmniVoice
It's so fucking cool. The cloning is so quick and literally perfect for my use case, where I don't care too much about perfect quality.
ptear@reddit
Just before I go try myself.. is this fucking cool decent quality local voice cloning?
Borkato@reddit (OP)
Hahaha was it that obvious? :p
ptear@reddit
Ok.. I tried it. It's super late here. Worth it.
optimisticalish@reddit
I'm always interested in speedy quality. How quick is "quick", and what's the graphics-card VRAM being used?
Borkato@reddit (OP)
Let me check!
According to my process manager it’s using like 2.3GB VRAM, and it’s like, within seconds. So like, this whole thing, I’ll count the seconds: got it, 5 seconds! And that’s with a super slow voice. I can try on my 3090 if you’re curious
I’m on an RTX 3090 and RTX 3070, but I’m using it on the 3070
optimisticalish@reddit
That sounds good, thanks. I might try it.
Borkato@reddit (OP)
Apparently you can also stream it, which I don't think I'm doing? I had AI vibe-code it for me, so I didn't look too closely at the implementation; I'm not great at async coding even in the language I'm more familiar with.
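For anyone curious what the streaming pattern looks like in async code: the usual shape is an async generator of audio chunks that you consume as they arrive, which is also where TTFT numbers like the 300-450 ms quoted above come from. A minimal sketch with a stub generator (OmniVoice's actual streaming API may look different; `fake_tts_stream` is a stand-in, not the real interface):

```python
import asyncio
import time

async def fake_tts_stream(text: str):
    """Stub standing in for a real streaming TTS call.

    Yields raw audio chunks as bytes; each sleep simulates synthesis time.
    """
    for _ in range(5):
        await asyncio.sleep(0.01)
        yield b"\x00" * 3200  # 100 ms of 16 kHz 16-bit mono silence

async def play_stream(text: str) -> float:
    """Consume chunks as they arrive; return time-to-first-chunk (TTFT)."""
    start = time.perf_counter()
    ttft = None
    async for chunk in fake_tts_stream(text):
        if ttft is None:
            ttft = time.perf_counter() - start
        # A real client would hand `chunk` to an audio sink here
        # instead of discarding it.
    return ttft

ttft = asyncio.run(play_stream("hello"))
print(f"TTFT: {ttft * 1000:.0f} ms")
```

The point of streaming is exactly this: playback can start after the first chunk instead of waiting for the whole file.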
tazztone@reddit
A text-to-speech model.
tiger_ace@reddit
https://github.com/k2-fsa/OmniVoice
Accomplished_Bet_127@reddit
I think it should be fine to drop some new things here too, until they gain enough weight to get a category of their own. After all, we're long past discussing only LLaMA.
stonstad@reddit
How does it compare to ElevenLabs TTS?
SM8085@reddit
Do you have a clone example of someone public you can post to https://vocaroo.com/ ?
Have you messed with Qwen3-TTS? If so, how does it compare?
Accomplished_Bet_127@reddit
There's an online demo. Upload a voice example and you'll have what you want.
Borkato@reddit (OP)
Lol other than gay porn, not really! 😂 and no unfortunately I haven’t
-BananaStand-@reddit
I just got it running on my Mac!!!
Made Tobias Fünke read a rap about kittens and ice cream cones. The quality is great!
I just started teaching myself how to use local LLMs last week. I had never used LM Studio, Homebrew, Python, or even the terminal before. Learned a little about using Audacity tonight.
roculus@reddit
It also supports 600 languages, which seems to be important for the many people who immediately ask whether a TTS can speak their language.
https://github.com/k2-fsa/OmniVoice/blob/master/docs/languages.md
fredandlunchbox@reddit
Anyone know of a model that can do extension? Maybe this is just a code problem, but I'd like to be able to extend an existing clip and have it sound natural without changing prosody.
urarthur@reddit
TTS quality is basic.
Borkato@reddit (OP)
Really? Not for me! For me it captures the voice well enough for anything.
o0genesis0o@reddit
What would be the use case of voice cloning? Is it like to make voice over without actually having to record voice over?
Borkato@reddit (OP)
Yeah, or just having fun with your favorite creators’ voice lol
o0genesis0o@reddit
Oh, so those short videos that badly summarize a movie using the voice of the main actor are created like this? Learned something today.
Borkato@reddit (OP)
Oh probably! I haven’t seen those though haha
nickludlam@reddit
You're right, it's actually really good. At least on par with Voxtral
StardockEngineer@reddit
Omnivoice is crazy good.
Stepfunction@reddit
Actually, OmniVoice technically is an LLM. It uses Qwen 3 under the hood.
Borkato@reddit (OP)
Oh wow!