Best Open Source Voice Cloning if you have lots of reference audio?
Posted by SlaveToBuy@reddit | LocalLLaMA | 20 comments
I've been using ElevenLabs and burning a lot of money on regenerations because, for some reason, my cloned voice now speaks in multiple accents. I'm looking for something that keeps my cloned voice consistent; it doesn't need to be conversational. I have a lot of reference audio. Is it possible to get something identical to what ElevenLabs can do?
3080
32 GB of RAM
Any help would be appreciated.
archadigi@reddit
You are looking for ElevenLabs-like software at an affordable cost, and an option like VidBeVoice would be a good fit. Since you have a lot of reference audio, the trial and error itself is the big issue with credit-based services, as you lose credits on each attempt until you achieve consistency.
Use Pixbim Voice Clone AI. This is one of the best voice cloning apps and is ideal for your requirement. It is the most affordable offline voice cloning software available now. With a one-time fee, you get unlimited voice cloning usage for a lifetime. You can use it as many times as you want, and it will work perfectly on your machine, especially since you have an RTX 3080 with 12GB VRAM, which is faster than most normal laptops. You can collate or try out all the reference audio and achieve a consistent voice cloning output better than many other tools.
I use Pixbim Voice Clone AI for story narration, narrating books in my own voice. These narrated outputs run for several hours. I have completed more than 1,000 hours of narration with just a one-time fee. It is really meant for heavy users. I try different tones for each individual story and work on refining the voice-cloned output. If it is not consistent, I fine-tune the reference audio and recreate it. No matter how many times you try, it won’t cost you extra. It is just $59 for unlimited voice cloning usage.
k8-bit@reddit
I've found OmniVoice loses the plot with reference audio longer than 20 s. VibeVoice gobbles up 2 minutes of reference audio with great, if occasionally eccentric, results.
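For anyone who wants to test those length limits, here is a minimal sketch of cutting a reference clip down to a maximum duration. It assumes the reference is an uncompressed WAV file and uses only Python's standard `wave` module. The function name `trim_wav` and the file names are hypothetical, and the 20 s and 2 minute figures are just the observations reported above, not documented limits of either model.

```python
import wave

def trim_wav(src_path, dst_path, max_seconds):
    """Copy at most max_seconds of audio from src_path to dst_path.

    Returns the duration (in seconds) actually written.
    """
    with wave.open(src_path, "rb") as src:
        channels = src.getnchannels()
        sample_width = src.getsampwidth()
        rate = src.getframerate()
        # Never ask for more frames than the file contains.
        keep = min(src.getnframes(), int(max_seconds * rate))
        frames = src.readframes(keep)

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(channels)
        dst.setsampwidth(sample_width)
        dst.setframerate(rate)
        dst.writeframes(frames)

    return keep / rate
```

Something like `trim_wav("reference.wav", "reference_20s.wav", 20)` would produce a clip to try with the shorter limit, and the same call with `120` for the longer one. This only keeps the start of the file; picking the cleanest stretch of speech by ear is probably still worth doing.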
SlaveToBuy@reddit (OP)
OmniVoice didn't sound too good for me either. I was going to try VibeVoice but didn't get around to it since I've heard generation takes a long time.
k8-bit@reddit
VibeVoice via the Gradio interface starts streaming audio after about 15 seconds, allowing me to check output quality as it goes. This is on a 3090. I also happily use the q8 and q4 quantised versions in ComfyUI on a 16 GB 5060 Ti.
SlaveToBuy@reddit (OP)
How is the quality? I have a 3080 12 GB.
k8-bit@reddit
I really like it, it remains my fave, though you can't steer emotion as there are no tags. Bizarrely, it tends to pick up style/emotion from the surrounding context, so if I have:
The man read in charming narrative audiobook style:
"So there he was, standing in the street, wishing that GPUs were cheaper."
It will usually take the delivery style from the initial cue, which I then obviously have to cut from the output.
It also has a bizarre habit of singing if you happen to include song lyrics in the text. I couldn't get it to NOT sing "Ground control to Major Nick", even with the name not being "Tom".
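The cue trick above can be sketched as two small helpers: one that prefixes the line with the narrative cue, and one that drops the cue's audio from the start of the output. This is a hypothetical illustration only. Neither function is part of VibeVoice, `samples` is assumed to be a plain sequence of audio samples, and `cue_seconds` has to be found by ear since nothing here detects where the cue ends.

```python
def with_style_cue(cue, line):
    """Build the text to synthesize: a narrative cue, then the quoted line."""
    cue = cue.strip().rstrip(":")
    return f'{cue}:\n"{line.strip()}"'

def drop_cue_audio(samples, sample_rate, cue_seconds):
    """Remove the first cue_seconds of audio (the spoken cue) from the output."""
    start = int(cue_seconds * sample_rate)
    return samples[start:]

text = with_style_cue(
    "The man read in charming narrative audiobook style",
    "So there he was, standing in the street, wishing that GPUs were cheaper.",
)
print(text)
```

The printed text matches the two-line example above; the trimming step is the manual cut mentioned in the comment, just expressed in samples.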
SlaveToBuy@reddit (OP)
Thanks for this, I'll give it a shot.
D9scene@reddit
Try OmniVoice, it's quite good and fits into 8 GB of VRAM.
Stepfunction@reddit
I'd second this. OmniVoice is fantastic and generates audio very quickly.
Chocolava@reddit
I'm using OmniVoice for TTS on a machine with 8 GB of RAM. It takes about a minute and a half to generate 5 s of audio. This feels super slow to me. Do you have any suggestions on how to speed it up?
Stepfunction@reddit
If you're running on CPU, that sounds about right.
Chocolava@reddit
Thanks. Yes, I am on CPU. Unfortunately it is too slow for my purposes.
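To put the figures above in perspective, about 90 seconds of compute for 5 seconds of audio is a real-time factor of 18. A small sketch of that arithmetic follows; the function names are made up for illustration, and the one-hour projection assumes the speed stays constant over a long run, which nobody in this thread measured.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """Seconds of compute needed per second of generated audio."""
    return generation_seconds / audio_seconds

def estimated_generation_hours(audio_hours, rtf):
    """Projected compute time for a given amount of audio at a fixed RTF."""
    return audio_hours * rtf

rtf = real_time_factor(90, 5)              # 90 s to generate 5 s of audio
print(rtf)                                 # 18.0
print(estimated_generation_hours(1, rtf))  # 18.0 hours for a 1-hour audiobook
```

At that rate a single hour of narration would take most of a day on CPU, which is why the GPU options discussed elsewhere in the thread matter for audiobook-length work.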
SlaveToBuy@reddit (OP)
I've checked OmniVoice but there is no emotion. I fed it a minute-long reference audio and it sounds very monotone. I'm wondering if there is anything I can change? Will feeding it a 10-minute reference give better results?
SignificanceFast8449@reddit
Try Vox CPM2, it is an upgrade. I made a free voice clone software on top of it: freeclone.net
ASMellzoR@reddit
Chatterbox / Chatterbox Turbo / Qwen3 TTS, and such.
VibeVoice is high quality, but very slow. Nice for audiobooks but not so much for real-time conversation.
Chatterbox Turbo can also use emotion tags like
SlaveToBuy@reddit (OP)
I don't need real-time conversation. It's more for audiobooks so I'm okay waiting.
ASMellzoR@reddit
VibeVoice for sure then ;)
Clean-Appointment684@reddit
Chatterbox is pretty good at voice cloning, imho. Give it a try.
Sevealin_@reddit
Chatterbox has one-shot cloning that is pretty good. It just needs one clip with ~30 seconds of audio.
DrMissingNo@reddit
VibeVoice is a solid choice in my experience.
I haven't tried it yet, but Mistral's Voxtral seems pretty promising too.