VibeVoice quantized to 4 bit and 8 bit with some code to run it...
Posted by teachersecret@reddit | LocalLLaMA | 22 comments
Was playing around with VibeVoice and saw other people were looking for ways to run it on less than 24 GB of VRAM, so I did a little fiddling.
Here's a Hugging Face repo I put up with the 4-bit and 8-bit pre-quantized models, getting them down to sizes that might (barely) be crammed onto an 8 GB and 12 GB VRAM card, respectively. You might have to run headless to fit the 4-bit 7B in 8 GB of VRAM, it's really cutting it close, but both should run -fine- on a 12 GB+ card.
VibeVoice 4 bit and 8 bit Quantized Models
I also included some code to test them out, or to quantize them yourself, or if you're just curious how I did this:
https://github.com/Deveraux-Parker/VibeVoice-Low-Vram
I haven't bothered making a Gradio for this or anything like that, but there are some Python files in there to test inference, and it can be bolted into the existing VibeVoice Gradio easily.
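If you just want the gist of the bitsandbytes route without digging through the repo, here's a minimal sketch of 4-bit loading via transformers. The model class and repo ID below are placeholders (VibeVoice ships its own model and processor classes); the exact working calls are in the scripts linked above.
```python
# Minimal sketch of 4-bit loading with bitsandbytes via transformers.
# NOTE: AutoModelForCausalLM and the repo ID are placeholders here;
# VibeVoice uses its own model/processor classes (see the repo's scripts).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights (load_in_8bit=True for the 8-bit variant)
    bnb_4bit_quant_type="nf4",              # NF4 is the usual 4-bit choice
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
    bnb_4bit_use_double_quant=True,         # squeezes out a bit more VRAM
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/VibeVoice-1.5B",             # placeholder repo ID
    quantization_config=bnb_config,
    device_map="auto",
)
```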
A quick test:
https://vocaroo.com/1lPin5ISa2f5
chibop1@reddit
Is it possible to run it without bitsandbytes? Unfortunately bitsandbytes doesn't support MPS for Apple Silicon.
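For what it's worth, one way to sidestep bitsandbytes entirely is to load the unquantized weights in fp16 on MPS, if you have the unified memory for it. A rough, untested sketch (the model class and repo ID are placeholders, VibeVoice ships its own):
```python
# Rough sketch: skip bitsandbytes and run fp16 on Apple Silicon (MPS).
# Needs enough unified memory for the unquantized weights; untested.
import torch
from transformers import AutoModelForCausalLM  # placeholder; VibeVoice has its own class

device = "mps" if torch.backends.mps.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/VibeVoice-1.5B",  # placeholder repo ID
    torch_dtype=torch.float16,
).to(device)
```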
geopehlivanov83@reddit
Did you find a way? I have the same issue in ComfyUI on Mac.
chibop1@reddit
Apparently there's a PR for MPS, but I haven't tried it.
https://github.com/yhsung/bitsandbytes/pull/1
https://github.com/bitsandbytes-foundation/bitsandbytes/issues/252#issuecomment-3253303043
geopehlivanov83@reddit
How do I try this fork? Can you share the steps, please?
chibop1@reddit
No idea, I haven't tried it. You can ask an LLM.
Dragonacious@reddit
Can we install this locally?
RocketBlue57@reddit
The 7b model got yanked. If you've got it ...
teachersecret@reddit (OP)
I pulled the 8 bit down because people were saying it was having issues - I haven't had a chance to eyeball/reupload it yet. 4 bit works.
OrganicApricot77@reddit
What’s the inference time?
teachersecret@reddit (OP)
A bit faster than realtime on a 4090, and perhaps more importantly, it can stream with super low latency. If you're streaming, the first audio tokens arrive within a few tenths of a second, so playback can start almost instantly. 4-bit runs nearly as fast as the fp16.
No idea on slower/lower vram gpus, but presumably pretty quick based on what I'm seeing here. This level of quality at low latency is fantastic.
Full precision (fp16):
- Fastest performance (0.775x RTF - faster than real-time!)
- Uses 19.31 GB VRAM
- Best for high-end GPUs with 24GB+ VRAM

4-bit:
- Good performance (1.429x RTF - still reasonable)
- Uses only 7.98 GB VRAM

8-bit:
- Significant slowdown (2.825x RTF)
- Uses 11.81 GB VRAM
- The 8-bit quantization overhead makes it slower than 4-bit
- Best for offline batch processing where quality matters more than speed
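Reading the numbers above, RTF appears to be generation time divided by audio duration, so anything under 1.0 is faster than real-time. A quick sanity check, with generation times back-calculated from those RTFs for a hypothetical 60-second clip:
```python
# RTF (real-time factor) as used above: generation time / audio duration.
# RTF < 1.0 means the audio is generated faster than it plays back.
def rtf(generation_seconds: float, audio_seconds: float) -> float:
    return generation_seconds / audio_seconds

# Hypothetical 60-second clip; generation times back-calculated from the RTFs above.
print(rtf(46.5, 60.0))   # 0.775 -> full precision, faster than real-time
print(rtf(85.74, 60.0))  # 1.429 -> 4-bit, a bit slower than real-time
print(rtf(169.5, 60.0))  # 2.825 -> 8-bit, noticeably slower
```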
zyxwvu54321@reddit
Can you share the full code for using the 8-bit model? Like the other commenter, I am only getting empty noise.
teachersecret@reddit (OP)
I'll dig in later and eyeball it. It was working fine on my end, but it's possible I uploaded the wrong inference file for it (I might have uploaded my 4-bit script to the 8-bit folder, or an older version of the script, I'll have to check when I have a minute).
poli-cya@reddit
Wow, man, unbelievable. Even giving us benchmarks. Is it possible to make an FP8 quant and see how fast it runs on your 4090?
MustBeSomethingThere@reddit
It would be nice to have a longer output sample than 6 seconds
teachersecret@reddit (OP)
Then make one or go look at the existing long samples on vibevoice. I was just trying to quickly share the code/quants in case anyone else was messing with this, since I'd taken the time to make them. They work. Load one up and give it a nice long script.
MustBeSomethingThere@reddit
>Then make one or go look at the existing long samples on vibevoice.
I didn't mean to complain, but my point was that it would be helpful to have a longer output sample. This way, we could compare the output quality to that of the original weights. Some people may hesitate to download several gigabytes without knowing the quality beforehand. This is a common practice.
HelpfulHand3@reddit
I agree. It doesn't help that the sample provided seems to have issues, like it reading out the word "Speaker". What was the transcript? No quick summary of how it seems to perform vs full weights?
teachersecret@reddit (OP)
Just sharing something I did for myself.
I didn’t cherry-pick the audio, and the error was actually my fault: I didn’t include a new line before Speaker 2. Works fine. Shrug! Mess with it or don’t :p.
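For anyone else who hits the same thing: each speaker tag needs to start on its own line. A tiny sketch of a script formatted that way (assuming the Speaker N: convention from the samples):
```python
# Each "Speaker N:" tag on its own line; a missing newline before Speaker 2
# is what caused the word "Speaker" to be read aloud in the sample above.
script = (
    "Speaker 1: Hey, did you get the quantized model running?\n"
    "Speaker 2: Yeah, the 4-bit fits on my card with room to spare."
)
```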
poli-cya@reddit
Nah, I think it's weird to ask this. The guy has put in a ton of free work and it'd take you almost no time to download and make longer samples to post here in support if you cared that much about longer samples vs what he's provided.
HelpfulHand3@reddit
Your 4-bit sample displays the same instability as the 1.5B, with random music playing, but the speaking sounds good.
strangeapple@reddit
FYI: added a link to your GitHub in the TTS/STT megathread that I am managing.
Primary-Speaker-9896@reddit
Excellent job! I just managed to run the 4-bit quant on a 6 GB RTX 2060 at ~5-6 s per iteration. It consumes 6.7 GB of VRAM and fills the gap using system RAM. Overall slow, but it's nice seeing it run at all.