VibeVoice quantized to 4 bit and 8 bit with some code to run it...
Posted by teachersecret@reddit | LocalLLaMA | 22 comments
Was playing around with VibeVoice and saw other people were looking for ways to run it on less than 24 GB of VRAM, so I did a little fiddling.
Here's a Hugging Face repo I put up with the 4-bit and 8-bit pre-quantized models, getting them down to sizes that might (barely) be crammed onto an 8 GB and 12 GB VRAM card, respectively. You might have to run headless to fit the 4-bit 7B in 8 GB of VRAM, it's really cutting it close, but both should run -fine- on a 12 GB+ card.
VibeVoice 4 bit and 8 bit Quantized Models
I also included some code to test them out, or to quantize them yourself, or if you're just curious how I did this:
https://github.com/Deveraux-Parker/VibeVoice-Low-Vram
I haven't bothered making a Gradio for this or anything like that, but there are some Python files in there to test inference, and it can be bolted into the existing VibeVoice Gradio easily.
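If you just want the gist of the bitsandbytes route without digging through the repo, here's a minimal sketch of 4-bit loading via transformers. The model class and repo ID below are placeholders (VibeVoice ships its own model and processor classes); the exact working calls are in the scripts linked above.
```python
# Minimal sketch of 4-bit loading with bitsandbytes via transformers.
# NOTE: AutoModelForCausalLM and the repo ID are placeholders here;
# VibeVoice uses its own model/processor classes (see the repo's scripts).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights (load_in_8bit=True for the 8-bit variant)
    bnb_4bit_quant_type="nf4",              # NF4 is the usual 4-bit choice
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
    bnb_4bit_use_double_quant=True,         # squeezes out a bit more VRAM
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/VibeVoice-1.5B",             # placeholder repo ID
    quantization_config=bnb_config,
    device_map="auto",
)
```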
A quick test:
https://vocaroo.com/1lPin5ISa2f5
chibop1@reddit
Is it possible to run it without bitsandbytes? Unfortunately bitsandbytes doesn't support MPS for Apple Silicon.
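For what it's worth, one way to sidestep bitsandbytes entirely is to load the unquantized weights in fp16 on MPS, if you have the unified memory for it. A rough, untested sketch (the model class and repo ID are placeholders, VibeVoice ships its own):
```python
# Rough sketch: skip bitsandbytes and run fp16 on Apple Silicon (MPS).
# Needs enough unified memory for the unquantized weights; untested.
import torch
from transformers import AutoModelForCausalLM  # placeholder; VibeVoice has its own class

device = "mps" if torch.backends.mps.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/VibeVoice-1.5B",  # placeholder repo ID
    torch_dtype=torch.float16,
).to(device)
```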
geopehlivanov83@reddit
Did you find a way? I have the same issue in ComfyUI on Mac.
chibop1@reddit
Apparently there's a PR for MPS, but I haven't tried it.
https://github.com/yhsung/bitsandbytes/pull/1
https://github.com/bitsandbytes-foundation/bitsandbytes/issues/252#issuecomment-3253303043
geopehlivanov83@reddit
How do I try this fork? Can you share the steps, please?
chibop1@reddit
No idea, I haven't tried it. You can ask an LLM.
Dragonacious@reddit
Can we install this locally?
RocketBlue57@reddit
The 7b model got yanked. If you've got it ...
teachersecret@reddit (OP)
I pulled the 8 bit down because people were saying it was having issues - I haven't had a chance to eyeball/reupload it yet. 4 bit works.
OrganicApricot77@reddit
What’s the inference time?
teachersecret@reddit (OP)
A bit faster than realtime on a 4090, and perhaps more importantly, it can stream with super low latency. If you're streaming, the first audio tokens arrive within a few tenths of a second, so playback can start almost instantly. 4-bit runs nearly as fast as the fp16.
No idea on slower/lower vram gpus, but presumably pretty quick based on what I'm seeing here. This level of quality at low latency is fantastic.
Full precision (fp16):
- Fastest performance (0.775x RTF - faster than real-time!)
- Uses 19.31 GB VRAM
- Best for high-end GPUs with 24GB+ VRAM

4-bit:
- Good performance (1.429x RTF - still reasonable)
- Uses only 7.98 GB VRAM

8-bit:
- Significant slowdown (2.825x RTF)
- Uses 11.81 GB VRAM
- The 8-bit quantization overhead makes it slower than 4-bit
- Best for offline batch processing where quality matters more than speed
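Reading the numbers above, RTF appears to be generation time divided by audio duration, so anything under 1.0 is faster than real-time. A quick sanity check, with generation times back-calculated from those RTFs for a hypothetical 60-second clip:
```python
# RTF (real-time factor) as used above: generation time / audio duration.
# RTF < 1.0 means the audio is generated faster than it plays back.
def rtf(generation_seconds: float, audio_seconds: float) -> float:
    return generation_seconds / audio_seconds

# Hypothetical 60-second clip; generation times back-calculated from the RTFs above.
print(rtf(46.5, 60.0))   # 0.775 -> full precision, faster than real-time
print(rtf(85.74, 60.0))  # 1.429 -> 4-bit, a bit slower than real-time
print(rtf(169.5, 60.0))  # 2.825 -> 8-bit, noticeably slower
```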
zyxwvu54321@reddit
Can you share the full code for using the 8-bit model? Like the other commenter, I am only getting empty noise.
teachersecret@reddit (OP)
I'll dig in later and eyeball it. It was working fine on my end, but it's possible I uploaded the wrong inference file for it (I might have uploaded my 4-bit script to the 8-bit folder, or an older version of the script, I'll have to check when I have a minute).
poli-cya@reddit
Wow, man, unbelievable. Even giving us benchmarks. Is it possible to make an FP8 quant and see how fast it runs on your 4090?
MustBeSomethingThere@reddit
It would be nice to have a longer output sample than 6 seconds
teachersecret@reddit (OP)
Then make one or go look at the existing long samples on vibevoice. I was just trying to quickly share the code/quants in case anyone else was messing with this, since I'd taken the time to make them. They work. Load one up and give it a nice long script.
MustBeSomethingThere@reddit
>Then make one or go look at the existing long samples on vibevoice.
I didn't mean to complain, but my point was that it would be helpful to have a longer output sample. This way, we could compare the output quality to that of the original weights. Some people may hesitate to download several gigabytes without knowing the quality beforehand. This is a common practice.
HelpfulHand3@reddit
I agree. It doesn't help that the sample provided seems to have issues, like it reading out the word "Speaker". What was the transcript? No quick summary of how it seems to perform vs full weights?
teachersecret@reddit (OP)
Just sharing something I did for myself.
I didn’t cherry-pick the audio, and the error was actually my fault: I didn’t include a new line before Speaker 2. Works fine. Shrug! Mess with it or don’t :p.
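For anyone else who hits the same thing: each speaker tag needs to start on its own line. A tiny sketch of a script formatted that way (assuming the Speaker N: convention from the samples):
```python
# Each "Speaker N:" tag on its own line; a missing newline before Speaker 2
# is what caused the word "Speaker" to be read aloud in the sample above.
script = (
    "Speaker 1: Hey, did you get the quantized model running?\n"
    "Speaker 2: Yeah, the 4-bit fits on my card with room to spare."
)
```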
poli-cya@reddit
Nah, I think it's weird to ask this. The guy has put in a ton of free work and it'd take you almost no time to download and make longer samples to post here in support if you cared that much about longer samples vs what he's provided.
HelpfulHand3@reddit
Your 4-bit sample displays the same instability as the 1.5B, with random music playing, but the speaking sounds good.
strangeapple@reddit
FYI: added a link to your GitHub in the TTS/STT megathread that I am managing.
Primary-Speaker-9896@reddit
Excellent job! I just managed to run the 4-bit quant on a 6 GB RTX 2060 at ~5-6 s per iteration. It consumes 6.7 GB of VRAM and fills the gap using system RAM. Overall slow, but it's nice seeing it run at all.