Confusing results using Exllamav2 with 72b models on 48gb (Q4/Q8 cache, max_seq_len)

Posted by yuicebox@reddit | LocalLLaMA | 15 comments

I am running TabbyAPI in Docker on WSL2 on a Windows 11 desktop, with an RTX 4090 and an RTX 3090 (48GB total VRAM). It has mostly been amazing, but I've had some issues with larger models that I don't understand at all.

I was able to get Magnum v2 72b 4.0bpw to run at 8-15 tokens/second, and it seems like the two critical config changes were:

1. Changing cache_mode from Q4 to Q8.
2. Reducing max_seq_len much more than expected.

**Full results:**

* Q8 cache_mode, 24576 max_seq_len = ~8-15 tk/s, 5.5gb unused VRAM on 3090
* Q8 cache_mode, *32768* max_seq_len = ~2-4 tk/s, 4.5gb unused VRAM on 3090
* *Q4* cache_mode, 24576 max_seq_len = ~2-4 tk/s, 7.9gb unused VRAM on 3090
* Q4 cache_mode, *32768* max_seq_len = ~2-4 tk/s, 7.1gb unused VRAM on 3090

I would've thought Q4 cache would be faster, but it is definitely not. I also would've thought I could use a higher max_seq_len since I have VRAM to spare, but inference speed tanks if I do.

Can anyone help me understand this? Are there any other config settings I should edit?
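For a rough sense of how much VRAM the cache itself should need at each setting, here is a back-of-the-envelope sketch. It assumes the Qwen2 72B architecture that Magnum v2 72b is built on (80 layers, 8 KV heads, head dimension 128); those numbers are not stated in the post, and the function name is made up for illustration. Quantized caches also store scaling factors, so real usage will be somewhat higher than this estimate.

```python
def kv_cache_gib(max_seq_len, bits_per_element,
                 num_layers=80, num_kv_heads=8, head_dim=128):
    """Estimate KV cache size in GiB (assumed Qwen2-72B-style dimensions)."""
    # One key vector and one value vector per layer, per KV head, per token.
    elements_per_token = 2 * num_layers * num_kv_heads * head_dim
    total_bytes = max_seq_len * elements_per_token * bits_per_element / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Nominal bits per element; quantization scales are ignored here.
    cache_bits = {"FP16": 16, "Q8": 8, "Q4": 4}
    for mode, bits in cache_bits.items():
        for seq_len in (24576, 32768):
            size = kv_cache_gib(seq_len, bits)
            print(f"{mode:>4} cache, max_seq_len {seq_len}: ~{size:.2f} GiB")
```

By this estimate, going from 24576 to 32768 tokens with a Q8 cache adds about 1.25 GiB in total across both cards, so the extra context alone does not obviously explain the slowdown.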

JockY@reddit

The more you quantize the cache, the more compute it takes. In your example FP16 KV cache is fastest; Q8 is in the middle; Q4 is slowest.

You'll get much more speed out of Tabby/exl2 by enabling tensor parallel and speculative decoding (I use a very similar setup with 2x RTX A6000s and an RTX5000). I can get 40 tokens/second with Qwen2.5 72B Instruct exl2 quantized to 8.0bpw and FP16 KV cache.

Hints for tensor parallel: in addition to setting `tensor_parallel: true` you _must_ set the `gpu_split` config option (and `draft_gpu_split` if you're using speculative decoding), otherwise you don't get tensor parallel; you get some other GPU split that isn't as fast.

My GPUs have 32, 48 and 48GB of VRAM, so I set these options (in addition to the others, of course):

    tensor_parallel: true
    gpu_split_auto: false
    gpu_split: [29,45,45]
    draft:
      draft_gpu_split: [3,3,3]

I'm using the Qwen2.5 3B Instruct for the draft model. Good luck!
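The numbers in that config look like each card's total VRAM minus a fixed 3 GB set aside for the draft model (32-3, 48-3, 48-3). That is only a reading of the values above, not something the comment states, and `plan_gpu_splits` is a hypothetical helper, not part of TabbyAPI. A minimal sketch of the same arithmetic for other setups:

```python
def plan_gpu_splits(vram_gb, draft_reserve_gb):
    """Hypothetical helper: reserve a fixed slice of each GPU for the draft model.

    Returns (gpu_split, draft_gpu_split) as lists of GB values, one per GPU.
    """
    if any(total <= draft_reserve_gb for total in vram_gb):
        raise ValueError("draft reserve must be smaller than every GPU's VRAM")
    gpu_split = [total - draft_reserve_gb for total in vram_gb]
    draft_gpu_split = [draft_reserve_gb] * len(vram_gb)
    return gpu_split, draft_gpu_split


if __name__ == "__main__":
    # The three-GPU setup described in the comment above.
    main, draft = plan_gpu_splits([32, 48, 48], 3)
    print(f"gpu_split: {main}")
    print(f"draft_gpu_split: {draft}")
```

For a 4090 plus 3090 pair the input would be `[24, 24]`; how much to reserve for the draft model there is a judgment call that depends on which draft model is used.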

yuicebox@reddit (OP)

This is a really helpful explanation regarding the cache_mode, thank you.

I am still surprised that increasing context from 24k to 32k causes my performance to drop so much, considering I have over 4gb of unused VRAM. Does that seem right to you?

I used tensor_parallel a while back but I have it disabled currently. Does tensor parallel effectively double VRAM requirements by loading the full model twice?

I haven't tried speculative decoding at all yet, so I will have to read up on that and test.

JockY@reddit

I'm surprised all the time by the crazy unexpected things that come from working with this technology. It's like digital witchcraft sometimes.

Tensor parallel doesn't double VRAM requirements. The model isn't loaded twice; it's a strategy for splitting a large model (that wouldn't fit on a single GPU) across multiple GPUs, with each layer's weights divided between the cards so that all of them work on every token at the same time.

Speculative decoding is pretty great and it speeds things up a lot. My sweet spot is to pair Qwen2.5 3B Instruct with 72B Instruct. I tried the 1.5B as the draft model, but it lost its shit too often and would lapse into repetition loops. The 3B has been solid. Everything 8.0bpw.
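For anyone who hasn't met speculative decoding before, the idea can be sketched with toy stand-ins for the two models. This is the simple greedy variant, written for illustration only; it is not ExLlamaV2's implementation, and `speculative_step`, `toy_target` and `toy_draft` are made-up names. The small draft model guesses a few tokens ahead, and the big model keeps the guesses it agrees with.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int = 4) -> List[int]:
    """Return the tokens accepted in one draft-then-verify round (greedy)."""
    # 1. The cheap draft model proposes k tokens, one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target model checks each proposal. A real engine scores all k
    #    positions in one batched forward pass; a loop is used here for clarity.
    accepted: List[int] = []
    for token in proposed:
        expected = target(context + accepted)
        if expected != token:
            # First disagreement: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # 3. Every proposal matched, so the target also contributes one more token.
    accepted.append(target(context + accepted))
    return accepted


def toy_target(tokens: List[int]) -> int:
    """Toy 'large model': counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    """Toy 'small model': agrees with the target except right after a 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step(toy_target, toy_draft, [1]))
    print(speculative_step(toy_target, toy_draft, [5]))
```

The output always matches what the target model would have generated on its own; the speedup comes from accepting several tokens per expensive target pass when the draft guesses well. A draft model that drifts into repetition, like the 1.5B mentioned above, gets fewer guesses accepted.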

yuicebox@reddit (OP)

Okay, you have now truly blown my mind, because it appears that by enabling tensor parallelism, Q4 caching no longer noticeably reduces inference speeds.

JockY@reddit

That’s great news, you’re very welcome. Wait til you try speculative decoding… ;)

yuicebox@reddit (OP)

Alright, this blew my mind a little bit. I enabled tensor parallelism and the VRAM use across both cards seems to be the same, but it looks like the 3090 is doing some of the calculations during inference, where before I think the 4090 was maybe doing all of it. I'm not sure I understand exactly how the 4090 was doing calculations on tensors stored in the memory of the 3090. I guess that would need to get passed from the 3090 to the 4090 in some way, but I don't know the exact path or order of operations there. More testing to be done, but seems to be a small but definite improvement to tok/sec, even with the 3090 being a slower card than the 4090.
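The difference between the two modes can be illustrated with a toy two-layer model in plain Python. This is a conceptual sketch of how layer splitting and tensor parallelism are commonly described, not a description of ExLlamaV2 internals, and all function names are made up. With a layer split, each GPU runs only the layers it holds and hands the small hidden state to the next GPU, so the weights themselves never move. With tensor parallel, every layer's weight matrix is divided so both GPUs compute part of each layer.

```python
def matvec(weight_rows, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight_rows]


def layer_split(layer_a, layer_b, x):
    """'GPU 0' holds layer A, 'GPU 1' holds layer B; they run one after another."""
    hidden = matvec(layer_a, x)      # computed on "GPU 0"
    return matvec(layer_b, hidden)   # only `hidden` is passed to "GPU 1"


def tensor_parallel(layer_a, layer_b, x):
    """Each layer's rows are divided between two 'GPUs' that work together."""
    def split_matvec(weight_rows, vector):
        half = len(weight_rows) // 2
        top = matvec(weight_rows[:half], vector)      # "GPU 0" half
        bottom = matvec(weight_rows[half:], vector)   # "GPU 1" half
        return top + bottom                           # gather the two halves
    return split_matvec(layer_b, split_matvec(layer_a, x))


if __name__ == "__main__":
    layer_a = [[1, 2], [3, 4]]
    layer_b = [[0, 1], [1, 0]]
    x = [1, 1]
    print(layer_split(layer_a, layer_b, x))
    print(tensor_parallel(layer_a, layer_b, x))
```

Both functions produce the same result. In the first, only one "GPU" is busy at any moment; in the second, both contribute to every layer, at the cost of a gather step between layers.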

nite2k@reddit

For your draft model (3B Instruct), what quant are you using, e.g. Q8, Q4, etc.?

JockY@reddit

8.0bpw exl2

nite2k@reddit

got it thanks!

IllSkin@reddit

Out of curiosity, what happens when you use Q6 cache? Performance like Q8 or Q4?

yuicebox@reddit (OP)

It seems like Q6 performs better than Q4, but not as well as Q8.

Interestingly, all of my problems have seemingly been resolved by just enabling tensor_parallel. I am now getting fine inference speed and am able to use a higher context length regardless of cache type.

knownboyofno@reddit

You changed the cache mode from Q8 to Q4. Have you set both to the same cache mode?

yuicebox@reddit (OP)

what do you mean when you say "set both"?

knownboyofno@reddit

You can set the cache to be anything like Q4, Q6, etc. So, I meant to set the cache mode to Q8 for the QwQ 32B 4bit.

JockY@reddit

Yes, quantizing to Q4 requires more compute than Q8, which makes Q4 slower.