Confusing results using Exllamav2 with 72b models on 48gb (Q4/Q8 cache, max_seq_len)

Posted by yuicebox@reddit | LocalLLaMA

I am running TabbyAPI in Docker on WSL2 on a Windows 11 desktop, with an RTX 4090 and an RTX 3090 (48GB total VRAM). It has mostly been amazing, but I've had some issues with larger models that I don't understand at all.

I was able to get Magnum v2 72b 4.0bpw to run at 8-15 tokens/second, and it seems like the two critical config changes were:

1. Changing cache\_mode from Q4 to Q8.
2. Reducing max\_seq\_len much more than expected.

**Full results:**

* Q8 cache\_mode, 24576 max\_seq\_len = \~8-15 tk/s, 5.5gb unused VRAM on 3090
* Q8 cache\_mode, *32768* max\_seq\_len = \~2-4 tk/s, 4.5gb unused VRAM on 3090
* *Q4* cache\_mode, 24576 max\_seq\_len = \~2-4 tk/s, 7.9gb unused VRAM on 3090
* Q4 cache\_mode, *32768* max\_seq\_len = \~2-4 tk/s, 7.1gb unused VRAM on 3090

I would've thought Q4 cache would be faster, but it is definitely not. I also would've thought I could use a higher max\_seq\_len since I have VRAM to spare, but inference speed tanks if I do.

Can anyone help me understand this? Are there any other config settings I should edit?
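For reference, here is a rough back-of-the-envelope sketch of how much VRAM the KV cache alone should take at the cache modes and context lengths above. The architecture numbers are assumptions, not something I measured: 80 layers, 8 KV heads, and a head dimension of 128, which is what I believe a Qwen2-72B-style model uses (check the model's `config.json`). It also ignores the small overhead of quantization scales, so treat the output as a ballpark only.

```python
# Rough KV-cache size estimate.
# ASSUMPTION: Qwen2-72B-style architecture (80 layers, 8 KV heads, head dim 128).
# ASSUMPTION: quantized caches cost exactly 8 or 4 bits per element (scale overhead ignored).

GIB = 1024 ** 3


def kv_cache_bytes(max_seq_len, bits_per_element,
                   num_layers=80, num_kv_heads=8, head_dim=128):
    """Bytes needed to hold keys and values for max_seq_len tokens."""
    # Factor of 2 covers one key vector and one value vector per layer and KV head.
    elements_per_token = 2 * num_layers * num_kv_heads * head_dim
    return max_seq_len * elements_per_token * bits_per_element / 8


def estimate_table():
    """Return (cache_mode, max_seq_len, size_in_GiB) for the tested combinations."""
    rows = []
    for mode, bits in (("FP16", 16), ("Q8", 8), ("Q4", 4)):
        for seq_len in (24576, 32768):
            rows.append((mode, seq_len, kv_cache_bytes(seq_len, bits) / GIB))
    return rows


if __name__ == "__main__":
    for mode, seq_len, gib in estimate_table():
        print(f"{mode:>4} cache, {seq_len} max_seq_len: ~{gib:.2f} GiB")
```

Under those assumptions, going from 24576 to 32768 at Q8 adds about 1.25 GiB of cache, which is in the same ballpark as the roughly 1 GB drop in unused VRAM I measured. So the VRAM numbers look plausible; the estimate says nothing about why the speed drops.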