Confusing results using Exllamav2 with 72b models on 48gb (Q4/Q8 cache, max_seq_len)
Posted by yuicebox@reddit | LocalLLaMA | 15 comments
I am running TabbyAPI in Docker on WSL2 on a Windows 11 desktop, with an RTX 4090 and an RTX 3090 (48GB total VRAM). It has mostly been amazing, but I've had some issues with larger models that I don't understand at all.
I was able to get Magnum v2 72b 4.0bpw to run at 8-15 tokens/second, and it seems like the two critical config changes were:
1. Changing cache\_mode from Q4 to Q8.
2. Reducing max\_seq\_len much more than expected.
**Full results:**
* Q8 cache\_mode, 24576 max\_seq\_len = \~8-15 tk/s, 5.5 GB unused VRAM on 3090
* Q8 cache\_mode, *32768* max\_seq\_len = \~2-4 tk/s, 4.5 GB unused VRAM on 3090
* *Q4* cache\_mode, 24576 max\_seq\_len = \~2-4 tk/s, 7.9 GB unused VRAM on 3090
* Q4 cache\_mode, *32768* max\_seq\_len = \~2-4 tk/s, 7.1 GB unused VRAM on 3090
I would've thought Q4 cache would be faster, but it is definitely not.
I also would've thought I could use a higher max\_seq\_len since I have VRAM to spare, but inference speed tanks if I do.
Can anyone help me understand this? Are there any other config settings I should edit?
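For anyone who wants to sanity-check the VRAM numbers above, here is a rough back-of-the-envelope sketch of how large the KV cache should be for each combination. The layer count, KV head count, and head dimension are assumptions (Qwen2-72B-style values, which I believe Magnum v2 72b inherits; check the model's config.json), and the bytes-per-element figures ignore quantization scales and other overhead, so treat the output as an estimate only. The function and constant names are made up for illustration.

```python
# Rough KV cache size estimate. All model-shape constants below are ASSUMED
# (Qwen2-72B-style); verify them against the model's config.json.
NUM_LAYERS = 80      # assumed number of transformer layers
NUM_KV_HEADS = 8     # assumed number of key/value heads (grouped-query attention)
HEAD_DIM = 128       # assumed dimension per attention head

# Approximate bytes per cached element. Ignores quantization scales/overhead.
BYTES_PER_ELEM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}


def kv_cache_gib(max_seq_len: int, cache_mode: str) -> float:
    """Estimate KV cache size in GiB for a given context length and cache mode."""
    # Factor of 2 covers keys and values, stored for every layer and KV head.
    elems_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM
    total_bytes = elems_per_token * max_seq_len * BYTES_PER_ELEM[cache_mode]
    return total_bytes / 2**30


if __name__ == "__main__":
    for mode in ("Q8", "Q4"):
        for seq_len in (24576, 32768):
            print(f"{mode} cache, max_seq_len={seq_len}: "
                  f"~{kv_cache_gib(seq_len, mode):.2f} GiB")
```

Under these assumptions, the cache is a few GiB in every configuration, and going from 24576 to 32768 adds roughly 1.25 GiB at Q8. That is in the same ballpark as the change in unused VRAM listed above, so the memory side looks plausible; this arithmetic says nothing about why the speed drops.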