TurboQuant + MTP for ROCm (llama.cpp)
Posted by DrBearJ3w@reddit | LocalLLaMA | View on Reddit | 14 comments
TL;DR: I got TBQ4 KV cache + MTP working on AMD ROCm for RX 7900 XTX / RDNA3 / gfx1100 in llama.cpp. Main win: 64k context fits on 24 GB VRAM and remains usable.
Branch: tbq4-rdna3-experiment (https://github.com/DrBearJew/llama.cpp/tree/tbq4-rdna3-experiment)
I dug into TurboQuant / TBQ4 + MTP on AMD because the existing AMD paths were incomplete or broken for my setup. This branch uses the ROCm VEC Flash Attention path with inline TBQ4 dequant.
Test setup:
- RX 7900 XTX, 24 GB
- RDNA3 / gfx1100
- ROCm / HIP
- Qwen3.6-27B Q4_K_M MTP GGUF
- tbq4_0 KV cache
- MTP with --spec-draft-n-max 3
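A minimal sketch of the invocation this setup implies, assuming the branch keeps upstream llama.cpp's cache-type flags and adds `tbq4_0` as an accepted value (only `--spec-draft-n-max` appears in the post; the other flags are upstream llama.cpp conventions, so check `--help` on the branch):

```shell
# Hypothetical invocation for the tbq4-rdna3-experiment branch (ROCm/HIP build).
# tbq4_0 as a cache type is branch-specific; upstream only knows f16/q8_0/q4_0/etc.
./llama-server \
  -m qwen-27b-q4_k_m-mtp.gguf \          # placeholder model path
  -ngl 99 \                              # offload all layers to the 7900 XTX
  -c 65536 \                             # 64k context
  --cache-type-k tbq4_0 \                # TBQ4 on K
  --cache-type-v tbq4_0 \                # TBQ4 on V
  --spec-draft-n-max 3                   # MTP draft depth from the post
```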
Current numbers:
- tbq4_0, 64k ctx: 38–54 tok/s, ~20 GB VRAM
- tbq4_0, 16k ctx: 36–38 tok/s, ~17 GB VRAM
- Prefill: 537.7 tok/s at 16k; 360.8 tok/s in the 64k test
- q8_0 baseline: ~49.8 tok/s at 16k, ~31 tok/s at 32k, ~22–23 GB VRAM
Caveats:
- RX 7900 XTX is RDNA3 / gfx1100, not RDNA3.5.
- RDNA3.5 / RDNA4 are enabled but untested.
- RotorQuant / PlanarQuant / IsoQuant are present but not validated.
- These are reported points from separate runs, not a clean scaling curve.
Happy to have new testers.
Useful bug reports > hype.
mmhorda@reddit
I managed to run it with Vulkan + MTP (no TurboQuant), 64k context + vision, and it gives me about 50 t/s, give or take 1–2 tokens depending on the run. Memory stays around 22 GB, same GPU.
Try Vulkan, it seems to be significantly faster. Also, I use MTP with --spec-draft-n-max 2; 3 seems to behave oddly and is noticeably slower, especially on long prompts.
Anbeeld@reddit
Q4 + 64k context in 24 GB? It can do much better.
DrBearJ3w@reddit (OP)
TurboQuant has almost the same compression as q4, but quality closer to f8. I think 128k is possible. On a small GPU like the 7900 XTX, more would be too slow, but with 2 GPUs it's nice. Roughly 2 GB per 16k of cache.
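The "roughly 2 GB per 16k" figure follows the standard KV-cache sizing formula. A quick sanity-check script, where the layer/head numbers are placeholder assumptions rather than the actual model config, so the absolute GiB values are illustrative only:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bits_per_elem):
    """Size of a K+V cache: 2 tensors (K and V) per layer,
    each holding n_kv_heads * head_dim values per token,
    stored at bits_per_elem bits (incl. quantization scale overhead)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bits_per_elem / 8

# Placeholder model shape (NOT the real config): 48 layers, 8 KV heads, head_dim 128.
gib = 1024 ** 3
at_4bit = kv_cache_bytes(48, 8, 128, 16_384, 4.5) / gib  # ~4.5 bits with scales
at_8bit = kv_cache_bytes(48, 8, 128, 16_384, 8.5) / gib
print(f"16k ctx: {at_4bit:.2f} GiB at ~4-bit vs {at_8bit:.2f} GiB at ~8-bit")
```

The exact number depends on the real layer count, GQA head count, and the quant format's per-block overhead, but the linear-in-context scaling is why halving cache width roughly doubles the context that fits.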
Anbeeld@reddit
I mean the Q4 model. Why's everyone obsessed with cache quality while running a model that's dumbed down this much...
I'm using Q5 + 120k + DFlash on a single 3090 as a safe option, but I previously confirmed it can do 200k on a fresh Windows restart, despite Windows stealing some VRAM.
I made a fork for that, but I don't own an AMD card at the moment, so I'm not sure how it works there. Someone made a related PR, but still. https://github.com/Anbeeld/beellama.cpp
DrBearJ3w@reddit (OP)
This fork should work across AMD/NVIDIA. D-Flash stagnates past 4k pretty fast and is very VRAM-hungry. But as an AMD user I don't have any option besides HipFire. So good for you if it works 😀
Anbeeld@reddit
Bro I literally told you I can run Q5 + 200k cache in 24 GB and you hit me with "DFlash is very VRAM hungry"?
CryptoStef33@reddit
3090 ≠ 7900 XTX
Anbeeld@reddit
Q5 suddenly takes up 30 GB if you're on AMD, or what?
nasone32@reddit
Yeah, with TurboQuant or Q4 KV you should be able to do much more than 64k. Could you try how much? Out of curiosity, not that it's really usable.
Because I think 64k is borderline doable with Q8/Q8. I use 56k with Q8/Q8 and it works fine.
Two things I read somewhere that might be useful
1) Looks like the latest llama.cpp builds already have vector rotations similar to what TurboQuant is doing, so in reality Q4 KV may be very comparable to TurboQuant but faster. So I'm not sure TurboQuant is really better; this needs verification. If you've been avoiding Q4 out of old habit, you might want to check this.
2) Quantization hurts much more on K than on V, so one option is to go Q8 for K and Q4 for V if you don't need extremely long context; it's potentially a bit faster too. Again, things I read around, not tested by me.
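On upstream llama.cpp the mixed setup in (2) maps directly onto the per-tensor cache flags; these flags exist upstream, though whether the TBQ4 branch accepts mixed types is untested here:

```shell
# Heavier quantization on V than K, using upstream llama.cpp cache flags.
# Note: a quantized V cache requires flash attention (-fa) in upstream builds.
./llama-server -m model.gguf -c 32768 -fa \
  --cache-type-k q8_0 \   # keep K at 8-bit, where quantization hurts most
  --cache-type-v q4_0     # V tolerates 4-bit better
```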
DrBearJ3w@reddit (OP)
Feel free to benchmark it against upstream q4_0/q4_0 and q8_0/q4_0 KV cache. llama.cpp already has solid quantized KV support, so TBQ4 needs to prove a real advantage: longer context, better decode speed, better quality, or better ROCm behavior. I'll report back when I have time.
Inevitable-Log5414@reddit
The Vulkan-until-32k, ROCm-TBQ4-past-that split is a legit niche - Vulkan doesn't have a TBQ4 KV cache path, so once you cross the VRAM wall there's literally no Vulkan option. Underrated work. Will try to test the branch on my XTX and file useful bugs rather than vibes.
DrBearJ3w@reddit (OP)
Thanks. Appreciate any feedback.
Formal-Exam-8767@reddit
Thanks for sharing.
How does it compare to Vulkan?
DrBearJ3w@reddit (OP)
Vulkan should be faster up to 32k. TurboQuant is very VRAM-friendly, and I just don't like q4_0 KV cache.