Qwen3 27B FP8 + TurboQuant on RTX 5090 - anyone tried?
Posted by Clasyc@reddit | LocalLLaMA | 23 comments
Do I understand correctly, based on this comment, that I could potentially fit the Qwen 3 27B FP8-precision model with around 256K of context entirely in my RTX 5090's VRAM, with the help of TurboQuant compression? What state is it in now in llama.cpp? Is it usable, and has anyone tried it?
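(For a rough sanity check on the context part, the KV-cache size can be estimated from the model's architecture. A hedged Python sketch follows; the layer/head numbers are placeholders for a 27B-class dense model with grouped-query attention, not the model's actual config, which lives in its config.json:)

```python
# Back-of-envelope KV-cache size estimate.
# All architecture numbers below are placeholders; read the real values
# from the model's config.json before trusting any estimate.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    # K and V each store n_layers * n_kv_heads * head_dim values per token,
    # hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical 27B-class dense model with grouped-query attention:
n_layers, n_kv_heads, head_dim = 48, 8, 128
ctx = 256 * 1024

fp16_kv = kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, 2)  # 16-bit cache
q8_kv   = kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, 1)  # ~8-bit cache

print(f"fp16 KV cache at 256K ctx: {fp16_kv / 2**30:.1f} GiB")
print(f"q8   KV cache at 256K ctx: {q8_kv / 2**30:.1f} GiB")
```

With these placeholder numbers an uncompressed 256K cache alone runs to tens of GiB, which is why some form of KV compression is needed before the context even begins to fit next to the weights on a 32 GB card.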
dampflokfreund@reddit
Just use q4_0 on llama.cpp. It uses rotations now and it's better than TurboQuant.
YourNightmar31@reddit
Wait, I don't understand. Q4 is still much lower quality at long context than Q8. You're saying it's better than TurboQuant? With the Tom fork we had the `turbo3` option, which was comparable to Q8 at way smaller VRAM; what's the equivalent option now? Does Q8 in llama.cpp now also use way less VRAM than Q8 did before? If so, when did this change?
lolwutdo@reddit
Any idea if it's better to use a higher-quant model with q4_0 KV, or a lower-quant model with q8_0 KV?
edsonmedina@reddit
does that also apply to Q8_0?
dampflokfreund@reddit
Yup.
edsonmedina@reddit
Do you have a link? Would love to know more.
dampflokfreund@reddit
https://github.com/ggml-org/llama.cpp/pull/21038
edsonmedina@reddit
Interesting.
This seems to reduce quant degradation, but it doesn't replace TurboQuant (i.e. long-context efficiency).
No_Algae1753@reddit
Yes me too!
fragment_me@reddit
Yeah, I've yet to see TurboQuant perplexity and KLD benchmarks beat llama.cpp q4_0. And FYI to OP: the rotation dampflok is talking about is enabled by default.
Dany0@reddit
RotorQuant was promising, but it's hit-or-miss at best IMO.
There is a philosophical/mathematical argument to be made for ternary being the most efficient way to store and compute LLMs. So in principle, anything that gets us closer to that, IF we were convinced by the argument, is good.
But for now I'll stick to attn-rot. I like it. It's like an old boot, you know it well
No_Algae1753@reddit
Since when do they use rotations?
shansoft@reddit
I am using TheTom's turbo quant variant and I can fit up to a 260K context window while running unsloth 3.6 27B UD5 with the turbo4 setting.
YourNightmar31@reddit
I'm using spiritbuun/llama-cpp-turboquant-cuda in llama-swap (https://gist.github.com/konradish/35b7f1a7e97f0b9aeb866226c5880d34) but turboquant doesn't seem to be working :(
I asked Claude Sonnet to inspect my logs and it says this:
Hytht@reddit
The FP8 model by itself takes more than 30 GB of VRAM.
WarmRestart157@reddit
Is there a significant difference between FP8 and Q4? And what about Q5 and Q6?
Hytht@reddit
Q4 is almost half the size for somewhat reduced accuracy. It's not that bad with large models. You can check the total file size on Hugging Face for each quant to get a rough idea of how much VRAM the weights will take.
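(The weight-size arithmetic behind that advice can be sketched directly. A hedged example, assuming ~4.5 effective bits/weight for Q4_0 (4-bit values plus per-block scales) and ignoring GGUF metadata overhead, so the Hugging Face file size remains the authoritative number:)

```python
def weight_vram_gib(n_params_b, bits_per_weight):
    """Rough weight-only VRAM estimate: params * bits / 8, in GiB.

    Ignores file-format overhead and runtime buffers, so treat the
    result as a lower bound.
    """
    return n_params_b * 1e9 * bits_per_weight / 8 / 2**30

for name, bpw in [("FP16", 16), ("FP8", 8), ("Q4_0 (~4.5 bpw eff.)", 4.5)]:
    print(f"27B @ {name}: ~{weight_vram_gib(27, bpw):.1f} GiB")
```

This makes the thread's numbers plausible: a 27B model at FP8 is already ~25 GiB of weights before any context, while a ~4.5 bpw quant leaves real headroom on a 32 GB card.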
Clasyc@reddit (OP)
Ahh, for some reason I had the wrong estimates in my head.
b1231227@reddit
I'm using two RTX 3060 12 GB cards. Below are my model parameters; it runs quite well, and I'm using it for coding.
@echo off
F:\ai_system\llama-cpp-turboquant-win-cuda\build\bin\llama-server.exe ^
 -m F:\ai_models\Qwen3.5\27B\Qwopus3.5-27B-v3.5-Q4_K_M.gguf ^
 -ngl 99 ^
 -ts 49,51 ^
 -c 131072 ^
 -n 8192 ^
 -b 2048 ^
 -np 1 ^
 -ctk q8_0 ^
 -ctv turbo3 ^
 -fa auto ^
 --temp 0.66 ^
 --top-k 20 ^
 --top-p 0.95 ^
 --min-p 0.0 ^
 --repeat-penalty 1.0 ^
 --repeat-last-n 64 ^
 --xtc-probability 0 ^
 --mirostat 0 ^
 --samplers "top_k;top_p;temperature" ^
 --reasoning on ^
 --reasoning-budget 1024 ^
 --seed -1 ^
 --host 0.0.0.0 ^
 --port 8000 ^
 --jinja ^
 --no-warmup
autisticit@reddit
What are your tokens/s? Thanks
b1231227@reddit
9.8 t/s
b1231227@reddit
Qwopus3.5-27B-v3.5-Q4_K_M 11.6 t/s
Ok-Internal9317@reddit
My final dream rig is now one that can run a 30B-sized dense model at FP16 at 100 tok/s generation and 1500 tok/s prompt processing.
I don't think there is any need for anything beyond that at this pace of development