Qwen3 27B FP8 + TurboQuant on RTX 5090 - anyone tried?
Posted by Clasyc@reddit | LocalLLaMA | 23 comments
Do I understand correctly, based on this comment, that I could potentially fit the Qwen 3 27B FP8-precision model with around 256K of context entirely in my RTX 5090's VRAM, with the help of TurboQuant compression? What state is it in now in llama.cpp? Is it usable, and has anyone tried it?
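(For a rough sanity check on the context part, the KV-cache size can be estimated from the model's architecture. A hedged Python sketch follows; the layer/head numbers are placeholders for a 27B-class dense model with grouped-query attention, not the model's actual config, which lives in its config.json:)

```python
# Back-of-envelope KV-cache size estimate.
# All architecture numbers below are placeholders; read the real values
# from the model's config.json before trusting any estimate.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    # K and V each store n_layers * n_kv_heads * head_dim values per token,
    # hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical 27B-class dense model with grouped-query attention:
n_layers, n_kv_heads, head_dim = 48, 8, 128
ctx = 256 * 1024

fp16_kv = kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, 2)  # 16-bit cache
q8_kv   = kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, 1)  # ~8-bit cache

print(f"fp16 KV cache at 256K ctx: {fp16_kv / 2**30:.1f} GiB")
print(f"q8   KV cache at 256K ctx: {q8_kv / 2**30:.1f} GiB")
```

With these placeholder numbers an uncompressed 256K cache alone runs to tens of GiB, which is why some form of KV compression is needed before the context even begins to fit next to the weights on a 32 GB card.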
dampflokfreund@reddit
Just use q4_0 on llama.cpp. It uses rotations now and it's better than TurboQuant.
YourNightmar31@reddit
Wait, I don't understand. Q4 is still much lower quality at long context than Q8. You're saying it's better than TurboQuant? With the Tom fork we had the `turbo3` option, which was comparable to Q8 at way smaller VRAM; what's the equivalent option now? Does Q8 in llama.cpp now also use way less VRAM than Q8 did before? If so, when did this change?
lolwutdo@reddit
Any idea if it's better to use a higher-quant model with q4_0 KV, or a lower-quant model with q8_0 KV?
edsonmedina@reddit
does that also apply to Q8_0?
dampflokfreund@reddit
Yup.
edsonmedina@reddit
Do you have a link? Would love to know more.
dampflokfreund@reddit
https://github.com/ggml-org/llama.cpp/pull/21038
edsonmedina@reddit
Interesting.
This seems to reduce quant degradation, but it doesn't replace TurboQuant (i.e. long-context efficiency).
No_Algae1753@reddit
Yes me too!
fragment_me@reddit
Yeah, I've yet to see TurboQuant perplexity and KLD benchmarks beat llama.cpp q4_0. And FYI to OP: the rotation dampflok is talking about is enabled by default.
Dany0@reddit
RotorQuant was promising, but it's hit-or-miss at best IMO.
There is a philosophical/mathematical argument to be made for ternary being the most efficient way to store and compute LLMs. So in principle, anything that gets us closer to that, IF we were convinced by the argument, is good.
But for now I'll stick to attn-rot. I like it. It's like an old boot, you know it well
No_Algae1753@reddit
Since when do they use rotations?
shansoft@reddit
I am using TheTom's turbo quant variant and I can fit up to a 260K context window while running unsloth 3.6 27B UD5 with the turbo4 setting.
YourNightmar31@reddit
I'm using spiritbuun/llama-cpp-turboquant-cuda in llama-swap (https://gist.github.com/konradish/35b7f1a7e97f0b9aeb866226c5880d34) but turboquant doesn't seem to be working :(
I asked Claude Sonnet to inspect my logs and it says this:
Hytht@reddit
The FP8 model by itself takes more than 30 GB of VRAM.
WarmRestart157@reddit
Is there a significant difference between FP8 and Q4? And what about Q5 and Q6?
Hytht@reddit
Q4 is almost half the size for somewhat reduced accuracy. It's not that bad with large models. You can check the total file size on Hugging Face for each quant to get a rough idea of how much VRAM the weights will take.
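(The weight-size arithmetic behind that advice can be sketched directly. A hedged example, assuming ~4.5 effective bits/weight for Q4_0 (4-bit values plus per-block scales) and ignoring GGUF metadata overhead, so the Hugging Face file size remains the authoritative number:)

```python
def weight_vram_gib(n_params_b, bits_per_weight):
    """Rough weight-only VRAM estimate: params * bits / 8, in GiB.

    Ignores file-format overhead and runtime buffers, so treat the
    result as a lower bound.
    """
    return n_params_b * 1e9 * bits_per_weight / 8 / 2**30

for name, bpw in [("FP16", 16), ("FP8", 8), ("Q4_0 (~4.5 bpw eff.)", 4.5)]:
    print(f"27B @ {name}: ~{weight_vram_gib(27, bpw):.1f} GiB")
```

This makes the thread's numbers plausible: a 27B model at FP8 is already ~25 GiB of weights before any context, while a ~4.5 bpw quant leaves real headroom on a 32 GB card.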
Clasyc@reddit (OP)
Ahh, for some reason I had the wrong estimates in my head.
b1231227@reddit
I'm using two RTX 3060 12 GB cards. Below are my model parameters; it runs quite well, and I'm using it for coding.
@echo off
F:\ai_system\llama-cpp-turboquant-win-cuda\build\bin\llama-server.exe ^
 -m F:\ai_models\Qwen3.5\27B\Qwopus3.5-27B-v3.5-Q4_K_M.gguf ^
 -ngl 99 ^
 -ts 49,51 ^
 -c 131072 ^
 -n 8192 ^
 -b 2048 ^
 -np 1 ^
 -ctk q8_0 ^
 -ctv turbo3 ^
 -fa auto ^
 --temp 0.66 ^
 --top-k 20 ^
 --top-p 0.95 ^
 --min-p 0.0 ^
 --repeat-penalty 1.0 ^
 --repeat-last-n 64 ^
 --xtc-probability 0 ^
 --mirostat 0 ^
 --samplers "top_k;top_p;temperature" ^
 --reasoning on ^
 --reasoning-budget 1024 ^
 --seed -1 ^
 --host 0.0.0.0 ^
 --port 8000 ^
 --jinja ^
 --no-warmup
autisticit@reddit
What are your tokens/s? Thanks
b1231227@reddit
9.8 t/s
b1231227@reddit
Qwopus3.5-27B-v3.5-Q4_K_M 11.6 t/s
Ok-Internal9317@reddit
My final dream rig is now one that can run a 30B-sized dense model at FP16 at 100 tok/s generation and 1500 tok/s prompt processing.
I don't think there is any need for anything beyond that at this pace of development