FP16 on Qwen 3.6 27B
Posted by Forward_Jackfruit813@reddit | LocalLLaMA | 20 comments
Has anyone seen a notable difference between Q8 and FP16, on both the weights and the cache? I know the jump to Q8 is significant. I would test it myself, but FP16 on my setup is painfully slow.
Also, a side question: is ~14 TPS about what I should expect on a Strix Halo running 3.6 27B at Q8 during coding tasks? I have my MTP max draft set to 3, and it seems slightly better than 2, which runs at around 11 TPS.
Another side note, in case you haven't run into it: 27B is way better when context is below 100k. From my use it appears to finish specifically above 100k, which was causing my issues initially.
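For the draft-length question above, here is a rough sketch of why a draft of 3 can beat a draft of 2. It assumes each drafted token is accepted independently with the same probability, which is a simplification, and it ignores the cost of producing the draft. The acceptance rates are hypothetical, not measured on this model.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumes every drafted token is accepted independently with probability
    `accept_rate` (a simplification), and that the target model always
    contributes one token of its own after the accepted prefix.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    # 1 + a + a^2 + ... + a^draft_len
    return sum(accept_rate ** i for i in range(draft_len + 1))


for a in (0.5, 0.7, 0.9):  # hypothetical acceptance rates
    two = expected_tokens_per_pass(a, 2)
    three = expected_tokens_per_pass(a, 3)
    print(f"accept={a:.1f}: draft 2 -> {two:.2f}, draft 3 -> {three:.2f}")
```

The gain from a longer draft shrinks as the acceptance rate drops, so the best setting likely depends on the workload.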
Look_0ver_There@reddit
For the weights, Q8_0 is generally fine quality-wise. If you want a better middle ground without going full F16 for the weights, use the Unsloth Q8_K_XL quantization, as this keeps the individual weight blocks that need higher precision at F16 instead of making them all Q8. It's kind of the best of both worlds in that way.
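To make the Q8_0 idea concrete, here is a minimal sketch of block quantization in that style: 32 weights share one half-precision scale, and each weight is stored as a signed 8-bit integer. It illustrates the scheme only and is not llama.cpp's actual code; the random weights are stand-ins.

```python
import random
import struct

BLOCK_SIZE = 32  # weights per block in the Q8_0 layout


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_block(block):
    """Return (scale, quants) for one block of floats."""
    amax = max(abs(v) for v in block)
    scale = to_fp16(amax / 127.0)  # the scale itself is stored as fp16
    if scale == 0.0:
        return 0.0, [0] * len(block)
    # Clamp because rounding the scale to fp16 can push v/scale just past 127.
    quants = [max(-127, min(127, round(v / scale))) for v in block]
    return scale, quants


def dequantize_block(scale, quants):
    return [scale * q for q in quants]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]  # stand-in data
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.6g}, worst round-trip error={worst:.3g}")
```

Each block costs 32 bytes of integers plus 2 bytes of scale, which works out to 8.5 bits per weight.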
For the KV cache, though, you absolutely want to keep that at FP16 (the default) for best results. Try experimenting with either F16 or BF16. Some software runs them at equal speeds, while some may run BF16 slower; it varies case by case.
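To see what a 16-bit KV cache costs in memory, here is a back-of-the-envelope sketch. The layer count, KV head count, and head size below are hypothetical placeholders, not the real Qwen 3.6 27B configuration; read the actual values from the model's metadata. Only layers that keep a standard attention cache should be counted.

```python
F16_BYTES = 2.0          # bytes per element for F16 or BF16
Q8_0_BYTES = 34.0 / 32   # 32 int8 values plus one fp16 scale per block


def kv_cache_bytes(n_ctx, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Size of the K and V caches for one sequence of n_ctx tokens."""
    per_token = 2 * n_attn_layers * n_kv_heads * head_dim  # K and V
    return per_token * n_ctx * bytes_per_elem


# Hypothetical architecture values, for illustration only.
cfg = dict(n_attn_layers=16, n_kv_heads=4, head_dim=256)

for n_ctx in (32_000, 100_000, 200_000):
    f16 = kv_cache_bytes(n_ctx, bytes_per_elem=F16_BYTES, **cfg)
    q8 = kv_cache_bytes(n_ctx, bytes_per_elem=Q8_0_BYTES, **cfg)
    print(f"ctx={n_ctx:>7}: f16 {f16 / 2**30:.2f} GiB, q8_0 {q8 / 2**30:.2f} GiB")
```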
Forward_Jackfruit813@reddit (OP)
Thank you for the information, I will try 16. Are you using ROCm or Vulkan? I'm on Vulkan myself.
kant12@reddit
If you're building llama.cpp yourself and want to give ROCm 7.13 a try, I'd suggest giving these build options a shot and setting ROCBLAS_USE_HIPBLASLT=1 before starting llama-server. For my work, I'd say ROCm is significantly better at this point.
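As a small sketch of the launch step only (the build options referred to above are not reproduced here), the environment variable can be set for the server process like this. The model path and context size are hypothetical placeholders, and the snippet only builds the command rather than running it.

```python
import os


def build_launch(model_path: str, n_ctx: int = 65536):
    """Return (command, environment) for starting llama-server under ROCm."""
    env = dict(os.environ)
    env["ROCBLAS_USE_HIPBLASLT"] = "1"  # the setting suggested in the comment
    cmd = [
        "llama-server",
        "-m", model_path,          # hypothetical path to a GGUF file
        "-c", str(n_ctx),          # context size
        "--cache-type-k", "f16",   # keep the KV cache at 16-bit
        "--cache-type-v", "f16",
    ]
    return cmd, env


cmd, env = build_launch("/models/example-q8_0.gguf")  # placeholder path
print(" ".join(cmd))
# To actually start the server: subprocess.run(cmd, env=env)
```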
Look_0ver_There@reddit
I generally run Vulkan too. Lately though (as in the last few days) llama.cpp has had a number of good ROCm performance-boosting PRs merged, and it's no longer as cut and dried as it once was. I don't have a definitive answer yet as to which is better on the Strix Halo, as I've been focusing on tuning for my R9700s, for which ROCm is now almost always better than Vulkan, and that's something that happened 3 days ago.
Grab the Lemonade pre-built ROCm images for the Strix Halo and compare them to the standard llama.cpp Vulkan versions and see which works best. Lately that story seems to be changing almost daily.
Ok_Needleworker_6431@reddit
Q8 vs FP16 on weights — don't bother. Q8 is within spitting distance of FP16 on perplexity for anything real. The cliff is below Q4, not above Q8. You're doubling memory and halving speed for a difference you can't measure.
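The "doubling memory" point can be checked with simple arithmetic. This sketch assumes exactly 27 billion parameters and uniform storage at 16 bits (FP16) or 8.5 bits (Q8_0, counting the per-block scale); real GGUF files mix tensor types, so actual sizes will differ somewhat.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a uniform bit width."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 27e9  # assumed from the "27B" name; the exact count may differ

fp16 = weight_gib(N_PARAMS, 16.0)
q8_0 = weight_gib(N_PARAMS, 8.5)
print(f"FP16: {fp16:.1f} GiB, Q8_0: {q8_0:.1f} GiB, ratio {fp16 / q8_0:.2f}x")
```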
ai_without_borders@reddit
for coding specifically the calibration matters more than raw bit width. unsloth q8_k_xl holds up because it was calibrated on code. generic imatrix q8_0 without code in the calibration can actually underperform a good q6. kv cache is a separate question; fp16 there does help for long contexts where accumulated error in attention patterns starts to show. the mtp observation is interesting too, accepting speculative draft variance changes the math on base weight precision
Blues520@reddit
I've also been wondering this. I'm running dual 3090's and wondering if it would be worth picking up another one or two to run at FP16.
sleepingsysadmin@reddit
Alex Ziskind just made this video,
Better graph though:
#cant post pictures or links??? what
Essentially, unsloth maintains accuracy the best.
The jury is still out on the newer stuff like QAT and AutoRound.
>Also side question, is ~14TPS around the number I should be expecting on a Strix Halo running 3.6 27B at Q8 during coding tasks?
That's a separate issue. Strix Halo doesn't do dense models well. That's expected. You probably want to go to 8 or 16b 35b.
You are eagerly awaiting a ~122b model that jumps these models forward.
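A rough way to see why dense models are slow here: during decoding, every generated token has to stream all active weights from memory, so memory bandwidth sets a ceiling on tokens per second. The sketch below assumes a nominal 256 GB/s for the Strix Halo and roughly 3B active parameters for the 35B A3B model (inferred from its name); both are assumptions, and real throughput will be lower than the ceiling.

```python
def decode_tps_ceiling(bandwidth_gb_s: float,
                       active_params_billion: float,
                       bits_per_weight: float) -> float:
    """Upper bound on tokens/s when one pass reads all active weights once."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH = 256.0  # GB/s, assumed nominal figure; check your hardware

dense = decode_tps_ceiling(BANDWIDTH, 27.0, 8.5)  # dense 27B at Q8_0
moe = decode_tps_ceiling(BANDWIDTH, 3.0, 8.5)     # ~3B active (assumed)
print(f"dense 27B ceiling: {dense:.1f} tok/s, 3B-active MoE ceiling: {moe:.1f} tok/s")
```

Speculative decoding such as MTP can go past the single-token ceiling, because one pass over the weights can confirm several drafted tokens at once.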
>Another side note, in case you haven't run into it: 27B is way better when context is below 100k. From my use it appears to finish specifically above 100k, which was causing my issues initially.
All models slow down at higher context. DeepSeek and, allegedly, MiniMax M3 are going to change this. I expect the frontier closed labs handle this well too. Not meaningful to you.
The thing you aren't taking into account:
Even if you have 200,000 context, smaller models are silently crashing out on these.
Minimax 2.7 or qwen3.6 27b has 200,000 context, but it's forgetting about 30% of that context at those longer sizes.
GPT 120b, it's more like 50%. GPT 20b is more like 70%.
Newer attention techniques are getting better, but realistically, just because you have 256k context doesn't mean you can really use it.
mindwip@reddit
What do you recommend as max? 64k or 100k with these 27b or 35b qwen?
Long_comment_san@reddit
Jeez, can we please have something something HBM4 64gb GPU for under 3000$ so we can help each other on reddit? It's not that much to ask
ziphnor@reddit
I think the only correct answer to this is some actual benchmarks; there is too much "gut feeling" when it comes to comparing models and quants 😄
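One common benchmark for comparing quants is perplexity on the same text. As a minimal sketch, given per-token log-probabilities from each quant, perplexity is the exponential of the negative mean log-probability. The numbers below are made up for illustration; how you obtain real log-probabilities depends on your runtime.

```python
import math


def perplexity(token_logprobs):
    """exp(-mean log-probability) over a sequence of natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up log-probabilities for the same text under two quants.
logprobs_fp16 = [-1.20, -0.35, -2.10, -0.80]
logprobs_q8 = [-1.21, -0.36, -2.12, -0.80]

ppl_fp16 = perplexity(logprobs_fp16)
ppl_q8 = perplexity(logprobs_q8)
print(f"FP16 ppl {ppl_fp16:.4f}, Q8 ppl {ppl_q8:.4f}, "
      f"relative change {100 * (ppl_q8 / ppl_fp16 - 1):.2f}%")
```

Perplexity does not capture everything that matters for coding agents, so task-level tests on the same prompts are a useful complement.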
Evgeny_19@reddit
I did notice a difference between bf16 and q8 on 35b a3b. It's very evident even on a chess test that was posted here not so long ago. On 27b even ud_q6_k_xl looks very good for my tasks. BF16 is just too slow to give it a proper run.
Its_Powerful_Bonus@reddit
With MTP, the difference between quants is not so massive 👌
Demonicated@reddit
As a coding agent it absolutely makes a difference. I will drop down to q8 for text analysis tasks of smaller size, but otherwise I run the full thing.
StableLlama@reddit
The jump from FP16 to Q8 is usually seen as neglectable, quality wise.
It's the smaller quants where the quality is changing, with a good Q4 often still being acceptable. Below Q4 is where differences are getting noticeable.
Bulky-Priority6824@reddit
Negligible*
StableLlama@reddit
Damn, now it's debunked that I'm not an AI agent
Bulky-Priority6824@reddit
I didn't want to do it but I said hello he's probably way better at math than me lol
Xp_12@reddit
neglectigible*
Herr_Drosselmeyer@reddit
In Q8, a weight can have 256 different states; in FP16, it can have 65,536 different states.
So you have a lot more granularity. How much does this help? Hard to say. Most people would say that Q8 is good enough, and I tend to agree.
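The state counts above are just the number of bit patterns, which is easy to verify. The sketch also counts how many FP16 patterns are finite numbers, since some patterns encode infinity or NaN.

```python
import math
import struct


def fp16_from_bits(bits: int) -> float:
    """Interpret a 16-bit pattern as an IEEE half-precision float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


int8_states = 2 ** 8
fp16_patterns = 2 ** 16
fp16_finite = sum(
    1 for bits in range(fp16_patterns) if math.isfinite(fp16_from_bits(bits))
)

print(f"8-bit states: {int8_states}")
print(f"16-bit patterns: {fp16_patterns}, of which finite: {fp16_finite}")
```

Note that the comparison is not quite like for like: Q8_0 spreads its integer levels evenly within each block and rescales per block, while FP16 values are spaced more densely near zero.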