FP16 on Qwen 3.6 27B

Posted by Forward_Jackfruit813@reddit | LocalLLaMA | 20 comments

Has anyone seen a notable difference between Q8 and FP16, on both the weights and the cache? I know the jump to Q8 is significant. I would test it myself, but FP16 on my setup is painfully slow.

Also, a side question: is ~14 TPS around the number I should be expecting on a Strix Halo running 3.6 27B at Q8 during coding tasks? I have my MTP max draft set to 3, and it seems slightly better than 2, which runs at around ~11 TPS.
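To make the draft-length comparison concrete, here is a minimal sketch of the usual back-of-the-envelope model for speculative decoding. It assumes each drafted token is accepted independently with the same probability, which is a simplification, and the acceptance rates in the loop are hypothetical values rather than measurements from this setup. The function name is made up for illustration.

```python
def expected_tokens_per_step(accept_rate: float, max_draft: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Assumption: every drafted token is accepted independently with
    probability `accept_rate`, and acceptance stops at the first rejection.
    The verification pass itself always yields one token.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    # Geometric sum: 1 + a + a^2 + ... + a^max_draft
    return sum(accept_rate ** i for i in range(max_draft + 1))


if __name__ == "__main__":
    for accept_rate in (0.5, 0.7, 0.9):  # hypothetical acceptance rates
        for max_draft in (2, 3):
            tokens = expected_tokens_per_step(accept_rate, max_draft)
            print(f"accept={accept_rate} draft={max_draft} -> {tokens:.3f} tokens/pass")
```

The sketch ignores the cost of producing the draft tokens, so it only shows why a longer draft can help when acceptance is high; whether 3 actually beats 2 on a given machine still has to be measured.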

Another side note, in case you haven't run into it: 27B is way better when context is below 100k. From my use, it appears to finish specifically above 100k, which was causing my issues initially.

Look_0ver_There@reddit

For the weights, Q8_0 is generally fine quality-wise. If you want a better middle ground without going full F16 for the weights, use the Unsloth Q8_K_XL quantization, as this keeps the individual weight blocks that need higher precision at F16 instead of making them all Q8. It's kind of the best of both worlds in that way.
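As a rough illustration of why Q8_0 loses so little, here is a simplified sketch of the block scheme: 32 weights share one scale, and each weight is rounded to a signed 8-bit integer. This is not llama.cpp's actual code; the real format stores the scale as a 16-bit float, which adds a small extra rounding that this sketch leaves out, and the function names and sample weights are made up.

```python
import random

BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32


def quantize_block_q8_0(block):
    """Return (scale, int8-range values) for one block of floats."""
    amax = max(abs(x) for x in block)
    scale = amax / 127.0
    if scale == 0.0:
        # All-zero block: nothing to scale.
        return 0.0, [0] * len(block)
    return scale, [round(x / scale) for x in block]


def dequantize_block(scale, quants):
    """Reconstruct approximate floats from the scale and integer values."""
    return [scale * q for q in quants]


def max_abs_error(block):
    """Largest round-trip error in the block, plus the scale that was used."""
    scale, quants = quantize_block_q8_0(block)
    restored = dequantize_block(scale, quants)
    return max(abs(a - b) for a, b in zip(block, restored)), scale


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical weights: small values centred on zero.
    weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]
    error, scale = max_abs_error(weights)
    print(f"scale={scale:.6g} max_error={error:.6g} bound={scale / 2:.6g}")
```

The round-trip error per weight is bounded by half the block scale, so a block with one large outlier is where the coarser steps hurt. That is the kind of block a mixed scheme would keep at higher precision.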

For the KV cache, though, you absolutely want to keep that at FP16 (the default) for best results. Try experimenting with either F16 or BF16. Some software runs them at equal speeds, while some may run BF16 slower. It varies case by case.

Forward_Jackfruit813@reddit (OP)

Thank you for the information, I will try 16. Are you using ROCm or Vulkan? I'm on Vulkan myself.

kant12@reddit

If you're building llama.cpp yourself and want to give ROCm 7.13 a try, I'd suggest giving these build options a shot, and setting ROCBLAS_USE_HIPBLASLT=1 before starting llama-server. For my work, I'd say ROCm is significantly better at this point.

-DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=OFF -DGGML_CUDA_FORCE_CUBLAS=OFF -DGGML_BMI2=ON -DGGML_FMA=ON -DGGML_F16C=ON

Look_0ver_There@reddit

I generally run Vulkan too. Lately though (as in the last few days), llama.cpp has had a number of good ROCm performance-boosting PRs merged, and it's no longer as cut and dried as it once was. I don't have a definitive answer for you yet as to which is better on the Strix Halo, as I've been focusing on tuning for my R9700s, for which ROCm is now almost always better than Vulkan, and that only happened 3 days ago.

Grab the Lemonade pre-built ROCm images for the Strix Halo and compare them to the standard llama.cpp Vulkan versions and see which works best. Lately that story seems to be changing almost daily.

Ok_Needleworker_6431@reddit

Q8 vs FP16 on weights: don't bother. Q8 is within spitting distance of FP16 on perplexity for anything real. The cliff is below Q4, not above Q8. You're roughly doubling memory and halving speed for a difference you can't measure.
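The memory side of that trade-off can be checked with simple arithmetic. The sketch below assumes a nominal 27 billion parameters and uniform quantization of every tensor. Real GGUF files mix tensor types and carry metadata, so actual file sizes will differ somewhat.

```python
PARAMS = 27e9  # nominal parameter count taken from the "27B" name

F16_BITS_PER_WEIGHT = 16.0
# Q8_0 stores 32 int8 values plus one 16-bit scale per block:
# (32 * 8 + 16) bits / 32 weights = 8.5 bits per weight.
Q8_0_BITS_PER_WEIGHT = (32 * 8 + 16) / 32


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    f16 = weight_gb(PARAMS, F16_BITS_PER_WEIGHT)
    q8 = weight_gb(PARAMS, Q8_0_BITS_PER_WEIGHT)
    print(f"F16:  {f16:.1f} GB")
    print(f"Q8_0: {q8:.1f} GB")
    print(f"ratio: {f16 / q8:.2f}x")
```

Under these assumptions F16 comes to about 54 GB against roughly 28.7 GB for Q8_0, so "doubling" is a fair approximation (about 1.9x).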

ai_without_borders@reddit

for coding specifically the calibration matters more than raw bit width. unsloth q8_k_xl holds up because it was calibrated on code. generic imatrix q8_0 without code in the calibration can actually underperform a good q6. kv cache is a separate question; fp16 there does help for long contexts where accumulated error in attention patterns starts to show. the mtp observation is interesting too, accepting speculative draft variance changes the math on base weight precision

Blues520@reddit

I've also been wondering about this. I'm running dual 3090s and wondering if it would be worth picking up another one or two to run at FP16.

sleepingsysadmin@reddit

Alex Ziskind just made a video on this, and there's a better graph elsewhere, but apparently I can't post pictures or links here.

Essentially, Unsloth maintains accuracy the best.

The jury is still out on the newer stuff like QAT and AutoRound.

>Also, a side question: is ~14 TPS around the number I should be expecting on a Strix Halo running 3.6 27B at Q8 during coding tasks?

That's a separate issue. Strix Halo doesn't do dense models well, so that's expected. You probably want to go to the 35B at 8 or 16 bit instead.
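A rough way to see why dense models are slow here: during decoding, every generated token has to read all of the weights once, so memory bandwidth sets a ceiling on tokens per second. The sketch below assumes a theoretical peak of 256 GB/s for Strix Halo and about 28.7 GB of Q8_0 weights for a 27B model. Both figures are assumptions for illustration, not measurements, and real throughput lands below the ceiling.

```python
def decode_tps_upper_bound(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Ceiling on tokens/s when each token requires one full read of the weights.

    Ignores KV cache reads, compute time, and any speculative decoding.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    ASSUMED_BANDWIDTH_GB_S = 256.0  # assumed theoretical peak, not measured
    ASSUMED_WEIGHTS_GB = 28.7       # assumed: 27e9 params at 8.5 bits/weight
    bound = decode_tps_upper_bound(ASSUMED_BANDWIDTH_GB_S, ASSUMED_WEIGHTS_GB)
    print(f"upper bound without speculative decoding: {bound:.1f} tok/s")
```

Under these assumptions the ceiling is roughly 8.9 tokens per second per weight pass. Getting ~14 TPS with MTP drafting is consistent with that, since speculative decoding can emit more than one token per pass over the weights.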

What you're really waiting for is a ~122B model that moves these models forward.

>Another side note, in case you haven't run into it: 27B is way better when context is below 100k. From my use, it appears to finish specifically above 100k, which was causing my issues initially.

All models slow down at higher context. DeepSeek and, allegedly, MiniMax M3 are going to change this. I expect the frontier closed labs handle this well too, but that's not meaningful to you.

The thing you aren't taking into account: even if you have 200,000 context, smaller models are silently crashing out at those lengths.

MiniMax 2.7 or Qwen3.6 27B has 200,000 context, but it's forgetting about 30% of that context at those longer sizes.

For GPT 120b, it's more like 50%. For GPT 20b, it's more like 70%.

Newer attention techniques are getting better, but realistically, just because you have 256k context doesn't mean you can really use it.
