Qwen3.6-27B Quantization Benchmark

Posted by bobaburger@reddit | LocalLLaMA | 74 comments

Hi everyone! This is my attempt to benchmark and compare the quality of some of the well-known Qwen3.6 27B quantizations on HuggingFace (unsloth, mradermacher, and the IQ4\_XS quants from cHunter789 and Ununnilium), from Q8 all the way down to Q2.

# Measurement method

I'm using llama.cpp's `llama-perplexity` to measure the **mean KLD** and **Same Top P percentage** between each quantized model and the base (BF16) version. All runs used the same context length of 8192 tokens, with the KV cache quantized to q8\_0 so that the entire model fits in the GPU.

# Understand KLD and Same Top P

To understand the test results, it helps to understand the difference between the two metrics I used.

When an LLM predicts the next word of a given prompt, for example **"Today I will do my"**, it looks at its entire vocabulary and assigns a confidence score to every single token. It then samples from the top tokens and picks the final one, based on the given temperature.

* **KL Divergence (KLD)** measures how much the confidence distribution of the quantized model drifts away from the base. In this example, the base model might assign 90% confidence to "homework", 5% to "bike" and 1% to "banana", while a poorly quantized one might give 50% to "homework", 30% to "bike" and 20% to "banana".
* **Same Top P** tracks how often the quantized model picks the same top token as the base model. In this example, both models would still pick "homework" as the next token for the prompt.

So, while you might get a good token choice with the quantized model (**Same Top P** is high), it's important to look at the **Mean KLD** to see how stable the underlying probabilities of the model are. The lower, the better.

# Benchmark result

# Unsloth's quantization

https://preview.redd.it/awcfprb5744h1.png?width=3600&format=png&auto=webp&s=3ac8937eeac49b6b4d3920cd2b4b52e99a25e269

Nothing special, higher quants are better than lower quants. Q6 to Q8 are pretty much lossless.
You can see that Q8\_0 has a higher **Same Top P**, but underneath, the **Mean KLD** tells us that UD-Q8\_K\_XL is better. Anything below Q4 is for the desperate, like the 5060 Ti 16GB club.

The 4-bit cluster is a bit more interesting. Different people may have different takes on this, but to me, Q4\_K\_XL is a good quality compromise if you can afford the VRAM. If you're tight, IQ4\_XS could serve you well, and IQ4\_NL is not much different. In that case, there's no need to stretch for Q4\_K\_M. You can skip Q4\_K\_S.

From Q3\_K\_XL down, the quality degradation is more drastic. The KLD is above 0.1 everywhere and the matching token selection drops to 90-85%, which says a lot about the instability.

# mradermacher's and other quants

I've seen people mention mradermacher's i1 quants here and there, and also the IQ4\_XS quants from cHunter789 and Ununnilium. I have personally been using Ununnilium's IQ4\_XS for a while now, so I wanted to put them all on the same table to see how they fit.

A single diagram would not be enough, so I broke them into 4 groups: Q8-Q6, Q5, Q4, and Q3 and below.

# 8-bit and 6-bit quantization

https://preview.redd.it/6om7k1x6744h1.png?width=1600&format=png&auto=webp&s=28c6b79b867976de16a01b39b5dd20d422d77762

mradermacher's Q6\_K seems to edge out Unsloth's Q6\_K on mean KLD here (0.027352 vs 0.027766), with a 97.011% token selection match (Unsloth's Q6\_K is at 97.112%).

# 5-bit quantization

https://preview.redd.it/j7cs0cs7744h1.png?width=1600&format=png&auto=webp&s=8a8ba0e99a2c275034de0d7ebb357c1adfbed7cd

In this group, Unsloth is the winner. With only about 300-500MB difference in size, you can skip Q5\_K\_S and go for Q5\_K\_M. Unsloth's Q5\_K\_M is better in both matching token selection and KLD.

# 4-bit quantization

https://preview.redd.it/ywleki49744h1.png?width=3300&format=png&auto=webp&s=5db6b1d3899171afad5093557f849539332ea33d

Unsloth beats all of the 4-bit quants here.
But if you are looking for alternative quants to save VRAM, for example on a 16GB card, pay attention to IQ4\_XS (it will help, but of course you will not be able to go above a 65k context window). mradermacher's IQ4\_XS is the winner among the other (non-Unsloth) IQ4\_XS quants, but at 15.1 GB it would be a bit tight. cHunter's IQ4\_XS is also very good at 14.7 GB.

# 3-bit and below

https://preview.redd.it/fgjixv7a744h1.png?width=3300&format=png&auto=webp&s=45d85e85e57cfb7da11fbff2b5f4172634e20a1e

Again, mradermacher's quants fill in the gaps between Unsloth's quants here, so you get a bit more choice, but tbh, at this range you're better off with Unsloth's Q3\_K\_XL or at least Q3\_K\_M.

I was very interested to see how some newer quants like IQ3\_S and IQ3\_M perform, but they turned out a bit disappointing.

# Raw benchmark data

If you are interested, here's the raw benchmark data table from all the runs.

|Quantization|Mean PPL(Q)|Mean KLD|RMS Δp (%)|Same top p (%)|
|:-|:-|:-|:-|:-|
|UD-Q8\_K\_XL|6.569706|0.015495|2.448|97.407|
|Q8\_0|6.567807|0.020497|2.701|97.753|
|UD-Q6\_K\_XL|6.541421|0.023398|2.903|97.436|
|mradermacher/Q6\_K|6.541627|0.027352|3.045|97.011|
|Q6\_K|6.566514|0.027766|3.014|97.112|
|UD-Q5\_K\_XL|6.625155|0.045526|4.021|96.187|
|Q5\_K\_M|6.658295|0.05277|4.26|95.864|
|mradermacher/Q5\_K\_M|6.630279|0.053246|4.372|95.664|
|mradermacher/Q5\_K\_S|6.613859|0.055034|4.476|95.505|
|Q5\_K\_S|6.652629|0.055888|4.414|95.674|
|UD-Q4\_K\_XL|6.647006|0.06656|5.023|94.621|
|Q4\_K\_M|6.672841|0.070345|5.334|94.228|
|IQ4\_NL|6.619131|0.071724|5.497|94.106|
|IQ4\_XS|6.61994|0.072223|5.481|94.016|
|mradermacher/IQ4\_XS|6.611545|0.073705|5.648|93.852|
|mradermacher/Q4\_K\_M|6.685347|0.074124|5.507|94.08|
|cHunter/IQ4\_XS-i1|6.656157|0.075933|5.645|93.77|
|Q4\_K\_S|6.690623|0.078947|5.72|93.833|
|mradermacher/Q4\_K\_S|6.642023|0.080407|5.825|93.657|
|Ununnilium/IQ4\_XS-pure|6.765894|0.084115|6.127|92.407|
|UD-Q3\_K\_XL|6.620281|0.105386|7.077|91.837|
|Q3\_K\_M|6.453757|0.129404|7.893|90.437|
|mradermacher/Q3\_K\_L|6.482496|0.136127|8.116|90.213|
|mradermacher/Q3\_K\_M|6.481299|0.140487|8.424|89.934|
|mradermacher/IQ3\_XS|6.981601|0.161364|9.182|88.767|
|UD-IQ3\_XXS|6.994512|0.176688|9.626|87.953|
|mradermacher/IQ3\_S|7.405328|0.176782|9.637|88.689|
|Q3\_K\_S|7.068685|0.178631|9.61|87.681|
|mradermacher/IQ3\_M|7.454224|0.180647|9.824|88.603|
|mradermacher/Q3\_K\_S|6.910989|0.181172|9.82|87.422|
|UD-Q2\_K\_XL|7.316461|0.229068|11.399|85.95|
|UD-IQ2\_M|7.468708|0.241252|11.91|85.319|
|UD-IQ2\_XXS|8.507239|0.40986|16.708|78.483|

There are many more Qwen3.6 27B quantizations on HuggingFace, like the ones from bartowski, huihui, etc., but within my time budget (not money budget, since I'm basically using modal.com's free monthly credit :P), I cannot benchmark them all.

If you are interested in doing your own benchmark, I also attached the script in my original blog post, so you can run it on your own. See it here: [https://www.huy.rocks/everyday/05-29-2026-ai-qwen3-6-27b-quantization-benchmark](https://www.huy.rocks/everyday/05-29-2026-ai-qwen3-6-27b-quantization-benchmark)

Would love to see the results if any of you decide to run it on your own. Thanks for reading this far!
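
For anyone who wants to see the two metrics from the "Understand KLD and Same Top P" section in numbers, here is a minimal Python sketch using the toy "homework / bike / banana" probabilities from that example. It is only an illustration, not the code `llama-perplexity` runs. It makes a few assumptions: the base probabilities (90%, 5%, 1%) are renormalized over just these three tokens so they sum to 1, the divergence is taken as KL(base || quant) with the natural log, and the helper names `kl_divergence` and `same_top` are made up for this sketch.

```python
import math

def kl_divergence(p_base, p_quant):
    """KL(base || quant) in nats over a shared token list (assumed direction)."""
    return sum(p * math.log(p / q) for p, q in zip(p_base, p_quant) if p > 0)

def same_top(p_base, p_quant):
    """True when both distributions rank the same token first."""
    top_base = max(range(len(p_base)), key=p_base.__getitem__)
    top_quant = max(range(len(p_quant)), key=p_quant.__getitem__)
    return top_base == top_quant

tokens = ["homework", "bike", "banana"]

# Toy numbers from the post. The base values only sum to 0.96, so they are
# renormalized over these three tokens (an assumption made for this sketch).
base_raw = [0.90, 0.05, 0.01]
total = sum(base_raw)
base = [p / total for p in base_raw]
quant = [0.50, 0.30, 0.20]

kld = kl_divergence(base, quant)
top_match = same_top(base, quant)
print(f"KLD = {kld:.4f} nats, same top token: {top_match}")
```

With these toy numbers the top token still matches, yet the KLD comes out at roughly 0.47, which is worse than every row in the table above. That is the point of looking at both metrics: the "right" token can still win while the distribution behind it has drifted a lot. The reported Mean KLD and Same Top P are averages of these per-position values over the whole test text.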

74 Comments

soyalemujica@reddit

It wouldn't hurt to see AutoRound models in this comparison as well

andrerom@reddit

Would be super helpful to see how the hardware-native bit/float formats fare in there for comparison, notably fp8, int8, mxfp6, mxfp4 and nvfp4.

inrea1time@reddit

I second nvfp4 (even gguf), curious to see how it compares to Q8.

rpkarma@reddit

Badly, unless it’s quantised carefully and post-trained via QAD

andrerom@reddit

For me int8 (m5), mxfp6 (google, amd, rumored m6) and nvfp4 (blackwell)

starkruzr@reddit

> rumored m6

I haven't heard anything about this?

def_not_jose@reddit

https://www.reddit.com/r/LocalLLM/s/4PbVL3kmKL Actual intelligence tests for some quants, IQ4_NL seems to be pretty good

audioen@reddit

That is clearly using a saturated benchmark test. I personally find any form of 4-bit Qwen3.6 unusable for actual work, 5-bit makes strange mistakes, 6-bit seems to be quite bad at translating to my niche mother tongue (Finnish), and so I find that only at 8-bit is the model working seemingly properly at all tasks I am using it for. If I'm telling it to design and implement something, and it has to define string constants in UI, and then translate these, I don't want to come back looking at barely intelligible gibberish in the UI, but fluent language. I used to run Aman Gupta's Q8\_0 for a while, and I'm now testing UD-Q8\_K\_XL because I know it's supposed to be slightly better still. I think anyone thinking the model is "good" at 4 bits hasn't really been able to evaluate it at 8 bits, and it is possibly still slightly worse than bf16 is. After all, these charts are showing 2-3 % top token choice difference, so every 25 tokens or so the model then likely differs from what the original would have said. (Assuming that I am interpreting the presented charts correctly.)
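
As a rough check on the "every N tokens" reading of the Same Top P column, here is a small Python sketch. It assumes mismatches are independent from token to token, so the expected gap between them is simply 1 / (1 - match rate). That is a simplification: the benchmark compares the models on a fixed text, and in free generation a single different token changes everything after it. The function name is made up for this sketch; the percentages are taken from the raw data table in the post.

```python
def expected_tokens_between_mismatches(same_top_pct):
    """Expected gap between top-token mismatches, assuming independent positions."""
    mismatch_rate = 1.0 - same_top_pct / 100.0
    return 1.0 / mismatch_rate

# Same top p (%) values copied from the post's raw benchmark table.
same_top = {
    "UD-Q8_K_XL": 97.407,
    "Q8_0": 97.753,
    "UD-Q4_K_XL": 94.621,
    "UD-IQ2_XXS": 78.483,
}

gaps = {name: expected_tokens_between_mismatches(pct) for name, pct in same_top.items()}
for name, gap in gaps.items():
    print(f"{name}: top token differs about once every {gap:.0f} tokens")
```

Under that assumption the 8-bit quants land closer to one differing top token every 40 tokens or so than every 25, while the 4-bit ones are nearer one in 20.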

voyager256@reddit

Not everyone uses the 27b model for the same applications/tasks as you. For many, a good Q6_K would be perfectly fine. It's mostly suited for coding assistance and tool calling. Not everyone uses it for niche-language translation, you know?

jopereira@reddit

Who would be running at Q4 if they could run it at BF16? That's a moot discussion, imo. I run it at IQ3 XXS and it is *very good* for what I do (which is also real work xD). But I didn't choose IQ3 XXS on a whim. I chose it because it's the best I can run on my hardware (16GB VRAM), and no other model thinks through and solves my coding problems better than this 27B. And it's fast too! Another problem is knowing whether the token predicted by the BF16 model is actually better than the IQ3 (or Q4, Q6,...) one in every single case. That would require a better benchmark system.
