Updated Qwen3.5-9B Quantization Comparison
Posted by TitwitMuffbiscuit@reddit | LocalLLaMA | View on Reddit | 102 comments
This is a KLD eval across community GGUF quants of Qwen3.5-9B, comparing mean KLD to the BF16 baseline.
The goal is to give people a data-driven basis for picking a file rather than just grabbing whatever is available.
KLD (KL Divergence): "Faithfulness." It shows how much the quantized model's probability distribution drifts from a baseline (the probability distribution of the original weights). Lower = closer.
PPL (Perplexity): Used to measure the average uncertainty of the model when predicting the next token. It is derived from the total information loss (Cross Entropy). Lower = more confident.
They are correlated: perplexity measures total error, while KLD measures error relative to the baseline (analogous to tracking routing drift in an MoE model). This relationship helps in estimating information loss (or gain, when training). Since the goal here is to see how much information a quant loses, and since PPL is noisy (a quant can score better by pure luck on a given dataset), KLD is the better metric: it is anchored to the baseline model rather than to the dataset.
If you need the most faithful quant, pick the one with the lowest KLD.
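For anyone curious how the two numbers are computed, here is a minimal sketch, assuming you already have per-token log-probabilities from the BF16 baseline and from the quant (llama-perplexity produces these for you; the file names below are made up for illustration):

```python
import numpy as np

# Hypothetical inputs: per-token log-probabilities over the vocab,
# from the BF16 baseline and from the quantized model.
base_logprobs = np.load("baseline_logprobs.npy")   # shape: (n_tokens, vocab_size)
quant_logprobs = np.load("quant_logprobs.npy")     # same shape
token_ids = np.load("token_ids.npy")               # shape: (n_tokens,), the actual next tokens

# Mean KL divergence of the quant from the baseline, averaged over positions:
# KL(P_base || P_quant) = sum_v P_base(v) * (log P_base(v) - log P_quant(v))
p_base = np.exp(base_logprobs)
kld_per_token = np.sum(p_base * (base_logprobs - quant_logprobs), axis=-1)
mean_kld = kld_per_token.mean()

# Perplexity of the quant on the eval text: exp of the mean negative
# log-likelihood of the actual next tokens (i.e. exp of the cross entropy).
nll = -quant_logprobs[np.arange(len(token_ids)), token_ids]
ppl = np.exp(nll.mean())

print(f"Mean KLD: {mean_kld:.6f}  PPL: {ppl:.4f}")
```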
[Plot: Size vs KLD for all quants. That is a dense plot, sorry about that.]
KLD RANKINGS
Bold: KLD score < 0.01. Lower is better.
| Quantization | Size (GiB) | PPL | KLD |
|---|---|---|---|
| eaddario/Qwen3.5-9B-Q8_0 | 8.873 | 19.177240 | 0.001198 |
| unsloth/Qwen3.5-9B-UD-Q8_K_XL | 12.083 | 19.183966 | 0.001243 |
| bartowski/Qwen_Qwen3.5-9B-Q8_0 | 8.89 | 19.184374 | 0.001405 |
| lmstudio-community/Qwen3.5-9B-Q8_0 | 8.873 | 19.184470 | 0.001410 |
| ZeroWw/Qwen3.5-9B.q8_p | 8.873 | 19.189372 | 0.001412 |
| unsloth/Qwen3.5-9B-Q8_0 | 8.873 | 19.175181 | 0.001433 |
| AaryanK/Qwen3.5-9B.q8_0 | 8.873 | 19.177790 | 0.001445 |
| DevQuasar/Qwen.Qwen3.5-9B.Q8_0 | 8.873 | 19.186216 | 0.001464 |
| ZeroWw/Qwen3.5-9B.q8_0 | 10.649 | 19.188892 | 0.001679 |
| unsloth/Qwen3.5-9B-UD-Q6_K_XL | 8.156 | 19.193957 | 0.001910 |
| bartowski/Qwen_Qwen3.5-9B-Q6_K_L | 7.592 | 19.202837 | 0.002371 |
| bartowski/Qwen_Qwen3.5-9B-Q6_K | 7.134 | 19.213584 | 0.002813 |
| unsloth/Qwen3.5-9B-Q6_K | 6.946 | 19.200108 | 0.003080 |
| Mungert/Qwen3.5-9B-q6_k_m | 6.872 | 19.235596 | 0.003609 |
| mradermacher/Qwen3.5-9B.i1-Q6_K | 6.854 | 19.234343 | 0.003735 |
| ZeroWw/Qwen3.5-9B.q6_k | 9.089 | 19.259351 | 0.004625 |
| AaryanK/Qwen3.5-9B.q6_k | 6.854 | 19.258445 | 0.004779 |
| DevQuasar/Qwen.Qwen3.5-9B.Q6_K | 6.854 | 19.272393 | 0.004801 |
| lmstudio-community/Qwen3.5-9B-Q6_K | 6.854 | 19.263994 | 0.004905 |
| bartowski/Qwen_Qwen3.5-9B-Q5_K_L | 6.976 | 19.268033 | 0.006068 |
| unsloth/Qwen3.5-9B-UD-Q5_K_XL | 6.281 | 19.260486 | 0.006419 |
| bartowski/Qwen_Qwen3.5-9B-Q5_K_M | 6.392 | 19.274078 | 0.006604 |
| Mungert/Qwen3.5-9B-q5_k_m | 6.336 | 19.263969 | 0.006714 |
| unsloth/Qwen3.5-9B-Q5_K_M | 6.126 | 19.298573 | 0.007290 |
| bartowski/Qwen_Qwen3.5-9B-Q5_K_S | 6.078 | 19.271394 | 0.008110 |
| unsloth/Qwen3.5-9B-Q5_K_S | 5.924 | 19.330239 | 0.009137 |
| bartowski/Qwen_Qwen3.5-9B-Q4_K_L | 6.188 | 19.377795 | 0.015064 |
| unsloth/Qwen3.5-9B-UD-Q4_K_XL | 5.556 | 19.355771 | 0.015238 |
| bartowski/Qwen_Qwen3.5-9B-Q4_K_M | 5.485 | 19.409285 | 0.016754 |
| AaryanK/Qwen3.5-9B.q5_0 | 5.872 | 19.516510 | 0.019535 |
| bartowski/Qwen_Qwen3.5-9B-Q4_K_S | 5.197 | 19.426160 | 0.020576 |
| eaddario/Qwen3.5-9B-Q6_K | 6.854 | 19.648966 | 0.021010 |
| bartowski/Qwen_Qwen3.5-9B-Q4_1 | 5.512 | 19.467238 | 0.023208 |
| byteshape/Qwen3.5-9B-Q5_K_S-5.10bpw | 5.329 | 19.532163 | 0.023510 |
| byteshape/Qwen3.5-9B-IQ4_XS-4.98bpw | 5.198 | 19.558089 | 0.024250 |
| bartowski/Qwen_Qwen3.5-9B-IQ4_NL | 5.07 | 19.498178 | 0.024696 |
| mradermacher/Qwen3.5-9B.i1-Q5_K_M | 6.074 | 19.706723 | 0.025498 |
| bartowski/Qwen_Qwen3.5-9B-IQ4_XS | 4.846 | 19.514750 | 0.025705 |
| eaddario/Qwen3.5-9B-Q5_K | 6.024 | 19.714336 | 0.026344 |
| Mungert/Qwen3.5-9B-iq4_nl | 4.972 | 19.562374 | 0.026716 |
| mradermacher/Qwen3.5-9B.i1-Q5_K_S | 5.872 | 19.725820 | 0.027342 |
| Mungert/Qwen3.5-9B-iq4_xs | 4.743 | 19.594639 | 0.027766 |
| mradermacher/Qwen3.5-9B.i1-IQ4_NL | 4.952 | 19.591508 | 0.027867 |
| mradermacher/Qwen3.5-9B.i1-IQ4_XS | 4.722 | 19.621767 | 0.028870 |
| ZeroWw/Qwen3.5-9B.q5_k | 8.435 | 19.830399 | 0.031931 |
| byteshape/Qwen3.5-9B-Q5_K_S-4.75bpw | 4.958 | 19.681021 | 0.032144 |
| AaryanK/Qwen3.5-9B.q5_k_m | 6.074 | 19.846397 | 0.032233 |
| DevQuasar/Qwen.Qwen3.5-9B.Q5_K_M | 6.074 | 19.852639 | 0.032304 |
| eaddario/Qwen3.5-9B-Q4_K-B | 5.485 | 19.858831 | 0.033141 |
| AaryanK/Qwen3.5-9B.q5_1 | 6.334 | 19.748779 | 0.034313 |
| Mungert/Qwen3.5-9B-q4_k_m | 5.564 | 19.841286 | 0.034431 |
| AaryanK/Qwen3.5-9B.q5_k_s | 5.872 | 19.864724 | 0.034770 |
| DevQuasar/Qwen.Qwen3.5-9B.Q5_K_S | 5.872 | 19.882870 | 0.034819 |
| eaddario/Qwen3.5-9B-Q4_K-U | 5.29 | 19.912657 | 0.036301 |
| llmware/Qwen3.5-9B-Q4_K_M | 5.29 | 19.854865 | 0.036925 |
| unsloth/Qwen3.5-9B-Q4_K_M | 5.29 | 19.859386 | 0.037104 |
| eaddario/Qwen3.5-9B-Q4_K | 5.243 | 19.959778 | 0.037505 |
| eaddario/Qwen3.5-9B-Q4_K_M-naive | 5.243 | 19.898625 | 0.038486 |
| byteshape/Qwen3.5-9B-Q5_K_S-4.60bpw | 4.802 | 19.790823 | 0.038704 |
| mradermacher/Qwen3.5-9B.i1-Q4_K_M | 5.241 | 19.908672 | 0.039594 |
| unsloth/Qwen3.5-9B-Q4_K_S | 5.024 | 19.908924 | 0.040750 |
| byteshape/Qwen3.5-9B-IQ4_XS-4.43bpw | 4.626 | 19.800843 | 0.041636 |
| unsloth/Qwen3.5-9B-Q4_1 | 5.436 | 19.903143 | 0.042209 |
| unsloth/Qwen3.5-9B-IQ4_NL | 5.002 | 19.937468 | 0.042506 |
| mradermacher/Qwen3.5-9B.i1-Q4_K_S | 4.974 | 19.977873 | 0.043795 |
| unsloth/Qwen3.5-9B-IQ4_XS | 4.814 | 19.952831 | 0.043811 |
| bartowski/Qwen_Qwen3.5-9B-Q4_0 | 5.074 | 19.864063 | 0.044698 |
| mradermacher/Qwen3.5-9B.i1-Q4_1 | 5.41 | 19.993730 | 0.044785 |
| unsloth/Qwen3.5-9B-UD-Q3_K_XL | 4.707 | 19.833348 | 0.046158 |
| steampunque/Qwen3.5-9B.Q4_K_H | 5.663 | 19.988807 | 0.047851 |
| byteshape/Qwen3.5-9B-IQ4_XS-4.20bpw | 4.384 | 19.994381 | 0.051704 |
| mradermacher/Qwen3.5-9B.i1-Q4_0 | 4.96 | 20.031403 | 0.052661 |
| bartowski/Qwen_Qwen3.5-9B-Q3_K_XL | 5.556 | 20.092393 | 0.058763 |
| Mungert/Qwen3.5-9B-iq3_s | 4.418 | 20.059272 | 0.059535 |
| Mungert/Qwen3.5-9B-iq3_m | 4.418 | 20.072130 | 0.059772 |
| ZeroWw/Qwen3.5-9B.q8q4 | 5.944 | 20.261738 | 0.060661 |
| DevQuasar/Qwen.Qwen3.5-9B.Q4_K_M | 5.241 | 20.299136 | 0.062447 |
| AaryanK/Qwen3.5-9B.q4_k_m | 5.241 | 20.273619 | 0.062641 |
| bartowski/Qwen_Qwen3.5-9B-Q3_K_L | 4.727 | 20.110764 | 0.062688 |
| lmstudio-community/Qwen3.5-9B-Q4_K_M | 5.241 | 20.284701 | 0.063009 |
| unsloth/Qwen3.5-9B-Q4_0 | 5.01 | 20.336317 | 0.064799 |
| bartowski/Qwen_Qwen3.5-9B-Q3_K_M | 4.533 | 20.152567 | 0.067070 |
| AaryanK/Qwen3.5-9B.q4_0 | 4.948 | 20.244066 | 0.067778 |
| AaryanK/Qwen3.5-9B.q4_k_s | 4.974 | 20.421610 | 0.071165 |
| DevQuasar/Qwen.Qwen3.5-9B.Q4_K_S | 4.974 | 20.425910 | 0.071280 |
| Mungert/Qwen3.5-9B-q3_k_m | 4.861 | 20.419780 | 0.073549 |
| eaddario/Qwen3.5-9B-Q3_K | 4.306 | 20.544374 | 0.075912 |
| bartowski/Qwen_Qwen3.5-9B-IQ3_M | 4.349 | 20.411438 | 0.076311 |
| Mungert/Qwen3.5-9B-iq3_xs | 4.289 | 20.262784 | 0.076315 |
| keyuan01/qwen3.5-9b-mix | 4.508 | 20.462178 | 0.082440 |
| mradermacher/Qwen3.5-9B.i1-Q3_K_L | 4.493 | 20.475629 | 0.082614 |
| AaryanK/Qwen3.5-9B.q4_1 | 5.41 | 20.693102 | 0.084915 |
| mradermacher/Qwen3.5-9B.i1-Q3_K_M | 4.299 | 20.565871 | 0.087404 |
| bartowski/Qwen_Qwen3.5-9B-IQ3_XS | 4.197 | 20.598822 | 0.087739 |
| mradermacher/Qwen3.5-9B.i1-IQ3_M | 4.112 | 20.568608 | 0.087748 |
| unsloth/Qwen3.5-9B-Q3_K_M | 4.353 | 20.668516 | 0.088135 |
| Mungert/Qwen3.5-9B-iq3_xxs | 3.982 | 20.749878 | 0.094229 |
| mradermacher/Qwen3.5-9B.i1-IQ3_S | 3.971 | 20.694098 | 0.094688 |
| byteshape/Qwen3.5-9B-Q4_K_S-3.92bpw | 4.095 | 20.856006 | 0.100597 |
| bartowski/Qwen_Qwen3.5-9B-Q3_K_S | 4.3 | 20.918237 | 0.101205 |
| mradermacher/Qwen3.5-9B.i1-IQ3_XS | 3.852 | 20.825952 | 0.105562 |
| AaryanK/Qwen3.5-9B.q3_k_l | 4.493 | 21.068526 | 0.109296 |
| DevQuasar/Qwen.Qwen3.5-9B.Q3_K_L | 4.493 | 21.070038 | 0.109460 |
| bartowski/Qwen_Qwen3.5-9B-IQ3_XXS | 4.052 | 21.074602 | 0.113778 |
| DevQuasar/Qwen.Qwen3.5-9B.Q3_K_M | 4.299 | 21.186911 | 0.117853 |
| unsloth/Qwen3.5-9B-UD-IQ3_XXS | 3.74 | 21.337685 | 0.122042 |
| byteshape/Qwen3.5-9B-IQ4_XS-3.60bpw | 3.766 | 21.935245 | 0.142608 |
| mradermacher/Qwen3.5-9B.i1-Q3_K_S | 3.967 | 21.834745 | 0.146521 |
| unsloth/Qwen3.5-9B-Q3_K_S | 4.02 | 22.041631 | 0.151734 |
| mradermacher/Qwen3.5-9B.i1-IQ3_XXS | 3.533 | 21.757513 | 0.155960 |
| Mungert/Qwen3.5-9B-q2_k_m | 4.11 | 22.583041 | 0.187712 |
| bartowski/Qwen_Qwen3.5-9B-Q2_K_L | 4.649 | 23.033036 | 0.195621 |
| DevQuasar/Qwen.Qwen3.5-9B.Q3_K_S | 3.967 | 23.241273 | 0.204858 |
| byteshape/Qwen3.5-9B-IQ3_S-3.15bpw | 3.291 | 23.628691 | 0.221494 |
| byteshape/Qwen3.5-9B-IQ3_S-3.00bpw | 3.137 | 24.952801 | 0.278109 |
| byteshape/Qwen3.5-9B-Q3_K_S-3.46bpw | 3.614 | 25.713151 | 0.310829 |
| byteshape/Qwen3.5-9B-IQ3_S-2.81bpw | 2.938 | 27.095131 | 0.362968 |
SIZE VS KLD RANKINGS - Qwen3.5-9B-bf16
Efficiency Score: √(normalized size² + normalized KLD²). Bold: KLD score < 0.01. Lower is better. (A code sketch of this calculation follows the table.)
| Rank | Quantization | Size (GiB) | KLD | Eff. Score |
|---|---|---|---|---|
| 1 | mradermacher/Qwen3.5-9B.i1-IQ4_XS | 4.722 | 0.028870 | 0.209539 |
| 2 | Mungert/Qwen3.5-9B-iq4_xs | 4.743 | 0.027766 | 0.210595 |
| 3 | byteshape/Qwen3.5-9B-IQ4_XS-4.20bpw | 4.384 | 0.051704 | 0.210931 |
| 4 | byteshape/Qwen3.5-9B-IQ4_XS-4.43bpw | 4.626 | 0.041636 | 0.215789 |
| 5 | bartowski/Qwen_Qwen3.5-9B-IQ4_XS | 4.846 | 0.025705 | 0.219361 |
| 6 | Mungert/Qwen3.5-9B-iq3_s | 4.418 | 0.059535 | 0.228461 |
| 7 | byteshape/Qwen3.5-9B-Q5_K_S-4.60bpw | 4.802 | 0.038704 | 0.228678 |
| 8 | Mungert/Qwen3.5-9B-iq3_m | 4.418 | 0.059772 | 0.228923 |
| 9 | unsloth/Qwen3.5-9B-UD-Q3_K_XL | 4.707 | 0.046158 | 0.229921 |
| 10 | mradermacher/Qwen3.5-9B.i1-IQ4_NL | 4.952 | 0.027867 | 0.232240 |
| 11 | Mungert/Qwen3.5-9B-iq4_nl | 4.972 | 0.026716 | 0.233334 |
| 12 | unsloth/Qwen3.5-9B-IQ4_XS | 4.814 | 0.043811 | 0.236552 |
| 13 | byteshape/Qwen3.5-9B-Q5_K_S-4.75bpw | 4.958 | 0.032144 | 0.236871 |
| 14 | bartowski/Qwen_Qwen3.5-9B-IQ4_NL | 5.070 | 0.024696 | 0.242012 |
| 15 | mradermacher/Qwen3.5-9B.i1-Q4_K_S | 4.974 | 0.043795 | 0.251854 |
| 16 | bartowski/Qwen_Qwen3.5-9B-Q3_K_M | 4.533 | 0.067070 | 0.252138 |
| 17 | bartowski/Qwen_Qwen3.5-9B-Q4_K_S | 5.197 | 0.020576 | 0.252761 |
| 18 | unsloth/Qwen3.5-9B-IQ4_NL | 5.002 | 0.042506 | 0.252937 |
| 19 | unsloth/Qwen3.5-9B-Q4_K_S | 5.024 | 0.040750 | 0.252950 |
| 20 | Mungert/Qwen3.5-9B-iq3_xs | 4.289 | 0.076315 | 0.254829 |
| 21 | eaddario/Qwen3.5-9B-Q3_K | 4.306 | 0.075912 | 0.255008 |
| 22 | byteshape/Qwen3.5-9B-IQ4_XS-4.98bpw | 5.198 | 0.024250 | 0.255212 |
| 23 | bartowski/Qwen_Qwen3.5-9B-IQ3_M | 4.349 | 0.076311 | 0.258679 |
| 24 | bartowski/Qwen_Qwen3.5-9B-Q3_K_L | 4.727 | 0.062688 | 0.259151 |
| 25 | bartowski/Qwen_Qwen3.5-9B-Q4_0 | 5.074 | 0.044698 | 0.262704 |
| 26 | mradermacher/Qwen3.5-9B.i1-Q4_0 | 4.960 | 0.052661 | 0.262913 |
| 27 | byteshape/Qwen3.5-9B-Q5_K_S-5.10bpw | 5.329 | 0.023510 | 0.268630 |
| 28 | eaddario/Qwen3.5-9B-Q4_K | 5.243 | 0.037505 | 0.271296 |
| 29 | mradermacher/Qwen3.5-9B.i1-IQ3_M | 4.112 | 0.087748 | 0.271508 |
| 30 | eaddario/Qwen3.5-9B-Q4_K_M-naive | 5.243 | 0.038486 | 0.272310 |
| 31 | mradermacher/Qwen3.5-9B.i1-Q4_K_M | 5.241 | 0.039594 | 0.273283 |
| 32 | eaddario/Qwen3.5-9B-Q4_K-U | 5.290 | 0.036301 | 0.274885 |
| 33 | llmware/Qwen3.5-9B-Q4_K_M | 5.290 | 0.036925 | 0.275498 |
| 34 | unsloth/Qwen3.5-9B-Q4_K_M | 5.290 | 0.037104 | 0.275676 |
| 35 | bartowski/Qwen_Qwen3.5-9B-IQ3_XS | 4.197 | 0.087739 | 0.276002 |
| 36 | mradermacher/Qwen3.5-9B.i1-Q3_K_M | 4.299 | 0.087404 | 0.280946 |
| 37 | Mungert/Qwen3.5-9B-iq3_xxs | 3.982 | 0.094229 | 0.281356 |
| 38 | bartowski/Qwen_Qwen3.5-9B-Q4_K_M | 5.485 | 0.016754 | 0.281813 |
| 39 | mradermacher/Qwen3.5-9B.i1-IQ3_S | 3.971 | 0.094688 | 0.282033 |
| 40 | mradermacher/Qwen3.5-9B.i1-Q3_K_L | 4.493 | 0.082614 | 0.282064 |
| 41 | keyuan01/qwen3.5-9b-mix | 4.508 | 0.082440 | 0.282674 |
| 42 | unsloth/Qwen3.5-9B-Q3_K_M | 4.353 | 0.088135 | 0.285815 |
| 43 | AaryanK/Qwen3.5-9B.q4_0 | 4.948 | 0.067778 | 0.286669 |
| 44 | unsloth/Qwen3.5-9B-Q4_0 | 5.010 | 0.064799 | 0.286779 |
| 45 | bartowski/Qwen_Qwen3.5-9B-Q4_1 | 5.512 | 0.023208 | 0.287966 |
| 46 | unsloth/Qwen3.5-9B-UD-Q4_K_XL | 5.556 | 0.015238 | 0.288895 |
| 47 | Mungert/Qwen3.5-9B-q3_k_m | 4.861 | 0.073549 | 0.290196 |
| 48 | eaddario/Qwen3.5-9B-Q4_K-B | 5.485 | 0.033141 | 0.292174 |
| 49 | AaryanK/Qwen3.5-9B.q4_k_s | 4.974 | 0.071165 | 0.294908 |
| 50 | DevQuasar/Qwen.Qwen3.5-9B.Q4_K_S | 4.974 | 0.071280 | 0.295117 |
| 51 | unsloth/Qwen3.5-9B-Q4_1 | 5.436 | 0.042209 | 0.295744 |
| 52 | mradermacher/Qwen3.5-9B.i1-Q4_1 | 5.410 | 0.044785 | 0.295947 |
| 53 | Mungert/Qwen3.5-9B-q4_k_m | 5.564 | 0.034431 | 0.301487 |
| 54 | byteshape/Qwen3.5-9B-Q4_K_S-3.92bpw | 4.095 | 0.100597 | 0.302487 |
| 55 | DevQuasar/Qwen.Qwen3.5-9B.Q4_K_M | 5.241 | 0.062447 | 0.303452 |
| 56 | AaryanK/Qwen3.5-9B.q4_k_m | 5.241 | 0.062641 | 0.303751 |
| 57 | lmstudio-community/Qwen3.5-9B-Q4_K_M | 5.241 | 0.063009 | 0.304321 |
| 58 | mradermacher/Qwen3.5-9B.i1-IQ3_XS | 3.852 | 0.105562 | 0.305304 |
| 59 | bartowski/Qwen_Qwen3.5-9B-Q3_K_S | 4.300 | 0.101205 | 0.314005 |
| 60 | steampunque/Qwen3.5-9B.Q4_K_H | 5.663 | 0.047851 | 0.324685 |
| 61 | AaryanK/Qwen3.5-9B.q5_0 | 5.872 | 0.019535 | 0.324810 |
| 62 | unsloth/Qwen3.5-9B-Q5_K_S | 5.924 | 0.009137 | 0.327254 |
| 63 | bartowski/Qwen_Qwen3.5-9B-Q3_K_XL | 5.556 | 0.058763 | 0.327527 |
| 64 | mradermacher/Qwen3.5-9B.i1-Q5_K_S | 5.872 | 0.027342 | 0.328869 |
| 65 | AaryanK/Qwen3.5-9B.q5_k_s | 5.872 | 0.034770 | 0.333982 |
| 66 | DevQuasar/Qwen.Qwen3.5-9B.Q5_K_S | 5.872 | 0.034819 | 0.334020 |
| 67 | bartowski/Qwen_Qwen3.5-9B-IQ3_XXS | 4.052 | 0.113778 | 0.334185 |
| 68 | AaryanK/Qwen3.5-9B.q3_k_l | 4.493 | 0.109296 | 0.343797 |
| 69 | bartowski/Qwen_Qwen3.5-9B-Q5_K_S | 6.078 | 0.008110 | 0.343888 |
| 70 | DevQuasar/Qwen.Qwen3.5-9B.Q3_K_L | 4.493 | 0.109460 | 0.344191 |
| 71 | eaddario/Qwen3.5-9B-Q5_K | 6.024 | 0.026344 | 0.344536 |
| 72 | unsloth/Qwen3.5-9B-UD-IQ3_XXS | 3.740 | 0.122042 | 0.345356 |
| 73 | unsloth/Qwen3.5-9B-Q5_K_M | 6.126 | 0.007290 | 0.349012 |
| 74 | mradermacher/Qwen3.5-9B.i1-Q5_K_M | 6.074 | 0.025498 | 0.349436 |
| 75 | AaryanK/Qwen3.5-9B.q5_k_m | 6.074 | 0.032233 | 0.353487 |
| 76 | DevQuasar/Qwen.Qwen3.5-9B.Q5_K_M | 6.074 | 0.032304 | 0.353535 |
| 77 | DevQuasar/Qwen.Qwen3.5-9B.Q3_K_M | 4.299 | 0.117853 | 0.355143 |
| 78 | AaryanK/Qwen3.5-9B.q4_1 | 5.410 | 0.084915 | 0.355835 |
| 79 | bartowski/Qwen_Qwen3.5-9B-Q4_K_L | 6.188 | 0.015064 | 0.357446 |
| 80 | unsloth/Qwen3.5-9B-UD-Q5_K_XL | 6.281 | 0.006419 | 0.365840 |
| 81 | ZeroWw/Qwen3.5-9B.q8q4 | 5.944 | 0.060661 | 0.367509 |
| 82 | Mungert/Qwen3.5-9B-q5_k_m | 6.336 | 0.006714 | 0.371882 |
| 83 | bartowski/Qwen_Qwen3.5-9B-Q5_K_M | 6.392 | 0.006604 | 0.377988 |
| 84 | AaryanK/Qwen3.5-9B.q5_1 | 6.334 | 0.034313 | 0.382466 |
| 85 | byteshape/Qwen3.5-9B-IQ4_XS-3.60bpw | 3.766 | 0.142608 | 0.401233 |
| 86 | mradermacher/Qwen3.5-9B.i1-Q3_K_S | 3.967 | 0.146521 | 0.417162 |
| 87 | mradermacher/Qwen3.5-9B.i1-Q6_K | 6.854 | 0.003735 | 0.428270 |
| 88 | AaryanK/Qwen3.5-9B.q6_k | 6.854 | 0.004779 | 0.428327 |
| 89 | DevQuasar/Qwen.Qwen3.5-9B.Q6_K | 6.854 | 0.004801 | 0.428328 |
| 90 | lmstudio-community/Qwen3.5-9B-Q6_K | 6.854 | 0.004905 | 0.428335 |
| 91 | Mungert/Qwen3.5-9B-q6_k_m | 6.872 | 0.003609 | 0.430232 |
| 92 | eaddario/Qwen3.5-9B-Q6_K | 6.854 | 0.021010 | 0.431700 |
| 93 | unsloth/Qwen3.5-9B-Q3_K_S | 4.020 | 0.151734 | 0.432604 |
| 94 | mradermacher/Qwen3.5-9B.i1-IQ3_XXS | 3.533 | 0.155960 | 0.432711 |
| 95 | unsloth/Qwen3.5-9B-Q6_K | 6.946 | 0.003080 | 0.438303 |
| 96 | bartowski/Qwen_Qwen3.5-9B-Q5_K_L | 6.976 | 0.006068 | 0.441758 |
| 97 | bartowski/Qwen_Qwen3.5-9B-Q6_K | 7.134 | 0.002813 | 0.458852 |
| 98 | bartowski/Qwen_Qwen3.5-9B-Q6_K_L | 7.592 | 0.002371 | 0.508922 |
| 99 | Mungert/Qwen3.5-9B-q2_k_m | 4.110 | 0.187712 | 0.531250 |
| 100 | bartowski/Qwen_Qwen3.5-9B-Q2_K_L | 4.649 | 0.195621 | 0.569058 |
| 101 | unsloth/Qwen3.5-9B-UD-Q6_K_XL | 8.156 | 0.001910 | 0.570588 |
| 102 | DevQuasar/Qwen.Qwen3.5-9B.Q3_K_S | 3.967 | 0.204858 | 0.574089 |
| 103 | ZeroWw/Qwen3.5-9B.q5_k | 8.435 | 0.031931 | 0.607067 |
| 104 | byteshape/Qwen3.5-9B-IQ3_S-3.15bpw | 3.291 | 0.221494 | 0.610162 |
| 105 | eaddario/Qwen3.5-9B-Q8_0 | 8.873 | 0.001198 | 0.648989 |
| 106 | lmstudio-community/Qwen3.5-9B-Q8_0 | 8.873 | 0.001410 | 0.648989 |
| 107 | ZeroWw/Qwen3.5-9B.q8_p | 8.873 | 0.001412 | 0.648989 |
| 108 | unsloth/Qwen3.5-9B-Q8_0 | 8.873 | 0.001433 | 0.648989 |
| 109 | AaryanK/Qwen3.5-9B.q8_0 | 8.873 | 0.001445 | 0.648989 |
| 110 | DevQuasar/Qwen.Qwen3.5-9B.Q8_0 | 8.873 | 0.001464 | 0.648989 |
| 111 | bartowski/Qwen_Qwen3.5-9B-Q8_0 | 8.890 | 0.001405 | 0.650848 |
| 112 | ZeroWw/Qwen3.5-9B.q6_k | 9.089 | 0.004625 | 0.672675 |
| 113 | byteshape/Qwen3.5-9B-IQ3_S-3.00bpw | 3.137 | 0.278109 | 0.765743 |
| 114 | ZeroWw/Qwen3.5-9B.q8_0 | 10.649 | 0.001679 | 0.843194 |
| 115 | byteshape/Qwen3.5-9B-Q3_K_S-3.46bpw | 3.614 | 0.310829 | 0.859064 |
| 116 | byteshape/Qwen3.5-9B-IQ3_S-2.81bpw | 2.938 | 0.362968 | 1.000000 |
| 117 | unsloth/Qwen3.5-9B-UD-Q8_K_XL | 12.083 | 0.001243 | 1.000000 |
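For reference, a minimal sketch of how an efficiency score like the one above can be computed. The min-max normalization is an assumption about what "Normalized" means here, so treat this as illustrative rather than the exact script used; the three entries are taken from the table above.

```python
import math

# (name, size_gib, kld) triples from the table above (abridged)
quants = [
    ("mradermacher/Qwen3.5-9B.i1-IQ4_XS", 4.722, 0.028870),
    ("bartowski/Qwen_Qwen3.5-9B-Q4_K_M",  5.485, 0.016754),
    ("unsloth/Qwen3.5-9B-UD-Q8_K_XL",     12.083, 0.001243),
]

sizes = [s for _, s, _ in quants]
klds  = [k for _, _, k in quants]

def norm(x, lo, hi):
    # Min-max normalize to [0, 1]; 0 is the best (smallest) value on that axis.
    return (x - lo) / (hi - lo) if hi > lo else 0.0

# Euclidean distance from the "ideal" corner (smallest size, lowest KLD). Lower is better.
scored = []
for name, size, kld in quants:
    ns = norm(size, min(sizes), max(sizes))
    nk = norm(kld, min(klds), max(klds))
    scored.append((math.hypot(ns, nk), name))

for score, name in sorted(scored):
    print(f"{score:.6f}  {name}")
```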
eval dataset: https://gist.github.com/cmhamiche/788eada03077f4341dfb39df8be012dc 103 chunks at -c 512
ik_llama.cpp: https://github.com/Thireus/ik_llama.cpp/releases/tag/main-b4608-b33a10d
nvidia drivers: 595.97
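For anyone who wants to rerun this kind of sweep, the rough llama.cpp workflow is: dump the baseline logits once with the BF16 model, then score each quant against them. A small driver sketch; the file names are placeholders and the flags are from recent llama.cpp builds, so verify them with `llama-perplexity --help` on your version:

```python
import subprocess

DATASET = "kld-sweep-dataset.txt"    # the eval text linked above
BASELINE = "Qwen3.5-9B-BF16.gguf"    # placeholder paths
LOGITS = "base_logits.kld"           # baseline logits dump (this is the big file)

# 1) Save the baseline token probabilities once.
subprocess.run(["llama-perplexity", "-m", BASELINE, "-f", DATASET,
                "-c", "512", "--kl-divergence-base", LOGITS], check=True)

# 2) Score each quant against the saved baseline to get PPL and mean KLD.
for quant in ["Qwen3.5-9B-Q6_K.gguf", "Qwen3.5-9B-Q4_K_M.gguf"]:
    subprocess.run(["llama-perplexity", "-m", quant,
                    "--kl-divergence-base", LOGITS, "--kl-divergence"], check=True)
```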
Enthu-Cutlet-1337@reddit
Q8_0 barely moves KLD; on 12GB VRAM, Q6_K is the real sweet spot, not the shiny one.
Thireus@reddit
Good stuff. It would be nice if you could add the ones produced by https://gguf.thireus.com/quant_assign.html. Cheers.
PhoneOk7721@reddit
I am getting a lot of `Failed to decode recipe fragment: Failed to compress with code -10` errors on your site
Thireus@reddit
I'd recommend accessing https://gguf.thireus.com/quant_downloader.html and manually uploading the recipe file you produced with https://gguf.thireus.com/quant_assign.html, which should be in your download folder.
Thireus@reddit
Can you paste the recipe here and tell me which version of llama or ik_llama you are using?
TitwitMuffbiscuit@reddit (OP)
True, and thanks for providing the binaries btw, and, you know, all your contributions. I'd have to produce them myself, right?
Thireus@reddit
Yes, I'd advise producing them; you can set the desired bpw or GiB size via the size input field (the default is 50%, but you can enter "3 GiB" or "4 bpw", for example).
Alternatively there are already a few examples here: https://github.com/Thireus/GGUF-Tool-Suite/tree/main/recipe_examples/ik_harmonized_recipes to be downloaded with https://gguf.thireus.com/quant_downloader.html
TitwitMuffbiscuit@reddit (OP)
Ok thanks.
The goal here is to provide insight for the already available quants but your project deserves a bit more visibility given the quality of your work. I'll do this for the future Gemma 4 if you don't mind.
I'm still trying to figure out how I would slot them into the tables in all fairness.
It would probably require me to quant them after the whole shebang and try to beat the best Size vs KLD at various bpw, even tho this metric is a bit blurry these days, idk.
Thireus@reddit
I see what you’re getting at. From my perspective, though, the quants are already available—they’re just not combined into a single GGUF yet: https://huggingface.co/Thireus/collections. Since mixing and merging is basically just a couple of clicks using https://gguf.thireus.com/quant_assign.html, it feels a bit hard to argue that they shouldn’t be considered available.
For example, here’s a direct download link for the GGUF of Qwen3.5-9B using a llama.cpp preset that matches eaddario’s 3.2130 bpw GGUF: https://gguf.thireus.com/quant_downloader.html?a&n=Qwen3.5-9B.WEB_USER-3.2132bpw-3GB-GGUF_3GB-GPU_0GB-CPU.e56c257_f8a4c5f#0-KLUv_WDOBtUNAFZROSRAiegBuMDQsKxc03wA-yub2KVsug6gzgL9TXKTvBGtVQEicAE0AC0ALQA1EoGeSfaUEnVFUrtVEbcFQYA3GJGEKM6kQGf7g_IHVSQFzVqUNiUEwTAgpzgNDgnRAwNA-4hv7Rhrz8weP0ws5u1Dt_jts5UeGT3do-IIwGYl3kNmufeQ176D4kiOR5EE2xgCtsUe9Tgsj5_3m2cTQqGei8W-X35BYt49V1W1bRd7jUnbHsHf-Jv9ZX8DNdKOech0Wa7G6FEch6BdmDSBdWJdPsuzQTNVbFCO2FOtk0Y-pXtsr5Rir11VoMExVFMKyyko1jggQkIcZHUDkgp2kkPVucE6LRBacxfLNbDhZkPRKcTUph01LU2WDLDrWvYpCQ53C3hZCkpFmmmT5CPxbo8Wwn7muyD3CkOaWaSMw3vWwFRUJIKUV0V2I42zBVR96ctcRaEhiqknpQp3HIYyl_4FMXVs6Lk6txvFZPmgCgCuoBF2lO8K0DJO8E3ADC8pOt9snSYpU6VepoK9VRux9MdGbkHaFcgmQXcZJwV9PBXeqSIEn1PLorDiMMnBxcCBErPanYwOytKvqwI
I don't have plans yet to benchmark Gemma 4.
TitwitMuffbiscuit@reddit (OP)
Okay thanks. I'll try to update this post by tomorrow unless I need some clarifications, I'd message you then.
Thireus@reddit
Sure, don't hesitate to reach out.
Basically - access https://gguf.thireus.com/quant_assign.html, then:
1. Select "llama.cpp" as the preset for a fair comparison (I believe all of them are llama.cpp-only); ik_llama.cpp would produce much better quality but wouldn't be fair since it's limited to ik_llama.cpp only.
2. Select Qwen3.5-9B as model
3. Set a "bpw" in the size input that matches some of the best GGUFs you've benchmarked
4. Hit the Produce GGUF Recipe button
5. Hit the GGUF Download button once available
6. Repeat steps 3-5, which should now take only a second to produce new recipes.
TitwitMuffbiscuit@reddit (OP)
Okay I'll try:
eaddario/Qwen3.5-9B-Q8_0 at 8.50 bpw just to get to the top of KLD rank, hopefully.
unsloth/Qwen3.5-9B-UD-Q5_K_XL at 6.02 bpw
bartowski/Qwen_Qwen3.5-9B-Q5_K_S at 5.82 bpw
unsloth/Qwen3.5-9B-Q5_K_S at 5.67 bpw
mradermacher/Qwen3.5-9B.i1-IQ4_XS at 4.52 bpw
Then 50% at 4.25 bpw
I'd do other quants if it were a bigger model but it doesn't really make sense to add more. I don't want to spam with another post but I'll include some of yours next time.
Thireus@reddit
Great. It'll be worth mentioning which preset you've used, as llama.cpp presets will perform worse than ik_llama.cpp quality presets. Or maybe generate both?
TitwitMuffbiscuit@reddit (OP)
I used a mainline-compatible preset as you suggested; I'll mention it at the bottom of the post.
I'll include u/ilintar's exotic quants and an ik_llama quality preset at equivalent size from the website, for fairness, later today.
Thireus@reddit
Awesome! Indeed, pretty much all the ik_llama.cpp quants should (at lower bpw) beat the llama.cpp ones, so it wouldn't be fair to mix them up. I'd strongly advise including u/VoidAlchemy's (ubergarm) GGUFs too if available: https://huggingface.co/ubergarm/models
TitwitMuffbiscuit@reddit (OP)
Okay, just so you know I won't update the post out of laziness (also because there's not much visibility right now) but still, your quants need a nerf.
TitwitMuffbiscuit@reddit (OP)
Ubergarm is mostly interested in fat MoE models given that he's using ik for CPU inference and there's much more to gain there, which is understandable. So no meek 9B on his repo. That said, I've included some of them in previous tests: Qwen3.5-27B Q4 Quantization and Qwen3.5-35B-A3B Q4 Quantization
VoidAlchemy@reddit
Thanks, correct, I didn't release any ik quants for Qwen3.5-9B, but given some recent discussion on possibly enabling MTP tensor support, it might be a good one to experiment with.
I'll holler if I release any experimental Qwen3.5-9B's with ik quants (and maybe figure out how to preserve the MTP tensors and mark them as unused with a patch).
TitwitMuffbiscuit@reddit (OP)
I included them. Congratulations on the results.
It's definitely what I'll use from now on.
Thireus@reddit
Good stuff, thanks for including them.
TitwitMuffbiscuit@reddit (OP)
Tbh it is easy, just 3 clicks on your website and I had the quants.
I just needed to refresh the page in between or delete the cache from time to time.
As I said, I'll use your quants with the ik quality preset from now on if the model is available. RIP your server.
Thireus@reddit
That's a good idea; however, a bpw tier could also give more of an advantage to the ones with higher bpw (unless you apply some extra normalisation). What you could do is compute the "optimum" KLD curve they all follow and compute their distance from that curve, then rank them by distance to the curve. This is kinda what I've been trying to demonstrate on the graph you see on https://gguf.thireus.com/quant_assign.html - that's for PPL, but the same logic applies to KLD. The tool you could use is https://github.com/Thireus/GGUF-Tool-Suite/blob/main/model_tensor_bpw_metric.py and the wrapper that produces the curve equations you see on https://github.com/Thireus/GGUF-Tool-Suite/blob/main/ppl_graphs/ppl_list.txt is: https://github.com/Thireus/GGUF-Tool-Suite/blob/main/ppl_graphs/ppl_list.sh
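If it helps, here is one possible reading of that suggestion in code: fit a simple exponential-decay curve to KLD vs bpw across all benchmarked quants, then rank each quant by how far below (or above) the fitted curve it sits. This is a sketch of the idea, not the GGUF-Tool-Suite script, and the data points are placeholders:

```python
import numpy as np
from scipy.optimize import curve_fit

# Placeholder (bpw, kld) points for the benchmarked quants.
bpw = np.array([3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 8.5])
kld = np.array([0.28, 0.15, 0.065, 0.035, 0.020, 0.004, 0.0013])

# KLD tends to fall off roughly exponentially with bpw, so fit kld ~ a * exp(-b * bpw) + c.
def model(x, a, b, c):
    return a * np.exp(-b * x) + c

params, _ = curve_fit(model, bpw, kld, p0=(1.0, 1.0, 0.0), maxfev=10000)

# Residual from the fitted "typical" curve: negative means the quant does
# better than expected for its size.
residual = kld - model(bpw, *params)
for b_, k_, r_ in sorted(zip(bpw, kld, residual), key=lambda t: t[2]):
    print(f"{b_:.2f} bpw  KLD={k_:.4f}  distance from curve={r_:+.5f}")
```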
TitwitMuffbiscuit@reddit (OP)
I don't want to be the "gentoo is better than ubuntu" type of guy to the people that just made the switch from windows (not that your quants are gentoo-ish).
I'd make a new post if I can formulate the findings in a concise and helpful way to the general r/localllama user as to what quant to pick. KLD presented as faithfulness is fine. Size vs KLD is already a step away from this.
So the distance to the optimum KLD curve is a great addition (I'll yoink that) for technically minded people, and it's not that different from the efficiency score logic tbh, but I need to think about a more intuitive representation; plotting this won't speak to most people.
I don't want to waste your time ranting in there, but you've spent a lot of time on GGUF-Tool-Suite's user experience so you understand the challenge.
Ok-Measurement-1575@reddit
This is awesome, thanks.
I would guess quite a few of us are using the 4bit AWQs from the likes of Cyankiwi so it might be worth considering throwing them into the mix, too, if possible.
Anecdotally, they seem at least as good as Q4KL GGUFs.
TitwitMuffbiscuit@reddit (OP)
I certainly agree and I'd love to do exllama v2/v3, but I try to keep it simple and streamlined given that I only have an i3-12100F, 64GB of DDR4-3200, an RTX 3060 12GB and, worst of all, a 1TB drive (just for this eval sweep, the weights and logits take 665GB).
CheatCodesOfLife@reddit
Mate, this data is awesome, thank you!
Just in case you aren't aware of it, SSDs can handle a finite number of writes (usually measured in TB written) before they cark it.
You can probably google it for your specific drive and keep it in mind if you need the hardware to last.
TitwitMuffbiscuit@reddit (OP)
CheatCodesOfLife@reddit
Windows. I remember using that WinDirStat years ago.
If you want to check the total writes, you can use another tool called CrystalDiskInfo.
Imaginary-Unit-3267@reddit
This is almost exactly the same setup I have! I need to start watching you and trying out all the same models you have success with in future...
Ok-Measurement-1575@reddit
You're doing the Lord's work on that 3060 :D
WhoRoger@reddit
Thireus huh? Haven't used them so far, looks interesting.
But I'm confused by which of their quants is which, as they're marked by bpw in the chart. Like the bottom two yellow triangles in the zoomed-in section - one is IQ4_XS I guess? Or is that another label overlapping? And what's the other one?
No idea how to map it onto their releases, which are already confusing enough
TitwitMuffbiscuit@reddit (OP)
Yeah, he's using bpw, which is more "honest" compared to the usual quant schemes, where you can get a "q4" as big as a q6 with a custom recipe; even HuggingFace is sometimes confused.
You'd have to use his website, linked at the bottom of the post; it's explained there.
That's why it's size vs KLD: those on the same vertical line are approximately the same size, so they use pretty much the same amount of VRAM (given the same context ofc).
WhoRoger@reddit
Oh I see (I think). Tho that kinda makes comparisons meaningless.
I like the idea of rolling your own quant based on hw, that's probably the near future rather than a billion pre-made quants. But we should still be able to know what kind of quant we're talking about, since models respond differently to being quanted. Like Phi is known to like Q4 because of the underlying structure, or I found recently that Q8 mmproj may work better than F16, and I don't think bpw and KLD will always tell you that.
I think we need some new framework to measure such things, based on how models respond to this or that... Maybe add FTs like abliteration or bugfixes into the mix.
But then I guess we're more in benchmark territory, but maybe that's what we need.
To wit, imagine deciding between a 9B Q2, a 7B Q4, and a tuned/fixed (heretic or antislop or whatever) 5B Q6, all distilled from the same source; I think that's the kind of eval we might want to generalise.
If we can first find the ideal quant for each model based on its arch, that would make it easier. Cuz at this time, we either have comparisons of different models with the same quant like Q8, or different quants of the same model (like yours), and neither really gives the full picture.
I remember seeing a project here where someone was also building an ideal quant based on optimising the performance of each layer (I think).
Am I making sense? Just rambling at this point. Thanks for your work and time, tho!
TitwitMuffbiscuit@reddit (OP)
https://github.com/ggml-org/llama.cpp/pull/15550 https://github.com/Thireus/GGUF-Tool-Suite
With llama.cpp you can't do, let's say, a saliency map like REAP; it's quantized by blocks and super-blocks. https://github.com/iuliaturc/gguf-docs/blob/main/k-quants.md
Imatrix solves pretty much all those issues at a fraction of the compute compared to other solutions (like intel's autoround for example).
Also, there's no such thing as an ideal quant; it's all subjective. There are always tradeoffs, so I'd suggest building your own calibration dataset to create an imatrix that suits your tasks.
This is what I use for now (while I write a little TUI that will replace my little scripts). https://github.com/cmhamiche/kld-sweep-dataset kudos to https://huggingface.co/datasets/eaddario/imatrix-calibration.
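For reference, rolling your own imatrix is roughly the following two steps. This is a sketch: the paths are placeholders and the flag names are from recent llama.cpp builds, so double-check them against your version.

```python
import subprocess

MODEL_BF16 = "Qwen3.5-9B-BF16.gguf"   # placeholder paths
CALIB = "my_calibration.txt"          # your own task-representative text
IMATRIX = "imatrix.dat"

# 1) Compute the importance matrix by running the calibration text
#    through the full-precision model.
subprocess.run(["llama-imatrix", "-m", MODEL_BF16, "-f", CALIB, "-o", IMATRIX], check=True)

# 2) Quantize with that imatrix so the most activation-critical weights keep more precision.
subprocess.run(["llama-quantize", "--imatrix", IMATRIX,
                MODEL_BF16, "Qwen3.5-9B-IQ4_XS.gguf", "IQ4_XS"], check=True)
```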
WhoRoger@reddit
Maybe eventually we'll end up standardising comparisons by file size / RAM usage, such as a 12GB quant of a 27B model vs a 12GB quant of a 16GB model. I think that's more practical, since at the end it's about what we can actually run, especially for local use.
TitwitMuffbiscuit@reddit (OP)
Well, that's another story. Let's say I put a model in the 12GB bracket but it runs at 8 t/s; should it be included?
Depending on the model architecture, the type of model (like an MoE with n experts offloaded vs dense), the options used (KV cache compression or not, or nkvo), and the context size (full or not), it might or might not pass the arbitrary ceiling you've set.
Those are all the things people will nitpick about.
WhoRoger@reddit
That depends on what you want to compare.
If you think in terms of car or phone comparisons, they are usually focused on a specific threshold parameter, like price, but then may include contenders from various categories, like new/used, comfort/performance and whatnot. So it's like "what can I get for this amount of money". Or they can just test a specific category focus, like "new gaming phones around 500€".
So an LLM equivalent could be something narrow like "20+B MoE coding agents" or broad like "what can I fit into 12GB". The important part is to have the framework thought out so that even models from totally different categories can be compared. Or, relevant to this topic, quants.
letsgoiowa@reddit
Amazing work but unreadable. All I care about is what's the best at each tier of VRAM usage or each quant level. I don't need to see 123 separate models; I just want maybe 5 to compare.
TitwitMuffbiscuit@reddit (OP)
The plot is unreadable, that's for sure; that's why there are the tables.
There's no such thing as a quant level anymore; since repos use their own recipes, you can end up with a "q4" that is the size of a q5, for example.
Tiers of VRAM usage are another story: if I pick an arbitrary context value, people will also complain. Also, I only have 12GB of VRAM, so what tiers? 6/8/12 and that's it?
If you want to compare 5 quants just use llama-perplexity, it's pretty quick.
letsgoiowa@reddit
Here I'll do it for you. Hopefully this benefits someone who just wants to pick what fits in VRAM based on model weights+whatever they choose for kv.
Above 8 GB: eaddario/Qwen3.5-9B-Q8_0
8 GB, use for a 3080 with low context or 12 GB GPU for high context: bartowski/Qwen_Qwen3.5-9B-Q6_K_L
7 GB: unsloth/Qwen3.5-9B-Q6_K
6 GB: Thireus/Qwen3.5-9B-5.6704bpw
TitwitMuffbiscuit@reddit (OP)
I updated it, there should be a bit more separation in the plot now.
TheLexikitty@reddit
excellent work, thank you! 💞
TitwitMuffbiscuit@reddit (OP)
You're welcome!
Xamanthas@reddit
If you are going to do this, I strongly suggest using different shapes as well. Say circle for unsloth, square for bart, etc. Just don't use stars lol.
IrisColt@reddit
TitwitMuffbiscuit@reddit (OP)
Here we go, not a single star.
It's so dense it's unreadable tbh. So I'd recommend focusing on the charts instead, even if they're boring.
Xamanthas@reddit
It's a lot better honestly :)
TitwitMuffbiscuit@reddit (OP)
I know, I know, you are right, I'll update with shapes asap, what about some rhombicosidodecahedrons?
srigi@reddit
He has a point. 10% of the population has trouble distinguishing colors. You force those 10% to use a color picker when there is a simple solution using shapes.
Your passive-aggressive joke earns you no internet points in this debate.
TitwitMuffbiscuit@reddit (OP)
That's why I used Tol colorblind friendly colors in the previous plot and why I'm currently redoing the plot with shapes. Give me some time.
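For what it's worth, here is a minimal matplotlib sketch of that kind of plot: one marker shape per repo so it doesn't rely on color alone, Paul Tol's bright (colorblind-friendly) palette, and an optional log-scaled KLD axis. The data points are placeholders:

```python
import matplotlib.pyplot as plt

# Placeholder (size_gib, kld) points grouped by repo.
data = {
    "bartowski":    [(4.846, 0.0257), (5.485, 0.0168), (7.134, 0.0028)],
    "unsloth":      [(4.814, 0.0438), (5.290, 0.0371), (6.946, 0.0031)],
    "mradermacher": [(4.722, 0.0289), (5.241, 0.0396), (6.854, 0.0037)],
}

# One marker per repo, plus Paul Tol's bright palette (colorblind-friendly).
markers = ["o", "s", "^", "D", "v", "P", "X"]
colors = ["#4477AA", "#EE6677", "#228833", "#CCBB44", "#66CCEE", "#AA3377", "#BBBBBB"]

fig, ax = plt.subplots(figsize=(8, 5))
for i, (repo, points) in enumerate(data.items()):
    xs, ys = zip(*points)
    ax.scatter(xs, ys, marker=markers[i % len(markers)],
               color=colors[i % len(colors)], label=repo)

ax.set_xlabel("Size (GiB)")
ax.set_ylabel("Mean KLD vs BF16")
ax.set_yscale("log")   # log scale spreads out the tightly packed low-KLD cluster
ax.legend()
plt.savefig("size_vs_kld.png", dpi=150)
```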
FenderMoon@reddit
You’re fine. People are getting offended for no reason.
Folks who are still complaining need to learn to read, you already said you’re redoing the graph with shapes.
Xamanthas@reddit
xd
MrB0janglez@reddit
Solid eval. The Q8_0 clustering at the top makes sense given it preserves near-lossless weights at a reasonable size.
For anyone running on tighter VRAM the Q4_K_M is still the sweet spot for most use cases. The KLD gap vs Q8_0 is small enough that you won't notice it in real workloads, but you get a meaningful size reduction that actually fits in memory.
Thanks for doing this properly with KLD instead of just PPL -- perplexity alone doesn't tell the full story.
etsmsj@reddit
Coming from an academic perspective, neither PPL nor KLD alone is a good metric of model quality. KLD must also be paired with downstream task evaluation, otherwise it can miss many things. But real task evaluation is really expensive, so generally speaking people are not going to be able to run extensive benchmark suites on their models. What we've seen is that whenever you try to publish a paper that uses any type of proxy metric as a way of saying that your method is good, it will get pushback and not get published. And they are right: we've seen good PPL and KLD on models that end up being bad when evaluated properly. Sadly, KLD is a somewhat cheap alternative that will give you some results to compare in some specific cases. There was a paper at NeurIPS 2024 that talked about this specifically. It's a good read: "Accuracy is Not All You Need"
TitwitMuffbiscuit@reddit (OP)
Oh yeah eval is a rabbit hole.
Last year I translated gsm8k-platinum to my native language to check on quantized models (it's probably saturated with recent models now).
I was using https://github.com/EleutherAI/lm-evaluation-harness
Like, what's the one that is not completely saturated by recent models and representative of the type of tasks I run?
Is it qualitative, or are there bad/vague questions in the dataset?
What's the latest, the quickest to run?
Is it using an LLM as a judge? I mean can we discard all those old benchmarks that used gpt3.5 back in the days?
What's zero-shot, n-shot, etc.?
MMLU-Pro does extraction with regex (to discard the reasoning, for example); it has to be configured.
GPQA Diamond can be run with or without chain of thought, and after an audit the "inherent error rate lower bound is 26.8%".
LiveCodeBench requires adding a new model manually, always consult the errata before picking the task.
Then Math-500 which is saturated and represents 13% of the test.
Eval is hard, PPL/KLD is easy and tbf the metric is different.
KLD measurement is like having the Mona Lisa and a copy and evaluating the quality of the copy; it's not about how beautiful the painting is.
That's why I just tell people it's "faithfulness" not "best".
I also did this quick test to showcase the importance of the imatrix those repo uses: https://huggingface.co/spaces/cmh/Qwen3.5-9B-GGUF-quant-drift
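Since lm-evaluation-harness came up, a typical invocation looks something like the sketch below; the model name and task are examples, and for GGUF quants you would normally point the harness at a local OpenAI-compatible server rather than the `hf` backend, so check the project's README for the right options:

```python
import subprocess

# Example lm-evaluation-harness run against a Hugging Face model (placeholder name).
subprocess.run([
    "lm_eval",
    "--model", "hf",
    "--model_args", "pretrained=Qwen/Qwen3.5-9B",
    "--tasks", "gsm8k",
    "--num_fewshot", "5",
    "--batch_size", "auto",
], check=True)
```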
Top-Rub-4670@reddit
Are you suggesting that PPL/KLD is not a good metric even as a first pass?
If I want to sort all quants by PPL/KLD, then reject all the bottom performers and run real benchmarks only on the top half, am I at risk of eliminating a better-performing quant?
etsmsj@reddit
Yes, you are at risk of eliminating some quants that might be better than you think. And this is not only applicable to quants, but to any quantization method.
Final-Frosting7742@reddit
Amazing work. Do you know if those results still apply for llama.cpp?
TitwitMuffbiscuit@reddit (OP)
Oh yeah absolutely.
You'll get the same results within the margin of error (apparent in the Q8_0 cluster, which are essentially the same despite the slight KLD differences).
They are also all llama.cpp compatible.
I think I'll do a separate post specifically for the ik_llama geeks or those who wanted to test some exotic llama.cpp PR that has not been accepted yet.
ilintar@reddit
u/TitwitMuffbiscuit if you want to compare some nonstandard mainline quants for fun, here's a ~4.5bpw quant done using Q4_DPT and IQ3_QT from https://github.com/ggml-org/llama.cpp/pull/19941
https://huggingface.co/ilintar/Qwen3.5-9B-GGUF/blob/main/Qwen3.5-9B-IQ3_Kv2.gguf
Mean KLD: 0.040836 ± 0.000672
Recipe:
token_embd=q4_dpt
output=q6_k
attn_q=q4_dpt
attn_k=q4_dpt
attn_v=q4_dpt
attn_gate=iq3_tq
attn_qkv=q4_dpt
attn_gate=iq2_tq
ssm_alpha=q8_0
ssm_beta=q8_0
ssm_conv1d=f32
ssm_out=q4_dpt
ffn_gate=iq3_tq
ffn_up=iq3_tq
ffn_down=q4_dpt
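For context on how such a recipe gets applied: mainline llama-quantize accepts repeated --tensor-type overrides, so a per-tensor mix can be driven from the recipe text. A rough sketch; note that types like q4_dpt / iq3_tq only exist in the linked PR build, and the paths and fallback type below are placeholders:

```python
import subprocess

recipe = """
token_embd=q4_dpt
output=q6_k
ffn_gate=iq3_tq
ffn_up=iq3_tq
ffn_down=q4_dpt
""".strip()

# Turn each "tensor=type" line into a --tensor-type override for llama-quantize.
# The positional default type at the end is used for anything not matched above.
args = ["llama-quantize"]
for line in recipe.splitlines():
    args += ["--tensor-type", line.strip()]
args += ["Qwen3.5-9B-BF16.gguf", "Qwen3.5-9B-custom.gguf", "Q4_K_M"]  # placeholder paths

subprocess.run(args, check=True)
```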
TitwitMuffbiscuit@reddit (OP)
Well... since it's not merged yet I'd have to compile llama.cpp; not a big deal if I did a little table below for a couple of quants, but the real problem is that I'm completely out of disk space, I can't even install CUDA and MSVC without deleting quants. I'll do that tonight.
TitwitMuffbiscuit@reddit (OP)
Hi u/ilintar. It will have to be tested against what GGUF-Tool-Suite is able to cook with the ik_llama.cpp Quality preset, at the same bpw ofc, for fairness.
IrisColt@reddit
Might want to use a log scale on the y axis
TitwitMuffbiscuit@reddit (OP)
Yeah I should, it would be more legible, but I thought the non-technical people might struggle a bit.
cviperr33@reddit
Gold mine!! Any data on bigger models? Like the Qwen MoE or the dense 27B, or Gemma 4?
TitwitMuffbiscuit@reddit (OP)
Only my older posts like Qwen3.5-27B Q4 and Qwen3.5-35B-A3B Q4. Gemma 4 will have to wait for now unfortunately; it needs time for llama.cpp to iron out the kinks.
Miserable-Dare5090@reddit
Afaik, seeing Gemma's outputs in Claude Code as the code agent, the kinks are worked out.
TitwitMuffbiscuit@reddit (OP)
I'll wait a bit longer, given the amount of activity on llama.cpp's github repo lately regarding this model.
Also I don't see that many quants as of now, and I don't want to skip a bunch of them because they've been late to the party.
Dorkits@reddit
People like you deserve to be a mod in this sub. Thank you.
fallingdowndizzyvr@reddit
I'm of the opinion that either you post or you mod. I'd rather have OP post.
TitwitMuffbiscuit@reddit (OP)
I'd rather split my scapula but I appreciate the kind words.
amejin@reddit
cyankiwi?
Far-Low-4705@reddit
I really like how you calculate the efficiency; most of the time it is some arbitrary metric that makes no sense.
It is still technically arbitrary, but this is an actual thoughtful efficiency calculation: the normalized distance from a "perfect" quant, rather than, say, multiplying KLD by file size, which I've seen b4...
TitwitMuffbiscuit@reddit (OP)
Thanks. While it's not what you are referring to, I did a drift comparison on 256 tokens previously. https://huggingface.co/spaces/cmh/Qwen3.5-9B-GGUF-quant-drift
fallingdowndizzyvr@reddit
Sweet. Super useful. For me personally, I either tend to go Bartowski or Unsloth. Based on these numbers, I guess I'll be leaning towards Bartowski.
discostupid@reddit
This is amazing, great work.
Is it possible to add Qwen3.5-9B-HLWQ-Q5?
TitwitMuffbiscuit@reddit (OP)
If you meant caiovicentino1/Qwen3.5-9B-Claude-Opus-HLWQ-Q5, I can't, sorry.
It's a finetune (and no gguf available). It would score horribly on this particular KLD comparison.
If you want to have a go, you can use this script (that needs to be updated to include the inset), quantize the finetune and compare to the bf16.
discostupid@reddit
If you type that exact model name it is not the fine-tune. But regardless there is no gguf if you need that.
Thanks
david_0_0@reddit
Interesting comparison. Do you have thoughts on how the newer quantization methods like IQ2_XS compare to traditional GGUF quants? I've noticed some newer methods sacrifice a bit of accuracy for inference speed, but I'm curious if the tradeoff is worth it in practice.
TitwitMuffbiscuit@reddit (OP)
Take what I'll say with a grain of salt, I don't have the hardware to run big models and I don't fw finetunes (I don't care 1bpw about creative writing).
I'm just wondering if it's fair to the labs if I claim that I can run their model at 2bpw when it's obviously crapping itself on agentic tasks compared to baseline.
That said, it really depends on your use case. For preserving tool/function calling and non-English, I'd say 2 and 3 bpw quants are probably okay for 400B+ models. I don't expect much at ~200B.
But then it depends on the model's architecture, training (Qwen models for example are very verbose so your inference will be more prone to drifts given the longer generation) and engine support ofc.
Sorry for the vague but nuanced take.
Positive-Violinist90@reddit
Thank you so much for your contribution, I literally woke up today thinking about running PPL tests over the Qwen models to compare with some experiments that I'm doing. Nice work
TitwitMuffbiscuit@reddit (OP)
You're welcome.
g3n3s1s69@reddit
Perfect write up! I downloaded half a dozen qwen3.5-9b variants to test myself and starred a dozen more in my reddit save, but this easily helps me with sorting through them. Amazing work!
Late_Meal_6034@reddit
Great work!! It would also be really useful to see the same work for the 27B and 35B and for the two Gemma 4 models.
PaceZealousideal6091@reddit
Thanks a lot! This is wonderful! Looks like mradermacher's i1 quants are punching way above their weight. Can you also please update your previous "Qwen3.5-35B-A3B Q4 Quantization Comparison"? It was done before Unsloth updated their quants without the mlx. Also, adding i1 quants to the mix might make things more interesting.
TitwitMuffbiscuit@reddit (OP)
I updated it march 1st. https://www.reddit.com/r/LocalLLaMA/comments/1rfds1h/qwen3535ba3b_q4_quantization_comparison/
Unless you meant something else ?
PaceZealousideal6091@reddit
Oh! I must have missed it. Thanks. It would be lovely if you could add the mradermacher quants as well. Especially i1 quants, given your findings for 9B.
TitwitMuffbiscuit@reddit (OP)
Okay, no promises tho, I'd have to redo the whole thing.
PaceZealousideal6091@reddit
I understand. If you ever get to it, please also consider the fixed quants by u/evilengineer. He claims to have found a few broken tensors from the Qwen source itself which affect the KLD, and fixed them. https://www.reddit.com/r/LocalLLaMA/s/KdBDauoO3B
TitwitMuffbiscuit@reddit (OP)
Got it, thanks.
PaMRxR@reddit
I really wonder what they are doing! So far I was convinced byteshape are unbeatable per-byte, but it doesn't quite seem so with these results.
Healthy-Nebula-3603@reddit
So, as we've known for more than a year, Q4_K_M is the sweet spot :)
TitwitMuffbiscuit@reddit (OP)
For such a small model it's not that simple and most of the time people pick the largest quant that fits their vram, so it's a bit more nuanced.
PiratesOfTheArctic@reddit
The graph is fantastic; for someone like me who can't get to grips with the different names, a visual guide is cracking.
Thank you
insanemal@reddit
This is fantastic.
Also it confirms something I've long suspected.
dampflokfreund@reddit
and what would that be?
insanemal@reddit
Oh a bunch of stuff, but mainly it shows what I thought it would about the drift from the base.
bonobomaster@reddit
This is awesome. Thank you!
If you ever feel bored at some point in time, I would really be interested in Qwen3.5-27B quant performance. :D
TitwitMuffbiscuit@reddit (OP)
I did Q4 a while back because I only have an RTX 3060 12GB. https://www.reddit.com/r/LocalLLaMA/comments/1rk5qmr/qwen3527b_q4_quantization_comparison/
bonobomaster@reddit
Yeah, I just found that, after stalking your profile.
Guess I'm gonna check out some Bartowski quants!
P.S.: I agree with the other poster regarding the shapes for the plot. The delta E between the colors you chose is sometimes so small that it is really difficult to see what is what.
Great job anyways. Thanks!
dampflokfreund@reddit
Excellent, fantastic work yet again, very valuable stuff. Can you do Gemma 4 too please? Especially the MoE; I wonder how much lower quants impact it.
TitwitMuffbiscuit@reddit (OP)
Thank you. For Gemma 4, I need to wait for llama.cpp to iron things out, more repos to pop up and the old ones to update, but I'm definitely planning to do them in the near future, 100%.