Gemma 4 31B GGUF quants ranked by KL divergence (unsloth, bartowski, lmstudio-community, ggml-org)

[-]

Top-Rub-4670@reddit

Another interesting observation from the data you've collected is that there is a marginal improvement from K_M to K_L, but basically none from K_L to K_XL (across non-UD quants).

[-]

suprjami@reddit

Can we stop using Wikipedia text for KLD?

As Unsloth have said in their Qwen 3.5 research post, measuring KLD against a dataset with instructions, tool calling, and code examples produces a more useful result.

imo this is potentially interesting but currently no more relevant than perplexity.

I hope Unsloth repeat their Qwen quantisation and KLD research with this model too.

[-]

Top-Rub-4670@reddit

Can we stop using Wikipedia text for KLD? imo this post is potentially interesting but currently no more relevant than perplexity.

Maybe you should have read the post, they specifically did not use Wikipedia. https://localbench.substack.com/i/193437959/dataset

[-]

suprjami@reddit

Are you planning to repeat your Qwen quant research and your KLD dataset with Gemma 4 too? imo that's actually really useful and produces the best local models.

[-]

StepJumpy4782@reddit

wow super cool great work.

could you explain more on what long document dataset and prompts were? since that is a notable usecase for me and your data showing it performing the worst is interesting to me.

how are you running these? among latest qwen you said were working on, could we get some analysis like this for the really big ones (GLM 5.1 just came out :)). anything the community could help with in that regard?

[-]

-p-e-w-@reddit

Fascinating results. Q8 is often described as “indistinguishable” from FP, yet according to your numbers, even with greedy decoding 1 in 10 tokens is different. That seems quite significant.

[-]

oobabooga4@reddit (OP)

I was surprised by this too, maybe q8_0 = almost lossless was true back in 2023 when models were undertrained. Now with a 31B that beats previous models 10x its size, maybe those extra bits aren't so dispensable anymore.

[-]

Due-Memory-6957@reddit

Not really, Q8 is described as a very good quant, "almost" indistinguishable, not actually indistinguishable. What is described as indistinguishable is FP16. It's like 320kbs vs FLAC VS WAVE

[-]

AnonLlamaThrowaway@reddit

One thing I'd like to find out in particular is how much the use of an imatrix can penalize:

non-English languages
quality of very long contexts

The information I've found on the subject seems to imply that the imatrix's data can significantly bias a model towards English & shorter contexts during quantization... and it would make sense that the KLD benefit is coming from there (la

[-]

WhoRoger@reddit

Looking at the chart closer, I did not expect IQ4_XS to fare that well. When I go for Q4, I usually pick IQ4NL these days. So that's interesting.

Would you care to also test weighted/imatrix on better quants? I think those are interesting and sorta unloved.

Would love to see some representation of Heretics too, at least one or two quants for reference...

[-]

WhoRoger@reddit

If even Q8 gives 0.2 div, why don't people by default use Heretic models that add like 0.01?

These corpo models are so sensitive, it's very easy to run into the censor by accident. Unsloth and Bartowski are great and all, but I don't use them because I flip out if my computer starts lecturing me or dances around a legit question, so I'm not risking it.

[-]

No-Setting8461@reddit

personally I dont use heretic cus (1) I dont have a need for uncensored (academic stuff only) and (2) KLD/PPL don't capture the full picture of what un-censoring might be doing to the model.

[-]

WhoRoger@reddit

From my totally unscientific testing, it seems Heretics tend to hallucinate marginally less. But it's hard to say where the difference is exactly, since people tend to quant one or the other. Either way, quants from different providers definitely differ way more, never mind different quants.

[-]

CircularSeasoning@reddit

Thanks for doing this.

[-]

silenceimpaired@reddit

Shame no creative writing /editing testing was done, but I know I’m a minority

[-]

brown2green@reddit

Even Q8_0 shows a KL of 0.45 on long documents and 0.24 on non-Latin scripts. All categories roughly double from Q8_0 to Q5_K_S, but science and tool use remain the lowest throughout (0.07 and 0.08 at Q8_0).

This looks like it's a significant finding. Most people assume Q8_0 to be virtually the same as BF16.

[-]

Due-Memory-6957@reddit

Haven't we seen that this isn't the case since Llama 3? I remember a lot of discussions around them about how as the training data size increases, more we start to lose when we finetune and quant.

[-]

ambient_temp_xeno@reddit

Back in the day it did seem like q8 quants were 'virtually lossless' for all practical purposes. Longer contexts and higher model temps probably changed this, although it might still be practical to not worry about it - people are wanting to quant their KV cache to q8 now after all.

[-]

Awkward-Boat1922@reddit

I used to think Q4KL was virtually lossless until I tried aider bench.

[-]

No-Refrigerator-1672@reddit

From what I remember, models at Q8 retain their benchmark score within few percents. As most things LLMs do don't have a single correct answer, it's possible that KL low KL divergence is an indication of model choosing another correct answer, not just simply losing the intelligence. But don't quote me on that.

[-]

ambient_temp_xeno@reddit

Lower is better in KL divergence. I get what you're saying though, and since reasoning models self-correct, it's probably more likely they can come up with the right answer.

[-]

SnooPaintings8639@reddit

But is KL on long document actually a good measure of quality? I am a bit of a noob in ML, but for long generation KL would mean more variance, which is correlated with lower quality, but not necessarily in a linear fashion. I.e. KL might show large divergence but more real-world benchmarks still show near no degradation in more functional tasks scores (which I believe were the basis of q8 equals near lossless)

[-]

oobabooga4@reddit (OP)

I was surprised by this as well.

[-]

brown2green@reddit

Maybe not so surprising since people mostly do measurements on wikitext with 512 tokens context.

Could we have a graph showing KLD broken down by task, perhaps with the best quantizations for a given size range?

How long are the "long documents" in your dataset?

[-]

oobabooga4@reddit (OP)

The longest prompts are around 30k tokens.

Could we have a graph showing KLD broken down by task, perhaps with the best quantizations for a given size range?

Sure, I just updated the blog post with per-category KLD plots. They are at the bottom of the post.

[-]

brown2green@reddit

Thanks for the plots. I meant doing something like this: https://i.imgur.com/dOte8Yr.png

I find the data strange, though, because between Q6_K and Q8_0 there's not much difference for all tasks (including Long Documents), so the gap from BF16 is hard to explain.

[-]

Septerium@reddit

I am confused now. Some people say KLD is a better metric compared to perplexity because it would not depend on the dataset

[-]

oobabooga4@reddit (OP)

KLD depends on the evaluation dataset, same as perplexity. The difference is that KLD measures divergence from a reference model rather than from the training data.

[-]

No-Refrigerator-1672@reddit

I believe it is down to the model architecture - it is using sliding window attention in some layers, which may lead to accumulation of errors over context length at low precisions.

[-]

oobabooga4@reddit (OP)

The same thing happens for Qwen 3.5 27b (I was in the middle of benchmarking the Qwens when Gemma 4 dropped, so I finished Gemma first, these things take forever to collect).

[-]

Far-Low-4705@reddit

this is really good stuff, please keep doing this for new models, we really need to benchmark stuff instead of "trust me bro" AI hype.

also try to keep your data set private, if ppl making the quants get a hold of it then it's useless.

also, i think it is really important to test long context as you did. this is rarely tested, but i think it is the most important since this is considered the "worst case" scenario for this sort of stuff.

and a depth of only 2k tokens is completely unrealistic, especially with agentic work.

Q8_0 showing a KL of 0.45 on long documents is very telling, and honestly aligns with what I've been suspecting this whole time.

[-]

Far-Low-4705@reddit

i really dont think that is the case.

ofc it is a large factor, but i think KL divergence has only really traditionally been tested on shorter inputs, not long context, which models tend to fall appart on.

not to mention, dynamic quants only really pay attention to lower contexts as well since again it is far cheaper than checking every depth

[-]

MichiruMatsushima@reddit

Wait a minute... \~0.5 at Q4KM?! Shouldn't KLD normally stay below 0.1 for such a non-aggressive quant? I'm pretty sure I've seen like 0.01 - 0.03 for other models, with Q5 getting into 0.00X territory.

[-]

oobabooga4@reddit (OP)

People usually benchmark KLD with wikipedia at low contexts. It's a lot easier to score well there.

[-]

Awkward-Boat1922@reddit

This is no doubt the first time we've witnessed the true penalties of quantisation.

[-]

MichiruMatsushima@reddit

Oh, I see. Well, that doesn't make it any less confusing. I mean, are the GGUFs we have now the best possible? Or is there a room for bringing Q8 to this elusive idea of 'lossless' quantization?

[-]

Pentium95@reddit

I wonder how ubergarmin's ik_llama.cpp quants perform here. That would be a very interesting benchmark

[-]

Awkward-Boat1922@reddit

If I could upvote this a thousand times... much appreciated!

I don't suppose you've done the same for Qwen3.5?

[-]

Fresh_Month_2594@reddit

Amazing work ! I've always just used Q8 but my use cases are mostly long context. May need to reconsider blindly using Q8.

[-]

guiopen@reddit

Incredible benchmarks, thank you

[-]

ThrowawayProgress99@reddit

Huggingface says Unsloth's gemma-4-31B-it-UD-IQ2_M.gguf is 10.8 GB but I just noticed the download bar says it's only 10 GB. Similar thing happened with their Qwen3.5-27B-UD-IQ3_XXS.gguf, which says it's 11.5 GB, but is 10.7 GB. I chose that Qwen quant because of some graphs that showed it wasn't that bad. I haven't used it extensively but it seems fine to me too.

Between gemma-4-31B-it-UD-IQ3_XXS.gguf and gemma-4-31B-it-UD-Q2_K_XL.gguf, which should I choose? They're both 11.8 GB on Huggingface (while their Qwen GGUFs have the latter at .3 smaller), so probably just \~11 GB on disk. The graph here says the latter is both better and smaller, but I thought higher quant levels were supposed to be better?

[-]

RandomTrollface@reddit

They seem basically within margin of error on the benchmarks. I am using the UD_IQ3_XSS personally and it works fine for me. But I haven't tried the Q2_K_XL to compare.

[-]

tmvr@reddit

The difference in size is just in the representation - GB vs. GiB:

10.7 GiB = 10 957 MiB = 11 219 763 KiB = 11 489 037 517 bytes = ~11.5 GB

[-]

No-Setting8461@reddit

id go for the IQ3. more bits.

[-]

hyperdeath666@reddit

This is really good stuff, thanks OP.

Unsloth folks, if you're reading this, I'd love to hear more about UD-Q8_K_XL (the quant I've been using). Does it have other virtues that KL-divergence doesn't capture? Do you have any internal benchmarks of Gemma 4 31B you can share? Not asking this in an accusatory way or anything - really appreciate your work and just looking to learn more.

[-]

Beginning-Window-115@reddit

what's the point of using dynamic 8 bit when you could just use normal 8 bit

[-]

Separate-Forever-447@reddit

Because 8-bit is not the gold standard, its also a quant, and q8_k_xl retains more information (\~8.5 bits/weight) than 8-bit (8 bits/weight)

[-]

Separate-Forever-447@reddit

They do. And they address your specific points...

"lower perplexity or KLD doesn’t necessarily translate to better real-world performance"

https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

[-]

danielhanchen@reddit

Hi yes - so sometimes this is a quirk of quantization - sometimes a combination of BF16 and Q8_0 can change dynamics - we're still actively improving the recipe - but not KLD and PPL generally for larger quants is better

[-]

Septerium@reddit

I have always preferred Q8_0 over UD-Q8_K_XL
It runs faster, eats less VRAM and I've never seen any slight quality difference between them

[-]

StorageHungry8380@reddit

N00b question regarding the final KL number. Presumably the actual tokens that are part of the top-40 tokens varies from prompt to prompt, and may not fully overlap between test model and reference model, so how exactly is the individual KL divergence calculated? And how are the individual KL figures aggregated?

[-]

oobabooga4@reddit (OP)

For each token position, both models return their top-40 token log-probabilities. The KL is computed over the union of tokens that appear in either top-40 list. For tokens missing from one list, a floor logprob is used (lowest observed logprob in that distribution minus 2). The per-token KL values are then averaged across all tokens and all prompts.

Top-40 covers virtually all the probability mass in practice, so the approximation error is negligible.

[-]

StorageHungry8380@reddit

Thank you for the details. Again, not an expert, but given that hallucinations and similar is a concern, wouldn't it be relevant to include the P99 metric or similar as well as the average? It seems the straight average can suppress if the model sometimes goes completely off the rails.

I'd rather have a model which is mostly just a tad worse, compared to one which is pretty good but occasionally catastrophically bad.

[-]

LeonTheTaken@reddit

Beautiful. Can you provide the file for KL divergence test? Also, can you do Qwen3.5-35B-A3B or Qwen3.5-27B next?

[-]

oobabooga4@reddit (OP)

Qwen 3.5 is next.

[-]

LeonTheTaken@reddit

Cool.

[-]

Potential-Gold5298@reddit

Interesting information, thank you. The large dense model tolerates quantization quite well. It would be very interesting to see the results of the Gemma 4 26B-A4B, since MoE is usually more sensitive to quantization. ggml-org and lmstudio-community apparently use the quantizer without iMatrix, which gives the same results for everyone. You can use mradermacher's quants (without i1) since it has a wider selection (from Q2 to Q8), and the result will be the same as ggml-org /lmstudio-community . Especially interested in Q5_K_M, Q4_K_M, and Q6_K.

[-]

sine120@reddit

Welp, I guess me and my 16GB VRAM are going Q2 for this one.

[-]

xXprayerwarrior69Xx@reddit

what is considered "good" result ? why is it happening mostly on the long documents ?

[-]

2022HousingMarketlol@reddit

From the third paragraph of the linked article:

Lower KL divergence is better. It measures how different the quantized model’s token probability distribution is relative to the original model. A KL of 0 would mean the quant is identical to the original.

[-]

xXprayerwarrior69Xx@reddit

Lower I understand but how low is good ?

[-]

2022HousingMarketlol@reddit

Good is subjective really - it's just like using a "smaller" model.

You'll always want to have the largest quant you can host without sacrificing other things like cache and performance.

I tend to do this, find how big of a model I can run. For me its normally around 15gb. I find the model I want to run ( gemma-4-26B-A4B-it-GGUF) see its a 50gb big boy at 16 bit. I need to get a quantization of this model to run it and get the performance needed to make it worth while. As I go down the quantization last from 16->8->6->5 etc the model gets smaller and smaller but gets less precise. Imagine if your accountant could only count to 1000, that could be fine for most day to day transactions but at the end of the ear when you do your taxes he may start making mistakes. It's a similar deal, written word probably wont be a problem. Complex logic problems will start to degrade etc.

[-]

xXprayerwarrior69Xx@reddit

Thanks a lot for the detailed explanation !

[-]

SSOMGDSJD@reddit

I appreciate you running these tests and specifically calling out the weakness in wiki testing. I ran into that myself when trying to build out a speculative expert prefetch system, and had promising results from wiki data, but it fell apart when I introduced my own API call data to the testing.

However given the diversity of your test sets, we're basically confirming that the least used weights get crushed the most and that it does affect their quality, right? So the smart quants are working as intended?

So really if you're using a quant model for anything besides agentic/scientific workloads, you're going to want a quant specifically tuned for what you are asking the model to do regularly

Also, what sources did you use for the science category?

[-]