Gemma 4 31B GGUF quants ranked by KL divergence (unsloth, bartowski, lmstudio-community, ggml-org)
Posted by oobabooga4@reddit | LocalLLaMA | View on Reddit | 77 comments
Top-Rub-4670@reddit
Another interesting observation from the data you've collected is that there is a marginal improvement from K_M to K_L, but basically none from K_L to K_XL (across non-UD quants).
suprjami@reddit
Can we stop using Wikipedia text for KLD?
As Unsloth have said in their Qwen 3.5 research post, measuring KLD against a dataset with instructions, tool calling, and code examples produces a more useful result.
imo this is potentially interesting but currently no more relevant than perplexity.
I hope Unsloth repeat their Qwen quantisation and KLD research with this model too.
Top-Rub-4670@reddit
Maybe you should have read the post, they specifically did not use Wikipedia. https://localbench.substack.com/i/193437959/dataset
danielhanchen@reddit
Nice benchmarks and great work!
suprjami@reddit
Are you planning to repeat your Qwen quant research and your KLD dataset with Gemma 4 too? imo that's actually really useful and produces the best local models.
xspider2000@reddit
So which quant is pareto-optimal?
StepJumpy4782@reddit
wow super cool great work.
could you explain more about what the long-document dataset and prompts were? since that is a notable use case for me, and your data showing it performing the worst is interesting to me.
how are you running these? among the latest Qwen models you said you were working on, could we get some analysis like this for the really big ones (GLM 5.1 just came out :))? anything the community could help with in that regard?
-p-e-w-@reddit
Fascinating results. Q8 is often described as “indistinguishable” from FP, yet according to your numbers, even with greedy decoding 1 in 10 tokens is different. That seems quite significant.
oobabooga4@reddit (OP)
I was surprised by this too, maybe q8_0 = almost lossless was true back in 2023 when models were undertrained. Now with a 31B that beats previous models 10x its size, maybe those extra bits aren't so dispensable anymore.
Due-Memory-6957@reddit
Not really, Q8 is described as a very good quant, "almost" indistinguishable, not actually indistinguishable. What is described as indistinguishable is FP16. It's like 320kbps MP3 vs FLAC vs WAV
AnonLlamaThrowaway@reddit
One thing I'd like to find out in particular is how much the use of an imatrix can penalize:
The information I've found on the subject seems to imply that the imatrix's data can significantly bias a model towards English & shorter contexts during quantization... and it would make sense that the KLD benefit is coming from there.
WhoRoger@reddit
Looking at the chart closer, I did not expect IQ4_XS to fare that well. When I go for Q4, I usually pick IQ4NL these days. So that's interesting.
Would you care to also test weighted/imatrix on better quants? I think those are interesting and sorta unloved.
Would love to see some representation of Heretics too, at least one or two quants for reference...
WhoRoger@reddit
If even Q8 gives 0.2 div, why don't people use Heretic models by default when they only add like 0.01?
These corpo models are so sensitive, it's very easy to run into the censor by accident. Unsloth and Bartowski are great and all, but I don't use them because I flip out if my computer starts lecturing me or dances around a legit question, so I'm not risking it.
No-Setting8461@reddit
personally I don't use heretic cus (1) I don't have a need for uncensored (academic stuff only) and (2) KLD/PPL don't capture the full picture of what un-censoring might be doing to the model.
WhoRoger@reddit
From my totally unscientific testing, it seems Heretics tend to hallucinate marginally less. But it's hard to say where the difference is exactly, since people tend to quant one or the other. Either way, quants from different providers definitely differ way more, never mind different quant levels.
CircularSeasoning@reddit
Thanks for doing this.
silenceimpaired@reddit
Shame no creative writing/editing testing was done, but I know I’m a minority
brown2green@reddit
This looks like it's a significant finding. Most people assume Q8_0 to be virtually the same as BF16.
Due-Memory-6957@reddit
Haven't we seen that this isn't the case since Llama 3? I remember a lot of discussion around then about how, as training data size increases, the more we lose when we finetune and quant.
ambient_temp_xeno@reddit
Back in the day it did seem like q8 quants were 'virtually lossless' for all practical purposes. Longer contexts and higher model temps probably changed this, although it might still be practical to not worry about it - people are wanting to quant their KV cache to q8 now after all.
Awkward-Boat1922@reddit
I used to think Q4KL was virtually lossless until I tried aider bench.
No-Refrigerator-1672@reddit
From what I remember, models at Q8 retain their benchmark scores within a few percent. As most things LLMs do don't have a single correct answer, it's possible that the KL divergence is an indication of the model choosing another correct answer, not simply losing intelligence. But don't quote me on that.
ambient_temp_xeno@reddit
Lower is better in KL divergence. I get what you're saying though, and since reasoning models self-correct, it's probably more likely they can come up with the right answer.
SnooPaintings8639@reddit
But is KL on long documents actually a good measure of quality? I am a bit of a noob in ML, but for long generation KL would mean more variance, which is correlated with lower quality, but not necessarily in a linear fashion. I.e. KL might show large divergence while more real-world benchmarks still show near no degradation on functional task scores (which I believe were the basis of "q8 equals near lossless").
oobabooga4@reddit (OP)
I was surprised by this as well.
brown2green@reddit
Maybe not so surprising since people mostly do measurements on wikitext with 512 tokens context.
Could we have a graph showing KLD broken down by task, perhaps with the best quantizations for a given size range?
How long are the "long documents" in your dataset?
oobabooga4@reddit (OP)
The longest prompts are around 30k tokens.
Sure, I just updated the blog post with per-category KLD plots. They are at the bottom of the post.
brown2green@reddit
Thanks for the plots. I meant doing something like this: https://i.imgur.com/dOte8Yr.png
I find the data strange, though, because between Q6_K and Q8_0 there's not much difference for all tasks (including Long Documents), so the gap from BF16 is hard to explain.
Septerium@reddit
I am confused now. Some people say KLD is a better metric compared to perplexity because it would not depend on the dataset
oobabooga4@reddit (OP)
KLD depends on the evaluation dataset, same as perplexity. The difference is that KLD measures divergence from a reference model rather than from the training data.
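To make the distinction concrete, here's a toy sketch (my own illustration with made-up numbers, not OP's actual pipeline): perplexity scores one model's logprobs against the dataset's actual tokens, while KLD compares two models' full distributions at each position.

```python
import math

def perplexity(token_logprobs):
    """Perplexity: how surprised ONE model is by the dataset's actual tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def kl_divergence(p, q):
    """KL(P||Q): how far the quantized model's distribution Q diverges
    from the reference model's distribution P at one token position."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up distributions over a 3-token vocab: reference vs quantized model.
ref = [0.7, 0.2, 0.1]
quant = [0.6, 0.3, 0.1]
print(kl_divergence(ref, quant))  # small positive number
print(kl_divergence(ref, ref))    # 0.0 -- identical distributions
```

Note the reference in KLD is the unquantized model, so the metric isolates quantization damage regardless of how "surprising" the dataset itself is.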
No-Refrigerator-1672@reddit
I believe it is down to the model architecture - it is using sliding window attention in some layers, which may lead to accumulation of errors over context length at low precisions.
oobabooga4@reddit (OP)
The same thing happens for Qwen 3.5 27b (I was in the middle of benchmarking the Qwens when Gemma 4 dropped, so I finished Gemma first, these things take forever to collect).
Far-Low-4705@reddit
this is really good stuff, please keep doing this for new models, we really need to benchmark stuff instead of "trust me bro" AI hype.
also try to keep your data set private, if ppl making the quants get a hold of it then it's useless.
also, i think it is really important to test long context as you did. this is rarely tested, but i think it is the most important since this is considered the "worst case" scenario for this sort of stuff.
and a depth of only 2k tokens is completely unrealistic, especially with agentic work.
Q8_0 showing a KL of 0.45 on long documents is very telling, and honestly aligns with what I've been suspecting this whole time.
Far-Low-4705@reddit
i really dont think that is the case.
ofc it is a large factor, but i think KL divergence has only really traditionally been tested on shorter inputs, not long context, which models tend to fall apart on.
not to mention, dynamic quants only really pay attention to lower contexts as well since again it is far cheaper than checking every depth
MichiruMatsushima@reddit
Wait a minute... ~0.5 at Q4KM?! Shouldn't KLD normally stay below 0.1 for such a non-aggressive quant? I'm pretty sure I've seen like 0.01 - 0.03 for other models, with Q5 getting into 0.00X territory.
oobabooga4@reddit (OP)
People usually benchmark KLD with wikipedia at low contexts. It's a lot easier to score well there.
Awkward-Boat1922@reddit
This is no doubt the first time we've witnessed the true penalties of quantisation.
MichiruMatsushima@reddit
Oh, I see. Well, that doesn't make it any less confusing. I mean, are the GGUFs we have now the best possible? Or is there room for bringing Q8 to this elusive idea of 'lossless' quantization?
Pentium95@reddit
I wonder how ubergarmin's ik_llama.cpp quants perform here. That would be a very interesting benchmark
Awkward-Boat1922@reddit
If I could upvote this a thousand times... much appreciated!
I don't suppose you've done the same for Qwen3.5?
Fresh_Month_2594@reddit
Amazing work! I've always just used Q8 but my use cases are mostly long context. May need to reconsider blindly using Q8.
guiopen@reddit
Incredible benchmarks, thank you
ThrowawayProgress99@reddit
Huggingface says Unsloth's gemma-4-31B-it-UD-IQ2_M.gguf is 10.8 GB but I just noticed the download bar says it's only 10 GB. Similar thing happened with their Qwen3.5-27B-UD-IQ3_XXS.gguf, which says it's 11.5 GB, but is 10.7 GB. I chose that Qwen quant because of some graphs that showed it wasn't that bad. I haven't used it extensively but it seems fine to me too.
Between gemma-4-31B-it-UD-IQ3_XXS.gguf and gemma-4-31B-it-UD-Q2_K_XL.gguf, which should I choose? They're both 11.8 GB on Huggingface (while their Qwen GGUFs have the latter 0.3 smaller), so probably just ~11 GB on disk. The graph here says the latter is both better and smaller, but I thought higher quant levels were supposed to be better?
RandomTrollface@reddit
They seem basically within margin of error on the benchmarks. I am using the UD-IQ3_XXS personally and it works fine for me. But I haven't tried the Q2_K_XL to compare.
tmvr@reddit
The difference in size is just in the representation - GB vs. GiB:
10.7 GiB = 10 957 MiB = 11 219 763 KiB = 11 489 037 517 bytes = ~11.5 GB
No-Setting8461@reddit
id go for the IQ3. more bits.
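tmvr's GB-vs-GiB conversion above is just the 1024³-vs-1000³ factor; a quick sketch (`gib_to_gb` is a hypothetical helper name):

```python
GIB = 1024 ** 3  # binary gibibyte, what many download tools report
GB = 1000 ** 3   # decimal gigabyte, what HF's file listing shows

def gib_to_gb(gib: float) -> float:
    """Convert binary GiB to decimal GB."""
    return gib * GIB / GB

print(round(gib_to_gb(10.7), 1))  # -> 11.5, matching the chain above
```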
hyperdeath666@reddit
This is really good stuff, thanks OP.
Unsloth folks, if you're reading this, I'd love to hear more about UD-Q8_K_XL (the quant I've been using). Does it have other virtues that KL-divergence doesn't capture? Do you have any internal benchmarks of Gemma 4 31B you can share? Not asking this in an accusatory way or anything - really appreciate your work and just looking to learn more.
Beginning-Window-115@reddit
what's the point of using dynamic 8 bit when you could just use normal 8 bit
Separate-Forever-447@reddit
Because 8-bit is not the gold standard, it's also a quant, and q8_k_xl retains more information (~8.5 bits/weight) than 8-bit (8 bits/weight)
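A rough back-of-the-envelope for what those bits/weight mean on disk (hypothetical helper; real GGUF files also carry metadata and mixed-precision tensors, so actual sizes differ):

```python
def estimated_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough on-disk size: parameter count times bits per weight, in decimal GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# For a 31B model: plain Q8_0 (~8 bpw) vs a UD 8-bit quant (~8.5 bpw)
print(round(estimated_size_gb(31, 8.0), 1))  # -> 31.0
print(round(estimated_size_gb(31, 8.5), 1))  # -> 32.9
```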
Separate-Forever-447@reddit
They do. And they address your specific points...
"lower perplexity or KLD doesn’t necessarily translate to better real-world performance"
https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
danielhanchen@reddit
Hi yes - so sometimes this is a quirk of quantization - sometimes a combination of BF16 and Q8_0 can change dynamics - we're still actively improving the recipe - but note KLD and PPL are generally better for larger quants
Septerium@reddit
I have always preferred Q8_0 over UD-Q8_K_XL
It runs faster, eats less VRAM and I've never seen any slight quality difference between them
StorageHungry8380@reddit
N00b question regarding the final KL number. Presumably the actual tokens that are part of the top-40 tokens varies from prompt to prompt, and may not fully overlap between test model and reference model, so how exactly is the individual KL divergence calculated? And how are the individual KL figures aggregated?
oobabooga4@reddit (OP)
For each token position, both models return their top-40 token log-probabilities. The KL is computed over the union of tokens that appear in either top-40 list. For tokens missing from one list, a floor logprob is used (lowest observed logprob in that distribution minus 2). The per-token KL values are then averaged across all tokens and all prompts.
Top-40 covers virtually all the probability mass in practice, so the approximation error is negligible.
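As a sketch, that procedure might look something like this (my paraphrase of OP's description; function and variable names are my own, and I leave the distributions unnormalized over the union, as the negligible-error note suggests):

```python
import math

FLOOR_OFFSET = 2.0  # penalty applied to tokens missing from a top-k list

def kl_over_topk(ref_top: dict, quant_top: dict) -> float:
    """KL(ref || quant) over the union of both models' top-k token logprobs.
    Tokens absent from one model's list get that list's minimum logprob
    minus FLOOR_OFFSET, per OP's description."""
    union = set(ref_top) | set(quant_top)
    ref_floor = min(ref_top.values()) - FLOOR_OFFSET
    quant_floor = min(quant_top.values()) - FLOOR_OFFSET
    kl = 0.0
    for tok in union:
        lp = ref_top.get(tok, ref_floor)      # reference logprob
        lq = quant_top.get(tok, quant_floor)  # quantized-model logprob
        kl += math.exp(lp) * (lp - lq)        # p * log(p / q)
    return kl
```

The per-position values from this function would then be averaged across all token positions and all prompts to give the final score.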
StorageHungry8380@reddit
Thank you for the details. Again, not an expert, but given that hallucinations and similar are a concern, wouldn't it be relevant to include the P99 metric or similar as well as the average? It seems the straight average can suppress cases where the model sometimes goes completely off the rails.
I'd rather have a model which is mostly just a tad worse, compared to one which is pretty good but occasionally catastrophically bad.
LeonTheTaken@reddit
Beautiful. Can you provide the file for KL divergence test? Also, can you do Qwen3.5-35B-A3B or Qwen3.5-27B next?
oobabooga4@reddit (OP)
Qwen 3.5 is next.
LeonTheTaken@reddit
Cool.
Potential-Gold5298@reddit
Interesting information, thank you. The large dense model tolerates quantization quite well. It would be very interesting to see the results of the Gemma 4 26B-A4B, since MoE is usually more sensitive to quantization. ggml-org and lmstudio-community apparently use the quantizer without iMatrix, which gives the same results for everyone. You can use mradermacher's quants (without i1) since it has a wider selection (from Q2 to Q8), and the result will be the same as ggml-org/lmstudio-community. Especially interested in Q5_K_M, Q4_K_M, and Q6_K.
sine120@reddit
Welp, I guess me and my 16GB VRAM are going Q2 for this one.
xXprayerwarrior69Xx@reddit
what is considered a "good" result? why is it happening mostly on the long documents?
2022HousingMarketlol@reddit
From the third paragraph of the linked article:
xXprayerwarrior69Xx@reddit
Lower is better, I understand, but how low is good?
2022HousingMarketlol@reddit
Good is subjective really - it's just like using a "smaller" model.
You'll always want to have the largest quant you can host without sacrificing other things like cache and performance.
I tend to do this: find how big of a model I can run. For me it's normally around 15gb. I find the model I want to run (gemma-4-26B-A4B-it-GGUF), see it's a 50gb big boy at 16 bit. I need to get a quantization of this model to run it and get the performance needed to make it worthwhile. As I go down the quantization ladder from 16->8->6->5 etc the model gets smaller and smaller but less precise. Imagine if your accountant could only count to 1000, that could be fine for most day to day transactions but at the end of the year when you do your taxes he may start making mistakes. It's a similar deal, written word probably won't be a problem. Complex logic problems will start to degrade etc.
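The size scaling in that walkthrough is roughly linear in bits per weight; a hypothetical sketch (real quants mix bit widths per tensor, so treat these as approximations):

```python
def quant_size_gb(fp16_size_gb: float, bits_per_weight: float) -> float:
    """Scale a model's 16-bit file size down to a given quantization level."""
    return fp16_size_gb * bits_per_weight / 16

# Walking the ladder for a model that's ~50 GB at 16 bit:
for bits in (16, 8, 6, 5, 4, 2):
    print(f"{bits:>2}-bit: ~{quant_size_gb(50.0, bits):.1f} GB")
```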
xXprayerwarrior69Xx@reddit
Thanks a lot for the detailed explanation !
SSOMGDSJD@reddit
I appreciate you running these tests and specifically calling out the weakness in wiki testing. I ran into that myself when trying to build out a speculative expert prefetch system, and had promising results from wiki data, but it fell apart when I introduced my own API call data to the testing.
However given the diversity of your test sets, we're basically confirming that the least used weights get crushed the most and that it does affect their quality, right? So the smart quants are working as intended?
So really if you're using a quant model for anything besides agentic/scientific workloads, you're going to want a quant specifically tuned for what you are asking the model to do regularly
Also, what sources did you use for the science category?
2022HousingMarketlol@reddit
Thanks for posting this, it's hard to decide sometimes with all the options out there.
a_beautiful_rhind@reddit
Something is still goofed with this model. It's not acting like the API did and I can run up to BF16.
TacGibs@reddit
Are you running it with vLLM in safetensors ?
a_beautiful_rhind@reddit
I'm running in IK/mainline but I am going to have to update vLLM and do that.
RegularRecipe6175@reddit
Excellent work
hajime-owari@reddit
Very informative.
I hope you can do it for the 26B model as well.
dampflokfreund@reddit
Yes please, that would be very interesting. How does an MoE with just 4b active parameters handle quantization?
draconisx4@reddit
Nice ranking on those Gemma quants using KL divergence, that's a smart way to spot quality drops early. If you're optimizing for local setups, double-check how these affect inference speed on hardware like consumer GPUs.
dampflokfreund@reddit
Very nice. Always love to see quant comparisons, we need more of them. Good job!
Icy-Degree6161@reddit
Awesome job, thank you
Embarrassed_Soup_279@reddit