Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results
Posted by oobabooga4@reddit | LocalLLaMA | View on Reddit | 67 comments
Septerium@reddit
So quantizing kv cache is still horrible
stddealer@reddit
KLD only measures how different the model's predictions are compared to the baseline. Different doesn't necessarily mean worse. Comparing benchmark scores should give a clearer picture of how much it actually degrades the model.
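For reference (and as far as I understand oobabooga's setup), the number being reported is the standard per-token KL divergence between the baseline distribution p (full-precision cache) and the quantized-cache distribution q, averaged over token positions:

$$
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{i \in \mathrm{vocab}} p_i \log \frac{p_i}{q_i}
$$

It's zero only when the two distributions match exactly, but a nonzero value says nothing by itself about which distribution gives better answers.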
I've been using a Q4_K quant of Gemma4 31B with Q8_0 cache quite a lot and it didn't feel dumb at all, about on par with a Q5_K_M quant of Qwen3.5 27B also with Q8 cache (a bit worse at coding but better at manipulating language and understanding real-world situations).
unjustifiably_angry@reddit
copingest cope that ever coped
I swear there are people who would watch John ChatGPT himself go on stage and tell you cache quantization is harmful, and still ask him for stats, methodology, and whether he really knows what he's talking about.
stddealer@reddit
Gemma3 QAT Q4_0 had way worse KLD than a Q4_0 quant of Gemma3, yet it was absolutely the better model and scored much better in benchmarks.
KLD only measures how close it is to the original, not how good it is.
Sadman782@reddit
In real-world usage of Gemma 4, I don't see much degradation (it's far from being killed) after attention rotation was introduced for iSWA. Maybe they recover somehow through reasoning? Also, the PPL isn't as different as the KLD.
https://github.com/ggml-org/llama.cpp/pull/21513
Note: I'm using IQ4_XS. There's also the possibility that, at lower weight quants, the degradation from KV cache quantization is smaller than it is for BF16, and no one's running BF16 here anyway.
IrisColt@reddit
Exactly my findings: Gemma is able to translate moderately long texts, while Qwen derails. Again, I am using KV Q4_0.
IrisColt@reddit
Loved the article, thanks!
Low88M@reddit
What are the necessary flags for building llama.cpp with full support and optimization for KV cache quants?
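For context, this is roughly how I'm building and running it right now; I'm not sure the FA_ALL_QUANTS option is actually required, and the flag spellings are from memory, so they may be off:

```
# build with CUDA; optionally compile flash-attention kernels for every KV-cache quant combination
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build --config Release -j

# run with flash attention enabled and a quantized K/V cache (model path is just a placeholder)
./build/bin/llama-server -m model.gguf --flash-attn on \
    --cache-type-k q8_0 --cache-type-v q8_0
```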
Remove_Ayys@reddit
Comparing the Kullback-Leibler divergence between different models is meaningless and an incorrect use of the metric.
dinerburgeryum@reddit
Great writeup, thank you. I speculate Gemma's degradation is actually related to the decision to continue quantizing the SWA cache. The team had initially decided to always keep SWA in 16-bit, but backed that out. I would be genuinely curious to know how that decision impacts real downstream benchmarks and tasks.
Witty_Mycologist_995@reddit
Any way to disable SWA quantization?
dinerburgeryum@reddit
Not to my knowledge... there was chatter in the PR about allowing SWA quantization separately from the global attention, but I don't think anything came of it.
fragment_me@reddit
The early releases of llama.cpp that supported the model actually kept the SWA KV at f16 and you couldn't quantize it. Then a few patches later they made it match the KV cache quant.
dinerburgeryum@reddit
Yea, that's what I'm referring to. There was talk of making a separate command line argument to specify the SWA datatype to compensate for the change but it never happened, which is IMO a bummer
-Ellary-@reddit
Agree, it's a known problem for all Gemma 4 models.
walmis@reddit
Curious what KLDs would be with TurboQuant method
draconic_tongue@reddit
only tested qwen 3.6 27b so far but
fragment_me@reddit
Why not show q8_0 K and 4_0 V?
draconic_tongue@reddit
I'll add it in a bit
fragment_me@reddit
I appreciate the response and the status update. I don't necessarily agree with the way you've ordered the data to show q8/t4 on top of q8/q4. When I ran my own tests with Q3.5 27B, the llama.cpp fork of TurboQuant failed to do anything meaningful in the benchmarks. Yes, I know your benchmarks are for Q3.6. On another note, I highly recommend a second 3090; running this model at Q8 with KV Q8/Q8 is amazing.
draconic_tongue@reddit
it's there
DeepOrangeSky@reddit
Given that Qwen3.6 27b only came out a couple of days ago, and how common it is for brand-new models to have issues that only get discovered and fixed over the first couple of weeks, it might be a good idea to also test a model that isn't so new.
If you want a model as similar as possible to that one without it being brand new, Qwen3.5 27b could be a good choice: it has been out long enough, and used heavily enough, that any bugs in its early quant iterations have probably been worked out by now.
It might also be worth trying one or two models that, besides not being brand new, deliberately differ in architecture or size: maybe Mistral Nemo 12b (different size, and although also dense, a somewhat different architecture in some respects I think), and maybe OSS 20b (closer in size, but very different in architecture and some key aspects), if you're curious whether size or architecture has a significant effect on how well a model handles KV quantization.
That said, the most interesting thing I've seen about KV-quantization damage is people noticing severe degradation in real-world long-context use even when the KLD implied there should be hardly any. There seems to be some compounding effect where even a seemingly tiny KLD still destroys long-context ability in actual use, to a surprisingly severe degree. So a test of long-context ability in actual use might be just as important as the KLD scores (arguably more so), if you can make it quantifiable for real-world long-context use: either a formal needle-in-a-haystack benchmark, or a fixed default story with named characters that you reuse as the benchmark each time, set up so that if the model is damaged by KV quantization it starts confusing the characters' names and who did what in the plot at a much smaller token count than the un-KV-quantized version does.
Guilty_Rooster_6708@reddit
If possible can you test the Gemma 24B MoE as well? It looks lobotomized at Q4 cache rn
DinoAmino@reddit
It's supposed to have no impact on accuracy. You'll just get to enjoy that same degradation at the longer context you weren't able to reach before.
jkflying@reddit
Based on that it looks like a Q8 cache for Qwen should be the default
vevi33@reddit
nah, if you check the long context divergence, that is pretty significant. If you are coding with agents with high context, you will see the difference unfortunately :/
iMakeTea@reddit
Good looking out.
I planned to use Q8 KV to fit Q6 qwen3.6 and 150k context with 4 qwen agents for math and medical analysis. I don't need coding. I'll stick with full KV cache then.
Boxkillor@reddit
And fp16 on gemma4
TechnoByte_@reddit
That is the default
nmqanh@reddit
If I have an M2 Max 96GB and run Qwen 3.6 27B 4-bit, I still have plenty of VRAM available (50GB). Is there any benefit to turning on TurboQuant 8-bit, or should I leave it off?
seamonn@reddit
So Gemma starts getting Brain Damage on cache quantization
-Ellary-@reddit
Looks like Gemma 4 26b MoE is getting noticeable damage even at Q8K GGUF weights.
Miserable-Dare5090@reddit
no this is for cache, not the model.
-Ellary-@reddit
I get that, check KL of weights:
0.530 KL for Q8KXL gguf and 1.077 for IQ4_XS
Miserable-Dare5090@reddit
Yeah, the 26B model is not the best. Now I'm considering not running the 31B model at all, which is what I've been running because you can use Gemma 4 E2B as a speculative decoder for it and get 40 t/s while fitting it all in 24 GB.
DinoAmino@reddit
At this point it seems clear enough why Google published the unoriginal TurboQuant paper in advance of the Gemma 4 release
dampflokfreund@reddit
Please stop this meme. So far, there is not a single piece of evidence that suggests turbo quants are higher quality than what we have right now.
DinoAmino@reddit
But I never said anything about accuracy, did I? Look at my history, I participate in dispelling the myth. The original accuracy and its degradation are mostly preserved; TurboQuant just lets you enjoy that increased degradation that high context delivers.
Ell2509@reddit
They read your comment and replied in a hurry. Who cares what you were saying? They had a point to make. /s
Glum-Atmosphere9248@reddit
And... what numbers are supposed to be high, as in, bad in absolute terms? Like, is 1.088 a number that actually translates to bad results?
oobabooga4@reddit (OP)
My personal rule of thumb is that anything above 0.1 is high.
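As a crude two-outcome illustration of what 0.1 already means (numbers invented purely for intuition): if the baseline puts probability 0.9 on the top token and the quantized-cache run drops that to about 0.72 at every position, the mean KLD is already roughly 0.1 nats:

$$
0.9 \ln\frac{0.9}{0.72} + 0.1 \ln\frac{0.1}{0.28} \approx 0.201 - 0.103 \approx 0.098
$$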
hdmcndog@reddit
But then, according to the unsloth benchmarks, all quants are pretty bad? If I read https://unsloth.ai/docs/models/gemma-4#unsloth-gguf-benchmarks correctly, it says that UD-Q4_K_S is pretty much exactly at 1.0 and Q8_0 is at about 0.5? That seems weird… or am I misreading something?
u23043@reddit
If you use 10^-1 KLD as the metric, Q4_K_M and better are the line for Gemma 4. This seems in line with typical recommendations for pretty much all models.
hdmcndog@reddit
I don’t think I’m following what you are trying to say. Could you expand on that a bit?
Stainless-Bacon@reddit
Which one is more sensitive, K or V? Maybe it's worth it to use K q8 and V q4?
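For what it's worth, I believe llama.cpp lets you set the two independently, something like this (flag spellings from memory, model filename just an example):

```
# quantized V cache needs flash attention; K and V types don't have to match
./llama-server -m qwen3.6-27b-q4_k_m.gguf --flash-attn on \
    --cache-type-k q8_0 \
    --cache-type-v q4_0
```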
thirteen-bit@reddit
Check this issue, there are nice plots there in comments https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4140922150
Scroll down too, there are multiple.
grumd@reddit
That's super useful, thanks!
I was always curious whether KLD gets worse with larger context lengths. I think you mentioned you did around 30k context across different tasks. I wonder how different the results are at 100k or 200k?
oobabooga4@reddit (OP)
Yeah so far the largest prompts are around 30k tokens. 100k+ is a subject for future experiments :)
kernelhunter92@reddit
Yes please, I'm very curious how the KLD changes for 100k+ contexts.
Thanks for taking time to run these numbers though, I've been reading about quantizing the KV cache, but I was hesitant to do it due to anecdotal evidence, especially for programming. I'm on Qwen 3.6, so your numbers make a good case for Q8 KV, at least for smaller contexts.
GodComplecs@reddit
Please test it, 70k-100k seems to be today's minimum when working on even a smaller coding project anyway.
Sticking_to_Decaf@reddit
Ouch. That’s a big difference. It’s especially rough since it seems like Gemma uses a lot more vram for the same cache as Qwen, at least at FP8.
popoppypoppylovelove@reddit
Great info, thanks! I actually asked about this recently:
While the plots show that Qwen 3.6 27B is quite good with a Q8_0 KV cache for coding, the results for "long docs" are more concerning, given that "long" here is still quite small at ~30k and agentic coding (for me) goes well beyond that.
Would the recommendation here be, when working with long contexts (> 30k), it's better to keep a f16 KV cache and use a more heavily quantized model?
oobabooga4@reddit (OP)
It would be ideal to test the full cartesian product of (model quants) x (kv cache quants) on an expanded dataset including coding sessions 100k-200k tokens long. I'm honestly bottlenecked by compute, but that's the direction I'd like to go.
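For anyone who wants to reproduce something similar with stock llama.cpp rather than my fork, a rough sketch of that loop would look like this (filenames are placeholders and flag spellings are from memory):

```
# 1) save baseline logits once, from the unquantized model with an f16 cache
./llama-perplexity -m model-bf16.gguf -f coding_sessions.txt \
    --kl-divergence-base base_logits.kld

# 2) replay the same text for every (weight quant, cache quant) combination
for model in model-q8_0.gguf model-q4_k_m.gguf; do
    for kv in f16 q8_0 q4_0; do
        ./llama-perplexity -m "$model" -f coding_sessions.txt \
            --kl-divergence-base base_logits.kld --kl-divergence \
            --flash-attn on --cache-type-k "$kv" --cache-type-v "$kv"
    done
done
```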
RegularRecipe6175@reddit
Awesome. Thank you.
cleversmoke@reddit
Thanks for the great analysis!
beneath_steel_sky@reddit
Thanks mr. Ooba, you always provide great benchmarks & great tools
Free-Combination-773@reddit
I wonder what the results would be with q5_1 or q5_0. I'm using Qwen 3.6 27b UD-Q4_K_XL with a q5_1 KV cache and it looks fine to me, however "looks fine to me" is not very precise.
keyboardhack@reddit
The attention rotation that llama.cpp has implemented was not inspired by TurboQuant. The inspiration is from here: https://github.com/ggml-org/llama.cpp/issues/6444#issuecomment-2042194785
Long before turbo quant even existed. GG links to it here. https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4148371881
Seems like the implementation was done because TurboQuant renewed interest, but that's about it.
oobabooga4@reddit (OP)
You are right, the rotation idea was already present in ExLlamaV2 in 2024 too (see this comment from me and the turboderp comment below).
Still, ggerganov explicitly said
So I guess it's fair to say the change was 'inspired by turboquant' (by momentum, not originality).
LegacyRemaster@reddit
spicy
Velocita84@reddit
I have a question: did you compute KLD for all tokens in your datasets or only the ones in assistant turns? I'm using your methodology to test different imatrix calibrations (thanks for the llama.cpp fork btw) and I've observed that Gemma 4's distributions are extremely chaotic and nonsensical outside of where it's actually expected to output tokens, much more so than other instruct models.
oobabooga4@reddit (OP)
KL is computed over all tokens (both user and assistant turns). Interesting point about Gemma's distributions outside assistant turns, I'll look into it.
ResidentPositive4122@reddit
Can you share more details about the dataset? I looked in the "methodology" link, but it just describes the distribution, not the source. Is it internal or public? If public, is it old (and potentially in the training set)?
I've seen this pattern before, where even under extreme quants (2bit) qwen models scored very close to bf16 on some benchmarks. That shouldn't happen, unless...
oobabooga4@reddit (OP)
The dataset is private, but it's based on public sources, yes. I don't think contamination is a concern because I'm comparing the same model against itself with different cache precision, not scoring the model's answers.
ResidentPositive4122@reddit
Would you be able to test with some content that isn't public at all? Like your own writing / etc? It would be really interesting to see the difference.
Acu17y@reddit
Thanks very much, very interesting :))
bonobomaster@reddit
Super interesting!
Thanks for the effort and sharing!