Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results
Posted by oobabooga4@reddit | LocalLLaMA | View on Reddit | 67 comments
Septerium@reddit
So quantizing kv cache is still horrible
stddealer@reddit
KLD only measures how different the model's predictions are compared to the baseline. Different doesn't necessarily mean worse. Comparing benchmark scores should give a clearer picture of how much it actually degrades the model.
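For reference (and as far as I understand oobabooga's setup), the number being reported is the standard per-token KL divergence between the baseline distribution p (full-precision cache) and the quantized-cache distribution q, averaged over token positions:

$$
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{i \in \mathrm{vocab}} p_i \log \frac{p_i}{q_i}
$$

It's zero only when the two distributions match exactly, but a nonzero value says nothing by itself about which distribution gives better answers.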
I've been using a Q4_K quant of Gemma4 31B with Q8_0 cache quite a lot and it didn't feel dumb at all, about on par with a Q5_K_M quant of Qwen3.5 27B also with Q8 cache (a bit worse at coding but better at manipulating language and understanding real-world situations).
unjustifiably_angry@reddit
copingest cope that ever coped
I swear there are people who would watch John ChatGPT himself go on stage and tell you cache quantization is harmful, and still ask him for stats, methodology, and whether he really knows what he's talking about.
stddealer@reddit
Gemma3 QAT Q4_0 had way worse KLD than a Q4_0 quant of Gemma3, yet it was absolutely the better model and scored much better in benchmarks.
KLD only measures how close it is to the original, not how good it is.
Sadman782@reddit
In real-world usage of Gemma 4, I don't see much degradation (it's far from being killed) after attention rotation was introduced for iSWA. Maybe they recover somehow through reasoning? Also, the PPL isn't as different as the KLD.
https://github.com/ggml-org/llama.cpp/pull/21513
Note: I'm using IQ4_XS. There's also the possibility that, at lower weight quants, the degradation from KV cache quantization is smaller than it is for BF16, and no one's running BF16 here anyway.
IrisColt@reddit
Exactly my findings: Gemma is able to translate moderately long texts, while Qwen derails. Again, I am using KV Q4_0.
IrisColt@reddit
Loved the article, thanks!
Low88M@reddit
What are the necessary flags for building llama.cpp with full support and optimization for KV cache quants?
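For context, this is roughly how I'm building and running it right now; I'm not sure the FA_ALL_QUANTS option is actually required, and the flag spellings are from memory, so they may be off:

```
# build with CUDA; optionally compile flash-attention kernels for every KV-cache quant combination
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build --config Release -j

# run with flash attention enabled and a quantized K/V cache (model path is just a placeholder)
./build/bin/llama-server -m model.gguf --flash-attn on \
    --cache-type-k q8_0 --cache-type-v q8_0
```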
Remove_Ayys@reddit
Comparing the Kullback-Leibler divergence between different models is meaningless and an incorrect use of the metric.
dinerburgeryum@reddit
Great writeup, thank you. I speculate Gemma's degradation is actually related to the decision to continue quantizing the SWA cache. The team had initially decided to always keep SWA in 16-bit, but backed that out. I would be genuinely curious to know how that decision impacts real downstream benchmarks and tasks.
Witty_Mycologist_995@reddit
Any way to disable SWA quantization?
dinerburgeryum@reddit
Not to my knowledge... there was chatter in the PR about allowing SWA quantization separately from the global attention, but I don't think anything came of it.
fragment_me@reddit
The early releases of llama.cpp that supported the model actually kept the SWA KV at f16 and you couldn't quantize it. Then a few patches later they made it match the KV cache quant.
dinerburgeryum@reddit
Yea, that's what I'm referring to. There was talk of making a separate command line argument to specify the SWA datatype to compensate for the change but it never happened, which is IMO a bummer
-Ellary-@reddit
Agree, it's a known problem for all Gemma 4 models.
walmis@reddit
Curious what KLDs would be with TurboQuant method
draconic_tongue@reddit
only tested qwen 3.6 27b so far but
fragment_me@reddit
Why not show q8_0 K and 4_0 V?
draconic_tongue@reddit
I'll add it in a bit
fragment_me@reddit
I appreciate the response and the status update. I don't necessarily agree with the way you've ordered the data to show q8/t4 on top of q8/q4. When I ran my own tests with Q3.5 27B, the llama.cpp fork of TurboQuant failed to do anything meaningful in the benchmarks. Yes, I know your benchmarks are for Q3.6. On another note, I highly recommend a second 3090; running this model at Q8 with KV Q8/Q8 is amazing.
draconic_tongue@reddit
it's there
DeepOrangeSky@reddit
Given that Qwen3.6 27b only came out a couple of days ago, and how common it is for brand-new models to have issues that only get discovered and fixed over the first couple of weeks, it might be a good idea to also test a model that isn't so new.
If you want a model as similar as possible to that one without it being brand new, Qwen3.5 27b could be a good choice: it has been out long enough, and used heavily enough, that any bugs in its early quant iterations have probably been worked out by now.
It might also be worth trying one or two models that, besides not being brand new, deliberately differ in architecture or size: maybe Mistral Nemo 12b (different size, and although also dense, a somewhat different architecture in some respects I think), and maybe OSS 20b (closer in size, but very different in architecture and some key aspects), if you're curious whether size or architecture has a significant effect on how well a model handles KV quantization.
That said, the most interesting thing I've seen about KV-quantization damage is people noticing severe degradation in real-world long-context use even when the KLD implied there should be hardly any. There seems to be some compounding effect where even a seemingly tiny KLD still destroys long-context ability in actual use, to a surprisingly severe degree. So a test of long-context ability in actual use might be just as important as the KLD scores (arguably more so), if you can make it quantifiable for real-world long-context use: either a formal needle-in-a-haystack benchmark, or a fixed default story with named characters that you reuse as the benchmark each time, set up so that if the model is damaged by KV quantization it starts confusing the characters' names and who did what in the plot at a much smaller token count than the un-KV-quantized version does.
Guilty_Rooster_6708@reddit
If possible can you test the Gemma 24B MoE as well? It looks lobotomized at Q4 cache rn
DinoAmino@reddit
It's supposed to have no impact on accuracy. You'll just get to enjoy that same degradation at the longer context you weren't able to reach before.
jkflying@reddit
Based on that it looks like a Q8 cache for Qwen should be the default
vevi33@reddit
nah, if you check the long context divergence, that is pretty significant. If you are coding with agents with high context, you will see the difference unfortunately :/
iMakeTea@reddit
Good looking out.
I planned to use Q8 KV to fit Q6 qwen3.6 and 150k context with 4 qwen agents for math and medical analysis. I don't need coding. I'll stick with full KV cache then.
Boxkillor@reddit
And fp16 on gemma4
TechnoByte_@reddit
That is the default
nmqanh@reddit
If I have an M2 Max 96GB and run Qwen 3.6 27B 4-bit, I still have plenty of VRAM available (50GB). Is there any benefit to turning on TurboQuant 8-bit, or should I leave it off?
seamonn@reddit
So Gemma starts getting Brain Damage on cache quantization
-Ellary-@reddit
Looks like Gemma 4 26b MoE is getting noticeable damage even at Q8K GGUF weights.
Miserable-Dare5090@reddit
no this is for cache, not the model.
-Ellary-@reddit
I get that, check KL of weights:
0.530 KL for Q8KXL gguf and 1.077 for IQ4_XS
Miserable-Dare5090@reddit
Yeah, the 26B model is not the best. Now I'm considering not running the 31B model at all, which is what I've been running because you can use Gemma 4 E2B as a speculative decoder for it and get 40 t/s while fitting it all in 24 GB.
DinoAmino@reddit
At this point it seems clear enough why Google published the unoriginal TurboQuant paper in advance of the Gemma 4 release
dampflokfreund@reddit
Please stop this meme. So far, there is not a single piece of evidence that suggests turbo quants are higher quality than what we have right now.
DinoAmino@reddit
But I never said anything about accuracy, did I? Look at my history, I participate in dispelling the myth. The original accuracy and its degradation are mostly preserved; TurboQuant just lets you enjoy that increased degradation that high context delivers.
Ell2509@reddit
They read your comment and replied in a hurry. Who cares what you were saying? They had a point to make. /s
Glum-Atmosphere9248@reddit
And... what numbers are supposed to be high, as in, bad in absolute terms? Like, is 1.088 a number that actually translates to bad results?
oobabooga4@reddit (OP)
My personal rule of thumb is that anything above 0.1 is high.
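As a crude two-outcome illustration of what 0.1 already means (numbers invented purely for intuition): if the baseline puts probability 0.9 on the top token and the quantized-cache run drops that to about 0.72 at every position, the mean KLD is already roughly 0.1 nats:

$$
0.9 \ln\frac{0.9}{0.72} + 0.1 \ln\frac{0.1}{0.28} \approx 0.201 - 0.103 \approx 0.098
$$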
hdmcndog@reddit
But then, according to the unsloth benchmarks, all quants are pretty bad? If I read https://unsloth.ai/docs/models/gemma-4#unsloth-gguf-benchmarks correctly, it says that UD-Q4_K_S is pretty much exactly at 1.0 and Q8_0 is at about 0.5? That seems weird… or am I misreading something?
u23043@reddit
If you use 10^-1 KLD as the metric, Q4_K_M and better are the line for Gemma 4. This seems in line with typical recommendations for pretty much all models.
hdmcndog@reddit
I don’t think I’m following what you are trying to say. Could you expand on that a bit?
Stainless-Bacon@reddit
Which one is more sensitive, K or V? Maybe it's worth it to use K q8 and V q4?
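For what it's worth, I believe llama.cpp lets you set the two independently, something like this (flag spellings from memory, model filename just an example):

```
# quantized V cache needs flash attention; K and V types don't have to match
./llama-server -m qwen3.6-27b-q4_k_m.gguf --flash-attn on \
    --cache-type-k q8_0 \
    --cache-type-v q4_0
```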
thirteen-bit@reddit
Check this issue, there are nice plots there in comments https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4140922150
Scroll down too, there are multiple.
grumd@reddit
That's super useful, thanks!
I was always curious whether KLD gets worse with larger context lengths. I think you mentioned you did around 30k context across different tasks. I wonder how different the results are at 100k or 200k?
oobabooga4@reddit (OP)
Yeah so far the largest prompts are around 30k tokens. 100k+ is a subject for future experiments :)
kernelhunter92@reddit
Yes please, I'm very curious how the KLD changes for 100k+ contexts.
Thanks for taking time to run these numbers though, I've been reading about quantizing the KV cache, but I was hesitant to do it due to anecdotal evidence, especially for programming. I'm on Qwen 3.6, so your numbers make a good case for Q8 KV, at least for smaller contexts.
GodComplecs@reddit
Please test it, 70k-100k seems to be today's minimum when working on even a smaller coding project anyway.
Sticking_to_Decaf@reddit
Ouch. That’s a big difference. It’s especially rough since it seems like Gemma uses a lot more vram for the same cache as Qwen, at least at FP8.
popoppypoppylovelove@reddit
Great info, thanks! I actually asked about this recently:
While the plots show that Qwen 3.6 27B is quite good with a Q8_0 KV cache for coding, the results for "long docs" are more concerning, given that "long" here is still quite small at ~30k and agentic coding (for me) goes well beyond that.
Would the recommendation here be, when working with long contexts (> 30k), it's better to keep a f16 KV cache and use a more heavily quantized model?
oobabooga4@reddit (OP)
It would be ideal to test the full cartesian product of (model quants) x (kv cache quants) on an expanded dataset including coding sessions 100k-200k tokens long. I'm honestly bottlenecked by compute, but that's the direction I'd like to go.
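For anyone who wants to reproduce something similar with stock llama.cpp rather than my fork, a rough sketch of that loop would look like this (filenames are placeholders and flag spellings are from memory):

```
# 1) save baseline logits once, from the unquantized model with an f16 cache
./llama-perplexity -m model-bf16.gguf -f coding_sessions.txt \
    --kl-divergence-base base_logits.kld

# 2) replay the same text for every (weight quant, cache quant) combination
for model in model-q8_0.gguf model-q4_k_m.gguf; do
    for kv in f16 q8_0 q4_0; do
        ./llama-perplexity -m "$model" -f coding_sessions.txt \
            --kl-divergence-base base_logits.kld --kl-divergence \
            --flash-attn on --cache-type-k "$kv" --cache-type-v "$kv"
    done
done
```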
RegularRecipe6175@reddit
Awesome. Thank you.
cleversmoke@reddit
Thanks for the great analysis!
beneath_steel_sky@reddit
Thanks mr. Ooba, you always provide great benchmarks & great tools
Free-Combination-773@reddit
I wonder what the results would be with q5_1 or q5_0. I'm using Qwen 3.6 27b UD-Q4_K_XL with a q5_1 KV cache and it looks fine to me, however "looks fine to me" is not very precise.
keyboardhack@reddit
The attention rotation that llama.cpp has implemented was not inspired by TurboQuant. The inspiration is from here: https://github.com/ggml-org/llama.cpp/issues/6444#issuecomment-2042194785
Long before turbo quant even existed. GG links to it here. https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4148371881
Seems like the implementation was done because TurboQuant renewed interest, but that's about it.
oobabooga4@reddit (OP)
You are right, the rotation idea was already present in ExLlamaV2 in 2024 too (see this comment from me and the turboderp comment below).
Still, ggerganov explicitly said
So I guess it's fair to say the change was 'inspired by turboquant' (by momentum, not originality).
LegacyRemaster@reddit
spicy
Velocita84@reddit
I have a question: did you compute KLD for all tokens in your datasets or only the ones in assistant turns? I'm using your methodology to test different imatrix calibrations (thanks for the llama.cpp fork btw) and I've observed that Gemma 4's distributions are extremely chaotic and nonsensical outside of where it's actually expected to output tokens, much more so than other instruct models.
oobabooga4@reddit (OP)
KL is computed over all tokens (both user and assistant turns). Interesting point about Gemma's distributions outside assistant turns, I'll look into it.
ResidentPositive4122@reddit
Can you share more details about the dataset? I looked in the "methodology" link, but it just describes the distribution, not the source. Is it internal or public? If public, is it old (and potentially in the training set)?
I've seen this pattern before, where even under extreme quants (2bit) qwen models scored very close to bf16 on some benchmarks. That shouldn't happen, unless...
oobabooga4@reddit (OP)
The dataset is private, but it's based on public sources, yes. I don't think contamination is a concern because I'm comparing the same model against itself with different cache precision, not scoring the model's answers.
ResidentPositive4122@reddit
Would you be able to test with some content that isn't public at all? Like your own writing / etc? It would be really interesting to see the difference.
Acu17y@reddit
Thanks very much, very interesting :))
bonobomaster@reddit
Super interesting!
Thanks for the effort and sharing!