Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4)
Posted by imgroot9@reddit | LocalLLaMA | View on Reddit | 58 comments
I've been using Qwen3.6-27B-Q5_K_M with turbo3 KV cache since it was released, and I haven't had any issues at all (no loops, no memory loss, etc.). However, I'm also aware that V cache compression is generally not recommended.
So I wanted to check how this is possible, and I learned that llama-perplexity.exe is the right tool for this test. I'm using TheTom's turboquant_plus built on my machine; AFAIK you can download a pre-built release by now as well. I have a 3090 eGPU and I'm using a 200k context.
This is how I used the tool:
First I ran it without KV cache quantization (PowerShell):
.\llama-perplexity.exe -m models/unsloth/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q5_K_M.gguf -f wiki.test.raw
After around 7-8 minutes, it will give you a result something like: Final estimate: PPL = 6.9233 +/- 0.04564
Then you can repeat it with your quant values, like:
.\llama-perplexity.exe -m models/unsloth/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q5_K_M.gguf -f wiki.test.raw --cache-type-k turbo3 --cache-type-v turbo3
(wiki.test.raw is just a text dataset well suited to this kind of test; you can download it from anywhere)
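If you want to sweep several cache types in one go, here's roughly how you could automate it (a minimal Python sketch; the binary name and paths are assumptions you'll need to adjust for your build):

```python
import re
import subprocess

MODEL = "models/unsloth/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q5_K_M.gguf"  # adjust to your path
DATASET = "wiki.test.raw"

# None = no --cache-type flags, i.e. the F16 baseline
CACHE_TYPES = [None, "q8_0", "q4_0", "turbo4", "turbo3"]

for ct in CACHE_TYPES:
    cmd = ["llama-perplexity", "-m", MODEL, "-f", DATASET]
    if ct is not None:
        cmd += ["--cache-type-k", ct, "--cache-type-v", ct]
    run = subprocess.run(cmd, capture_output=True, text=True)
    # llama-perplexity ends with a line like "Final estimate: PPL = 6.9233 +/- 0.04564"
    m = re.search(r"Final estimate: PPL = ([\d.]+) \+/- ([\d.]+)", run.stdout + run.stderr)
    print(ct or "f16", m.group(1) if m else "no PPL found, check the output")
```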
And the results were something I did not expect at all: all quants perform well within the limits. Since I'm quite new to local LLMs, I tried to understand how that's possible, and as far as I can tell, a dense model above 20B params and above Q4 is robust enough to be much less sensitive to KV cache quants. I can confirm that turbo3 was not working well for me with a 35B model, and most small models would probably be totally confused by a highly compressed V cache.
Let me switch to AI from here on: I pasted my results into Gemini and it came up with a nicely formatted post based on our conversation, and I'm happy to use it since English is not my first language.
What is Perplexity (PPL)?
For those new to benchmarking, Perplexity is a measure of how "surprised" a model is by a sequence of text.
* Lower is better.
* A score under 10.0 on Wikitext is generally the mark of a very coherent, "smart" model.
* We are looking at the Delta (change). If a quantization setting increases PPL by more than 0.1–0.2, you'll likely start seeing "drunken" behavior or loops in long conversations.
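If a formula helps: perplexity is the exponential of the average negative log-probability the model assigned to each actual next token. A toy illustration (made-up probabilities, not how llama-perplexity is implemented internally):

```python
import math

# Made-up probabilities the model assigned to the actual next tokens in a sequence
token_probs = [0.20, 0.05, 0.60, 0.10]

# PPL = exp(-mean(log p)); the less "surprised" the model is, the lower the score
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
print(round(math.exp(nll), 2))  # ~6.39 for this toy sequence
```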
Results
The results blew me away. The "common wisdom" that Q4 is unusable appears to be a myth for the 27B+ dense class.
| KV Cache Setting | Perplexity (PPL) | Delta vs. F16 | Verdict |
|---|---|---|---|
| F16 (Baseline) | 6.9233 | - | Reference |
| Q8_0 | 6.9193 | -0.0040 | Identical (Margin of Error) |
| Q4_0 | 6.9381 | +0.0148 | Transparent (Highly Recommended) |
| Turbo4 (4-bit) | 6.9483 | +0.0250 | Excellent |
| Turbo3 (3-bit) | 7.0121 | +0.0888 | Great for Extreme Context |
Observations & Recommendations
1. The Q4 "Sweet Spot" The jump from F16 to Q4_0 is only 0.014. To put that in perspective, the margin of error for the test was 0.045. This means Q4_0 is mathematically indistinguishable from uncompressed cache. If you aren't using Q4 or Q8 on a 3090, you're just wasting VRAM.
2. When to use Turbo3? I’ve been using Turbo3 for a week in programming tasks. It allows for a 200k context window on a single 3090 without breaking a sweat. While the PPL hit is measurable (+0.08), it's still well within the "safe zone."
3. The MoE Exception: While this dense 27B model handles Turbo3 perfectly, I noticed that 35B MoE models tend to loop or error out with 3-bit cache. It seems the "Router" in MoE architectures is much more sensitive to the noise introduced by heavy quantization.
The "Needle in a Haystack" Test
To be 100% sure your setup is safe for production work, try this "Needle in a Haystack" test (a quick script for building the prompt is sketched after the steps):
1. Paste a long piece of code (e.g., 50k tokens).
2. In the middle, hide a very specific, weird comment like // The password is: BANANA-123.
3. Ask the model: "What was the hidden password in the code I gave you?"
4. If it finds it instantly, your 200k context is working perfectly.
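A throwaway script for generating such a haystack could look like this (a sketch; the filler lines and file name are placeholders, in practice you'd paste a real repo dump):

```python
# Sketch: generate a long "haystack" file with one hidden needle in the middle.
# The filler content and file name are placeholders; use a real code dump in practice.
filler = [f"def helper_{i}(x):\n    return x + {i}\n" for i in range(5000)]
needle = "// The password is: BANANA-123\n"

filler.insert(len(filler) // 2, needle)  # bury the needle roughly mid-context

with open("haystack.txt", "w") as f:
    f.writelines(filler)

print("Paste haystack.txt into the chat, then ask:")
print("What was the hidden password in the code I gave you?")
```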
TL;DR: Don't fear KV quantization on 27B+ models. Q4_0 is a "free lunch," and Turbo3 is a game-changer for repo-level coding if you need the 200k+ context.
vevi33@reddit
Did you do benchmarks on long context? Above 100k? I only experience issues with KV cache quantization, even Q8, when the context grows.
Mart-McUH@reddit
wiki.test is maybe too common to be a good test (e.g. it will be better preserved than more outlier texts). Another problem is that I think the test is only done on short prompts, like ~1k tokens or so? KV quantization is felt mostly with long contexts and with understanding subtle relations/subtext within the context. Most benchmarks do not measure this.
In short - this is not to challenge the results, but the test is probably not the best one to show the detrimental effects.
fragment_me@reddit
Great job updating the post and following up. I have two pieces of constructive criticism: 1. Stop using an LLM for writing your post; maybe just use it for the tables. 2. Run your benchmarks multiple times (probably need like 3-5 runs) for the results to be meaningful.
fragment_me@reddit
Here friend, you can run this to also get KLD.
/home/user/llm/llama.cpp/build/bin/llama-perplexity -m /home/user/llm/models/Qwen3.5-27B/Qwen3.5-27B-UD-Q4_K_XL.gguf -f /home/user/llm/wikitext-2-raw/wiki.test.raw -t 8 -c 512 -fa on --cache-type-k f16 --cache-type-v f16 --no-mmap -ngl 999 --kl-divergence-base /home/user/llm/models/Qwen3.5-27B/Qwen3.5-27B-UD-Q4_K_XL-f16.logits.bin
Final estimate: PPL = 6.9606 +/- 0.04552
/home/user/llm/llama.cpp/build/bin/llama-perplexity -m /home/user/llm/models/Qwen3.5-27B/Qwen3.5-27B-UD-Q4_K_XL.gguf -f /home/user/llm/wikitext-2-raw/wiki.test.raw -t 8 -c 512 -fa on --cache-type-k q8_0 --cache-type-v q8_0 --no-mmap -ngl 999 --kl-divergence --kl-divergence-base /home/user/llm/models/Qwen3.5-27B/Qwen3.5-27B-UD-Q4_K_XL-f16.logits.bin
Notice the second command has an extra parameter: --kl-divergence
You should get output like this:
PLEASE NOTE THIS IS NOT THE RESULT OF THE LAST COMMAND JUST AN EXAMPLE OF WHAT IT WILL LOOK LIKE
====== Perplexity statistics ======
Mean PPL(Q) : 6.961169 ± 0.045531
Mean PPL(base) : 6.861779 ± 0.044615
Cor(ln(PPL(Q)), ln(PPL(base))): 99.62%
Mean ln(PPL(Q)/PPL(base)) : 0.014381 ± 0.000572
Mean PPL(Q)/PPL(base) : 1.014485 ± 0.000580
Mean PPL(Q)-PPL(base) : 0.099391 ± 0.004048
====== KL divergence statistics ======
Mean KLD: 0.014832 ± 0.000481
Maximum KLD: 20.104038
99.9% KLD: 1.460476
99.0% KLD: 0.121376
95.0% KLD: 0.032988
90.0% KLD: 0.019502
Median KLD: 0.004123
10.0% KLD: 0.000134
5.0% KLD: 0.000039
1.0% KLD: 0.000005
0.1% KLD: -0.000000
Minimum KLD: -0.000050
====== Token probability statistics ======
Mean Δp: -0.209 ± 0.009 %
Maximum Δp: 99.423%
99.9% Δp: 20.815%
99.0% Δp: 6.874%
95.0% Δp: 3.051%
90.0% Δp: 1.741%
75.0% Δp: 0.332%
Median Δp: -0.006%
25.0% Δp: -0.573%
10.0% Δp: -2.265%
5.0% Δp: -3.837%
1.0% Δp: -9.420%
0.1% Δp: -30.138%
Minimum Δp: -99.576%
RMS Δp : 3.343 ± 0.059 %
Same top p: 95.581 ± 0.053 %
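For intuition, the "Mean KLD" line is the average KL divergence between the base (f16) and quantized runs' next-token distributions over the whole vocabulary, roughly like this toy sketch (illustrative only, not llama.cpp's actual code):

```python
import math

def kl_divergence(p, q):
    # KL(P || Q) = sum_i p_i * ln(p_i / q_i) over the vocabulary
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over a 4-token "vocabulary"
base = [0.70, 0.20, 0.08, 0.02]   # f16 KV cache
quant = [0.68, 0.21, 0.09, 0.02]  # quantized KV cache

print(kl_divergence(base, quant))  # ~0.001: the distributions barely moved
```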
imgroot9@reddit (OP)
thanks, updated the post
thetaFAANG@reddit
How do you guys even like this model, it repeats itself and does amateurish things
Is this just a benchmarking cult?
imgroot9@reddit (OP)
you may have misconfigured something. zero repeats for me and it's absolutely the best coding model ever that fits in 24GB VRAM. not the best as an architect, that's true, but when something is fairly complex, I create the plan with a cloud model, then implement it locally.
thetaFAANG@reddit
I have it running in omlx and opencode; the only misconfigurations would be default settings
666666thats6sixes@reddit
The current version of omlx is broken, there are multiple unresolved regressions. Downgrading to omlx-3.6 fixes them, including the looping.
Betadoggo_@reddit
PPL and KLD are no longer good references for quality loss as shown in the PR that added activation rotation. Q4 kv shows a minimal loss in both metrics but actually causes a huge dropoff in AIME even after the PR which improved it significantly.
https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4150413357
IrisColt@reddit
This.
Fit_Split_9933@reddit
Does this mean the Q5 is a good choice?
fragment_me@reddit
It seems to be more model-specific (the effect size), but you're definitely right that the stats aren't everything. Check this table out from the official SGLang docs:
https://docs.sglang.io/docs/advanced_features/quantized_kv_cache
jubilantcoffin@reddit
The results you post are FP4 not Q4.
fragment_me@reddit
That's a fair point, but I think it still points to something related.
Betadoggo_@reddit
I haven't seen this before, thanks for pointing it out. It's interesting that AIME seems to be hit the hardest regardless of model.
Ok-Measurement-1575@reddit
How can KLD not be a good reference point, though?
jubilantcoffin@reddit
It is. The argument is really insanely stupid: the dev picked the absolute worst case to more easily demonstrate improvements and now it's being argued as if this is the typical case.
jubilantcoffin@reddit
This is a stupid argument because the developer intentionally picked this as a known worst case.
imgroot9@reddit (OP)
Thank you! Let me try to test this today and add it to the post.
Old-Sherbert-4495@reddit
it def does its job and saves VRAM for me, but at a brutal cost in performance.
imgroot9@reddit (OP)
no noticeable change for me. I have a Thunderbolt 3090 eGPU, so I keep everything in VRAM to avoid using that connection. my card is also limited to 250W so there's no noise, and I get around 25-28 t/s no matter what. prefill is also more or less the same. of course, it gets a bit slower with a big context, but that's expected.
MmmmMorphine@reddit
I thought it was K that shouldn't be compressed and V should be the target?
imgroot9@reddit (OP)
the general rule: K is more compressible without hurting output quality.
MmmmMorphine@reddit
No worries, just wasn't sure whether I got mixed up myself
Ranmark@reddit
I've tried to download TheTom's release of turboquant_plus, but it doesn't seem to work for me. I try to run a model via a command that works on mainline llama.cpp (turbo4 on the V cache being the only difference), but it just doesn't run, no errors. Maybe it has something to do with my old hardware (GTX 1080 Ti + RTX 2060 Super).
TheRenegadeKaladian@reddit
I'm doing back-to-back comparisons of TheTom's branch and the main branch. Did you also try ik_llama? I'm actually getting more performance on ik_llama.
dodistyo@reddit
Thanks for this man! I always use q4 for KV cache because i need to have enough room to do the actual work.
did you test a long-running coding session with that 200k? local models that size tend to degrade in performance toward the end of the window.
Fit_Split_9933@reddit
In my experience, using q4 for KV can reduce PP speed by about ten times because it uses the CPU instead of the GPU. What about you?
dodistyo@reddit
I usually offload everything to GPU for speed.
Fit_Split_9933@reddit
What I mean is, even with enough VRAM and all layers offloaded to the GPU, using q4 for KV will force PP onto the CPU.
dodistyo@reddit
Well, I honestly don't know about that and I'm not so sure either.
Sticking_to_Decaf@reddit
My agent tends to run context compression at about 120k tokens. It did fine up until that point, but things get messy after a couple of rounds of compression.
dodistyo@reddit
which quant did you use? also, I haven't tried turbo3; I wonder how it compares with q4.
Sticking_to_Decaf@reddit
FP8 both for the model and for kv cache. Model FP8 is the one released by Qwen. I am very cautious about quants because the specific settings used when creating a quant can matter a lot more than q4 vs q6 vs q8 vs fp8 vs nvfp4 etc. If the person making the quant doesn’t know what they are doing or isn’t careful it’s going to be messed up.
Anbeeld@reddit
Is it just me, or does enabling any KV cache quantization make everything slow as hell, especially prefill? I have a 5700X3D and a 3090.
tmvr@reddit
Someone else with a 3090 wrote this a few days ago, so I checked, and I don't see a huge dropoff with a 4090 and the CUDA 12 build of llama.cpp b8733. I get about an 8-9% drop at 128K+ context and 4-5% at 64K when switching from FP16 to q8_0.
fiddlerwoaroof@reddit
I’ve had to watch llama.cpp logs for this: in some cases, I get graph splits because the quantization isn’t supported on-GPU and so the data has to be shipped back to main memory and quantized on the CPU
Finanzamt_Endgegner@reddit
PPL is important but we should also test KLD. still, I really hope this is true; the model already seems to be exceptionally error-resistant to quantization of the weights 🤯
Middle_Bullfrog_6173@reddit
KLD is important but you should really test accuracy on actual tasks.
Small PPL and KLD deltas can still hide failures in reasoning.
Finanzamt_Endgegner@reddit
Yeah ofc, the more testing the better 😉
imgroot9@reddit (OP)
agree, but I assume if the PPL delta is very low, it's a pretty safe bet that KLD is also minimal
Digger412@reddit
That's not necessarily true. PPL is a measure of surprisal at the next token, and it only considers the probability assigned to the actual next token. KLD measures the difference in the distribution across the entire vocab.
A quote from the YAQA paper that has my favorite definition for PPL vs KLD:
TomLucidor@reddit
run Vectera and other hallucination benchmarks please
leonbollerup@reddit
Is there some page where optimal settings for models get collected, or should we build something?
admajic@reddit
Do you find that once you get close to 180k context, the tokens/s is half the initial speed?
EbbNorth7735@reddit
I literally just tried turboquant in vLLM and it told me it couldn't be used with Qwen's architecture. Does anyone know if Copilot lied about what command to use? Can it be done with vLLM?
Velocita84@reddit
A certain ppl score on wikitext doesn't mean anything. Gemma 4 scores in the thousands and works just fine.
FullOf_Bad_Ideas@reddit
That's because of baked-in chat templates.
You wouldn't be able to use it without them.
Velocita84@reddit
Uh, yeah? Same goes for any other model. I'm saying the raw PPL score on wikitext is irrelevant. The delta is OK if the base score isn't ridiculously high like with Gemma 4, but KLD is still better.
FullOf_Bad_Ideas@reddit
not all models have chat templates baked in to the same degree and they can have normal perplexity even when they're instruct-tuned.
I think it's irrelevant only when it's ridiculously high on a model that clearly works well for chat.
Velocita84@reddit
Ok, i think i misunderstood what you were trying to say. Did you mean that gemma 4 specifically is overtrained for instruct at the cost of nonsensical raw text performance? Because yes, that's true
FullOf_Bad_Ideas@reddit
Yes, all instruct models have chat template tokens trained in, but if they respond to a perplexity test that doesn't follow their chat template by completely collapsing, that's unusual. Most models do have reasonable perplexity regardless of those chat template tokens. Maybe it's some form of securing the model, maybe it's a bug. But I don't think this invalidates perplexity as a metric of output quality for LLMs that don't show this behavior.
imgroot9@reddit (OP)
added a comment to the post, thanks
hectaaaa@reddit
Commenting to get updates on this, seems interesting!
BringMeTheBoreWorms@reddit
Have you been using the latest release of llama.cpp? Optimisations based on turboquant went in early April that make q8 and q4 much less lobotomising.
I think q8 with llama.cpp is pretty safe to use as a default for most setups now.
Trouble with turboquant is that you have to use a build which is not up to date with the latest llama.cpp.
imgroot9@reddit (OP)
turboquant is also refreshed, so it is not much behind. but the main takeaway here is that q4 is almost identical to f16 if you have a 20B+ dense model.
BringMeTheBoreWorms@reddit
Yeah, it's pretty damn good. I'll be using q8 for a while before I trust q4 on everything, though. Bigger, broader testing might show where it starts to fail.