Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4)
Posted by imgroot9@reddit | LocalLLaMA | View on Reddit | 58 comments
I've been using Qwen3.6-27B-Q5_K_M with turbo3 KV cache since it was released, and I haven't had any issues at all (no loops, no memory loss, etc.). However, I'm also aware that V cache compression is generally not recommended.
So I wanted to check how this is possible, and I learned that llama-perplexity.exe is the right tool for this test. I'm using TheTom's turboquant_plus built on my machine; AFAIK you can download a pre-built release by now as well. I have a 3090 eGPU and I'm using a 200k context.
This is how I used the tool:
First I ran it without KV cache quantization (PowerShell):
.\llama-perplexity.exe -m models/unsloth/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q5_K_M.gguf -f wiki.test.raw
After around 7-8 minutes, it will give you a result something like: Final estimate: PPL = 6.9233 +/- 0.04564
Then you can repeat it with your quant values, like:
.\llama-perplexity.exe -m models/unsloth/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q5_K_M.gguf -f wiki.test.raw --cache-type-k turbo3 --cache-type-v turbo3
(wiki.test.raw is just a text dataset well suited to this kind of test; you can download it from anywhere)
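If you want to sweep several cache types in one go, here's roughly how you could automate it (a minimal Python sketch; the binary name and paths are assumptions you'll need to adjust for your build):

```python
import re
import subprocess

MODEL = "models/unsloth/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q5_K_M.gguf"  # adjust to your path
DATASET = "wiki.test.raw"

# None = no --cache-type flags, i.e. the F16 baseline
CACHE_TYPES = [None, "q8_0", "q4_0", "turbo4", "turbo3"]

for ct in CACHE_TYPES:
    cmd = ["llama-perplexity", "-m", MODEL, "-f", DATASET]
    if ct is not None:
        cmd += ["--cache-type-k", ct, "--cache-type-v", ct]
    run = subprocess.run(cmd, capture_output=True, text=True)
    # llama-perplexity ends with a line like "Final estimate: PPL = 6.9233 +/- 0.04564"
    m = re.search(r"Final estimate: PPL = ([\d.]+) \+/- ([\d.]+)", run.stdout + run.stderr)
    print(ct or "f16", m.group(1) if m else "no PPL found, check the output")
```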
And the results were something I did not expect at all: all quants perform well within the limits. Since I'm quite new to local LLMs, I tried to understand how that's possible, and as far as I can tell, a dense model above 20B params and above Q4 is robust enough to be much less sensitive to KV cache quants. I can confirm that turbo3 was not working well for me with a 35B model, and most small models would probably be totally confused by a highly compressed V cache.
Let me switch to AI from here on: I pasted my results into Gemini and it came up with a nicely formatted post based on our conversation, and I'm happy to use it since English is not my first language.
What is Perplexity (PPL)?
For those new to benchmarking, Perplexity is a measure of how "surprised" a model is by a sequence of text.
* Lower is better.
* A score under 10.0 on Wikitext is generally the mark of a very coherent, "smart" model.
* We are looking at the Delta (change). If a quantization setting increases PPL by more than 0.1–0.2, you'll likely start seeing "drunken" behavior or loops in long conversations.
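If a formula helps: perplexity is the exponential of the average negative log-probability the model assigned to each actual next token. A toy illustration (made-up probabilities, not how llama-perplexity is implemented internally):

```python
import math

# Made-up probabilities the model assigned to the actual next tokens in a sequence
token_probs = [0.20, 0.05, 0.60, 0.10]

# PPL = exp(-mean(log p)); the less "surprised" the model is, the lower the score
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
print(round(math.exp(nll), 2))  # ~6.39 for this toy sequence
```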
Results
The results blew me away. The "common wisdom" that Q4 is unusable appears to be a myth for the 27B+ dense class.
| KV Cache Setting | Perplexity (PPL) | Delta vs. F16 | Verdict |
|---|---|---|---|
| F16 (Baseline) | 6.9233 | - | Reference |
| Q8_0 | 6.9193 | -0.0040 | Identical (Margin of Error) |
| Q4_0 | 6.9381 | +0.0148 | Transparent (Highly Recommended) |
| Turbo4 (4-bit) | 6.9483 | +0.0250 | Excellent |
| Turbo3 (3-bit) | 7.0121 | +0.0888 | Great for Extreme Context |
Observations & Recommendations
1. The Q4 "Sweet Spot" The jump from F16 to Q4_0 is only 0.014. To put that in perspective, the margin of error for the test was 0.045. This means Q4_0 is mathematically indistinguishable from uncompressed cache. If you aren't using Q4 or Q8 on a 3090, you're just wasting VRAM.
2. When to use Turbo3? I’ve been using Turbo3 for a week in programming tasks. It allows for a 200k context window on a single 3090 without breaking a sweat. While the PPL hit is measurable (+0.08), it's still well within the "safe zone."
3. The MoE Exception: While this dense 27B model handles Turbo3 perfectly, I noticed that 35B MoE models tend to loop or error out with 3-bit cache. It seems the "Router" in MoE architectures is much more sensitive to the noise introduced by heavy quantization.
The "Needle in a Haystack" Test
To be 100% sure your setup is safe for production work, try this "Needle in a Haystack" test (a quick script for building the prompt is sketched after the steps):
1. Paste a long piece of code (e.g., 50k tokens).
2. In the middle, hide a very specific, weird comment like // The password is: BANANA-123.
3. Ask the model: "What was the hidden password in the code I gave you?"
4. If it finds it instantly, your 200k context is working perfectly.
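A throwaway script for generating such a haystack could look like this (a sketch; the filler lines and file name are placeholders, in practice you'd paste a real repo dump):

```python
# Sketch: generate a long "haystack" file with one hidden needle in the middle.
# The filler content and file name are placeholders; use a real code dump in practice.
filler = [f"def helper_{i}(x):\n    return x + {i}\n" for i in range(5000)]
needle = "// The password is: BANANA-123\n"

filler.insert(len(filler) // 2, needle)  # bury the needle roughly mid-context

with open("haystack.txt", "w") as f:
    f.writelines(filler)

print("Paste haystack.txt into the chat, then ask:")
print("What was the hidden password in the code I gave you?")
```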
TL;DR: Don't fear KV quantization on 27B+ models. Q4_0 is a "free lunch," and Turbo3 is a game-changer for repo-level coding if you need the 200k+ context.
vevi33@reddit
Did you do benchmarks on long context? Above 100k? I only experience issues with KV cache quantization, even Q8, when the context grows.
Mart-McUH@reddit
wiki.test is maybe too common to be a good test (e.g. it will be better preserved than more outlier texts). Another problem is that I think the test is only done on short prompts, like ~1k tokens or so? KV quantization is felt mostly with long contexts and with understanding subtle relations/subtext within the context. Most benchmarks do not measure this.
In short - this is not to challenge the results, but the test is probably not the best one to show the detrimental effects.
fragment_me@reddit
Great job updating the post and following up. I have two pieces of constructive criticism: 1. Stop using an LLM for writing your post; maybe just use it for the tables. 2. Run your benchmarks multiple times (probably need like 3-5 runs) for the results to be meaningful.
fragment_me@reddit
Here friend, you can run this to also get KLD.
/home/user/llm/llama.cpp/build/bin/llama-perplexity -m /home/user/llm/models/Qwen3.5-27B/Qwen3.5-27B-UD-Q4_K_XL.gguf -f /home/user/llm/wikitext-2-raw/wiki.test.raw -t 8 -c 512 -fa on --cache-type-k f16 --cache-type-v f16 --no-mmap -ngl 999 --kl-divergence-base /home/user/llm/models/Qwen3.5-27B/Qwen3.5-27B-UD-Q4_K_XL-f16.logits.bin
Final estimate: PPL = 6.9606 +/- 0.04552
/home/user/llm/llama.cpp/build/bin/llama-perplexity -m /home/user/llm/models/Qwen3.5-27B/Qwen3.5-27B-UD-Q4_K_XL.gguf -f /home/user/llm/wikitext-2-raw/wiki.test.raw -t 8 -c 512 -fa on --cache-type-k q8_0 --cache-type-v q8_0 --no-mmap -ngl 999 --kl-divergence --kl-divergence-base /home/user/llm/models/Qwen3.5-27B/Qwen3.5-27B-UD-Q4_K_XL-f16.logits.bin
Notice the second command has an extra parameter: --kl-divergence
You should get output like this:
PLEASE NOTE THIS IS NOT THE RESULT OF THE LAST COMMAND JUST AN EXAMPLE OF WHAT IT WILL LOOK LIKE
====== Perplexity statistics ======
Mean PPL(Q) : 6.961169 ± 0.045531
Mean PPL(base) : 6.861779 ± 0.044615
Cor(ln(PPL(Q)), ln(PPL(base))): 99.62%
Mean ln(PPL(Q)/PPL(base)) : 0.014381 ± 0.000572
Mean PPL(Q)/PPL(base) : 1.014485 ± 0.000580
Mean PPL(Q)-PPL(base) : 0.099391 ± 0.004048
====== KL divergence statistics ======
Mean KLD: 0.014832 ± 0.000481
Maximum KLD: 20.104038
99.9% KLD: 1.460476
99.0% KLD: 0.121376
95.0% KLD: 0.032988
90.0% KLD: 0.019502
Median KLD: 0.004123
10.0% KLD: 0.000134
5.0% KLD: 0.000039
1.0% KLD: 0.000005
0.1% KLD: -0.000000
Minimum KLD: -0.000050
====== Token probability statistics ======
Mean Δp: -0.209 ± 0.009 %
Maximum Δp: 99.423%
99.9% Δp: 20.815%
99.0% Δp: 6.874%
95.0% Δp: 3.051%
90.0% Δp: 1.741%
75.0% Δp: 0.332%
Median Δp: -0.006%
25.0% Δp: -0.573%
10.0% Δp: -2.265%
5.0% Δp: -3.837%
1.0% Δp: -9.420%
0.1% Δp: -30.138%
Minimum Δp: -99.576%
RMS Δp : 3.343 ± 0.059 %
Same top p: 95.581 ± 0.053 %
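For intuition, the "Mean KLD" line is the average KL divergence between the base (f16) and quantized runs' next-token distributions over the whole vocabulary, roughly like this toy sketch (illustrative only, not llama.cpp's actual code):

```python
import math

def kl_divergence(p, q):
    # KL(P || Q) = sum_i p_i * ln(p_i / q_i) over the vocabulary
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over a 4-token "vocabulary"
base = [0.70, 0.20, 0.08, 0.02]   # f16 KV cache
quant = [0.68, 0.21, 0.09, 0.02]  # quantized KV cache

print(kl_divergence(base, quant))  # ~0.001: the distributions barely moved
```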
imgroot9@reddit (OP)
thanks, updated the post
thetaFAANG@reddit
How do you guys even like this model, it repeats itself and does amateurish things
Is this just a benchmarking cult?
imgroot9@reddit (OP)
you may have misconfigured something. zero repeats for me and it's absolutely the best coding model ever that fits in 24GB VRAM. not the best as an architect, that's true, but when something is fairly complex, I create the plan with a cloud model, then implement it locally.
thetaFAANG@reddit
I have it running in omlx and opencode; the only misconfigurations would be default settings
666666thats6sixes@reddit
The current version of omlx is broken, there are multiple unresolved regressions. Downgrading to omlx-3.6 fixes them, including the looping.
Betadoggo_@reddit
PPL and KLD are no longer good references for quality loss as shown in the PR that added activation rotation. Q4 kv shows a minimal loss in both metrics but actually causes a huge dropoff in AIME even after the PR which improved it significantly.
https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4150413357
IrisColt@reddit
This.
Fit_Split_9933@reddit
Does this mean the Q5 is a good choice?
fragment_me@reddit
It seems to be more model-specific (the effect size), but you're definitely right that the stats aren't everything. Check this table out from the official SGLang docs:
https://docs.sglang.io/docs/advanced_features/quantized_kv_cache
jubilantcoffin@reddit
The results you post are FP4 not Q4.
fragment_me@reddit
That's a fair point, but I think it still points to something related.
Betadoggo_@reddit
I haven't seen this before, thanks for pointing it out. It's interesting that AIME seems to be hit the hardest regardless of model.
Ok-Measurement-1575@reddit
How can KLD not be a good reference point, though?
jubilantcoffin@reddit
It is. The argument is really insanely stupid: the dev picked the absolute worst case to more easily demonstrate improvements and now it's being argued as if this is the typical case.
jubilantcoffin@reddit
This is a stupid argument because the developer intentionally picked this as a known worst case.
imgroot9@reddit (OP)
Thank you! Let me try to test this today and add it to the post.
Old-Sherbert-4495@reddit
it def does its job and saves VRAM for me, but at a brutal cost in performance.
imgroot9@reddit (OP)
no noticeable change for me. I have a Thunderbolt 3090 eGPU, so I keep everything in VRAM to avoid using that connection. my card is also limited to 250W so there's no noise, and I get around 25-28 t/s no matter what. prefill is also more or less the same. of course, it gets a bit slower with a big context, but that's expected.
MmmmMorphine@reddit
I thought it was K that shouldn't be compressed and V should be the target?
imgroot9@reddit (OP)
the general rule: K is more compressible without hurting output quality.
MmmmMorphine@reddit
No worries, just wasn't sure whether I got mixed up myself
Ranmark@reddit
I've tried to download TheTom's release of turboquant_plus, but it doesn't seem to work for me. I try to run a model via a command that works on mainline llama.cpp (turbo4 on the V cache being the only difference), but it just doesn't run, no errors. Maybe it has something to do with my old hardware (GTX 1080 Ti + RTX 2060 Super).
TheRenegadeKaladian@reddit
I'm doing back-to-back comparisons of TheTom's branch and the main branch. Did you also try ik_llama? I'm actually getting more performance on ik_llama.
dodistyo@reddit
Thanks for this man! I always use q4 for KV cache because i need to have enough room to do the actual work.
did you test a long-running coding session with that 200k? local models that size tend to degrade in performance toward the end of the window.
Fit_Split_9933@reddit
In my experience, using q4 for KV can reduce PP speed by about ten times because it uses the CPU instead of the GPU. What about you?
dodistyo@reddit
I usually offload everything to GPU for speed.
Fit_Split_9933@reddit
What I mean is, even with enough VRAM and all layers offloaded to the GPU, using q4 for KV will force PP onto the CPU.
dodistyo@reddit
Well, I honestly don't know about that and I'm not so sure either.
Sticking_to_Decaf@reddit
My agent tends to run context compression at about 120k tokens. It did fine up until that point, but things get messy after a couple of rounds of compression.
dodistyo@reddit
which quant did you use? also, I haven't tried turbo3; I wonder how it compares with q4.
Sticking_to_Decaf@reddit
FP8 both for the model and for kv cache. Model FP8 is the one released by Qwen. I am very cautious about quants because the specific settings used when creating a quant can matter a lot more than q4 vs q6 vs q8 vs fp8 vs nvfp4 etc. If the person making the quant doesn’t know what they are doing or isn’t careful it’s going to be messed up.
Anbeeld@reddit
Is it just me, or does enabling any KV cache quantization make everything slow as hell, especially prefill? I have a 5700X3D and a 3090.
tmvr@reddit
Someone else with a 3090 wrote this a few days ago, so I checked, and I don't see a huge dropoff with a 4090 and the CUDA 12 build of llama.cpp b8733. I get about an 8-9% drop at 128K+ context and 4-5% at 64K when switching from FP16 to q8_0.
fiddlerwoaroof@reddit
I’ve had to watch llama.cpp logs for this: in some cases, I get graph splits because the quantization isn’t supported on-GPU and so the data has to be shipped back to main memory and quantized on the CPU
Finanzamt_Endgegner@reddit
PPL is important but we should also test KLD. still, I really hope this is true; the model already seems to be exceptionally error-resistant to quantization of the weights 🤯
Middle_Bullfrog_6173@reddit
KLD is important but you should really test accuracy on actual tasks.
Small PPL and KLD deltas can still hide failures in reasoning.
Finanzamt_Endgegner@reddit
Yeah ofc, the more testing the better 😉
imgroot9@reddit (OP)
agree, but I assume if the PPL delta is very low, it's a pretty safe bet that KLD is also minimal
Digger412@reddit
That's not necessarily true. PPL is a measure of surprisal at the next token, and it only considers the probability assigned to the actual next token. KLD measures the difference in the distribution across the entire vocab.
A quote from the YAQA paper that has my favorite definition for PPL vs KLD:
TomLucidor@reddit
run Vectera and other hallucination benchmarks please
leonbollerup@reddit
Is there some page where optimal settings for models get collected, or should we build something?
admajic@reddit
Do you find that once you get close to 180k context, the tokens/s is half the initial speed?
EbbNorth7735@reddit
I literally just tried turboquant in vLLM and it told me it couldn't be used with Qwen's architecture. Does anyone know if Copilot lied about what command to use? Can it be done with vLLM?
Velocita84@reddit
A certain ppl score on wikitext doesn't mean anything. Gemma 4 scores in the thousands and works just fine.
FullOf_Bad_Ideas@reddit
That's because of baked-in chat templates.
You wouldn't be able to use it without them.
Velocita84@reddit
Uh, yeah? Same goes for any other model. I'm saying the raw PPL score on wikitext is irrelevant. The delta is OK if the base score isn't ridiculously high like with Gemma 4, but KLD is still better.
FullOf_Bad_Ideas@reddit
not all models have chat templates baked in to the same degree and they can have normal perplexity even when they're instruct-tuned.
I think it's irrelevant only when it's ridiculously high on a model that clearly works well for chat.
Velocita84@reddit
Ok, i think i misunderstood what you were trying to say. Did you mean that gemma 4 specifically is overtrained for instruct at the cost of nonsensical raw text performance? Because yes, that's true
FullOf_Bad_Ideas@reddit
Yes, all instruct models have chat template tokens trained in, but if they respond to a perplexity test that doesn't follow their chat template by completely collapsing, that's unusual. Most models do have reasonable perplexity regardless of those chat template tokens. Maybe it's some form of securing the model, maybe it's a bug. But I don't think this invalidates perplexity as a metric of output quality for LLMs that don't show this behavior.
imgroot9@reddit (OP)
added a comment to the post, thanks
hectaaaa@reddit
Commenting to get updates on this, seems interesting!
BringMeTheBoreWorms@reddit
Have you been using the latest release of llama.cpp? Optimisations based on turboquant went in early April that make q8 and q4 much less lobotomising.
I think q8 with llama.cpp is pretty safe to use as a default for most setups now.
Trouble with turboquant is that you have to use a build which is not up to date with the latest llama.cpp.
imgroot9@reddit (OP)
turboquant is also refreshed, so it is not much behind. but the main takeaway here is that q4 is almost identical to f16 if you have a 20B+ dense model.
BringMeTheBoreWorms@reddit
Yeah, it's pretty damn good. I'll be using q8 for a while before I trust q4 on everything, though. Bigger, broader testing might show where it starts to fail.