KV cache quant benchmarks: q5 & q6 are underrated, q8/q4 is bad, TCQ has a niche
Posted by Anbeeld@reddit | LocalLLaMA | 32 comments
Here's my article with 38 quant pairs thoroughly benchmarked in KLD with 3 different Qwen 3.6 27B configs: Q5_K_S + 64k context, IQ4_XS + 64k context, IQ4_XS + 128k context. This allows us to track not only how cache quantization affects precision in a vacuum, but also how it interacts with noise from the model itself.
All benchmarks were done using my BeeLlama.cpp fork, which let me include a number of quant types that are not present in mainline llama.cpp: vanilla TurboQuant, TCQ 3-bit/2-bit, and q6_0.
https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context
TL;DR
- q5_0 KV is underrated, and same for q5_1 as V cache. Both really don't get the attention they deserve. Data shows they provide solid mid-range performance without being as heavy as q8_0 nor as shitty as q4_0.
- q8_0 / q4_* is overrated. Strong K does not fully rescue weak V, and those pairs are too unbalanced and perform worse than the community reputation suggests.
- Prefer sane KV quants over wasting VRAM on bf16 cache for heavily quantized weights. A Q4/IQ4 model with full bf16 KV looks like the wrong trade to me, and both draw from the same VRAM pool so you might want to balance them better.
- Practical ladder: q8_0 / q6_0 or q8_0 / q5_1 for high-end, q6_0 / q5_0 for extra headroom, q5_0 / q5_0 or q5_0 / q4_1 when VRAM is tight, q4_0 / q4_0 only if no other options allow to fit the desired context.
- TurboQuant is confirmed to be useful only as extreme compression. turbo3_tcq is the only type with decent quality per size, turbo4 is basically useless while also being slow.
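For anyone who wants to estimate the cache footprint of a K/V pair before trying it, here is a minimal Python sketch. It assumes K and V hold the same number of values and that size is simply bits per value relative to bf16. The bits-per-value figures for q8_0, q5_1, q5_0, q4_1 and q4_0 follow from llama.cpp's 32-value block layouts; the 6.5 for q6_0 is inferred from the 40.6% in the Size column of the table below, not taken from the fork's source. `cache_size_percent` is a hypothetical helper name.

```python
# Bits stored per cached value, including per-block scale/offset overhead.
# q6_0 is an assumption inferred from the Size column, not from source code.
BITS_PER_VALUE = {
    "bf16": 16.0,
    "q8_0": 8.5,
    "q6_0": 6.5,
    "q5_1": 6.0,
    "q5_0": 5.5,
    "q4_1": 5.0,
    "q4_0": 4.5,
}


def cache_size_percent(k_type, v_type=None):
    """Size of a K/V cache pair as a percentage of a full bf16 cache.

    Assumes K and V contain the same number of values.
    """
    if v_type is None:
        v_type = k_type  # same type for both K and V
    bits = BITS_PER_VALUE[k_type] + BITS_PER_VALUE[v_type]
    return bits / (2 * BITS_PER_VALUE["bf16"]) * 100.0


if __name__ == "__main__":
    for pair in [("q8_0", "q8_0"), ("q8_0", "q5_1"), ("q6_0", "q5_0"), ("q4_0", "q4_0")]:
        print(pair, f"{cache_size_percent(*pair):.1f}%")
```

For the pairs listed, this reproduces the Size column (53.1%, 45.3%, 37.5%, 28.1%), which suggests the column is a plain bits-per-value ratio rather than a measured allocation.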
KLD results on Q5_K_S + 64k context
The rest of benchmark data and in-depth analysis are available in the article.
| Cache | Size | Mean KLD | Mean precision | 99.9% KLD | 99.9% precision | Tok/s |
|---|---|---|---|---|---|---|
| bf16 | 100.0% | 0.000375 | 100.00% | 0.023258 | 100.00% | 850.81 |
| q8_0 | 53.1% | 0.002328 | 99.80% | 0.078709 | 94.61% | 851.11 |
| q8_0-q6_0 | 46.9% | 0.002499 | 99.79% | 0.081616 | 94.33% | 848.78 |
| q8_0-q5_1 | 45.3% | 0.002529 | 99.78% | 0.082880 | 94.21% | 828.63 |
| q8_0-q5_0 | 43.8% | 0.002656 | 99.77% | 0.088486 | 93.69% | 847.33 |
| q8_0-q4_1 | 42.2% | 0.003080 | 99.73% | 0.099080 | 92.70% | 786.54 |
| q8_0-q4_0 | 40.6% | 0.003316 | 99.71% | 0.104680 | 92.18% | 849.37 |
| q6_0 | 40.6% | 0.002614 | 99.78% | 0.090800 | 93.47% | 845.96 |
| q8_0-turbo4 | 39.5% | 0.003561 | 99.68% | 0.103041 | 92.33% | 838.90 |
| q6_0-q5_1 | 39.1% | 0.002781 | 99.76% | 0.090447 | 93.50% | 846.24 |
| q5_1 | 37.5% | 0.002911 | 99.75% | 0.098354 | 92.77% | 841.65 |
| q6_0-q5_0 | 37.5% | 0.002820 | 99.76% | 0.092682 | 93.29% | 846.86 |
| q8_0-turbo3_tcq | 36.7% | 0.005090 | 99.53% | 0.149387 | 88.15% | 817.57 |
| q6_0-q4_1 | 35.9% | 0.003312 | 99.71% | 0.104582 | 92.19% | 848.42 |
| q5_0 | 34.4% | 0.003206 | 99.72% | 0.099073 | 92.70% | 849.79 |
| q5_1-q4_1 | 34.4% | 0.003380 | 99.70% | 0.095011 | 93.08% | 846.27 |
| q6_0-q4_0 | 34.4% | 0.003288 | 99.71% | 0.111566 | 91.55% | 848.24 |
| q6_0-turbo4 | 33.2% | 0.003748 | 99.66% | 0.107377 | 91.93% | 837.77 |
| q5_0-q4_1 | 32.8% | 0.003471 | 99.69% | 0.099618 | 92.65% | 847.59 |
| q5_1-q4_0 | 32.8% | 0.003626 | 99.68% | 0.108649 | 91.82% | 846.91 |
| q4_1 | 31.3% | 0.004476 | 99.59% | 0.141813 | 88.82% | 854.33 |
| q5_0-q4_0 | 31.3% | 0.003581 | 99.68% | 0.113332 | 91.39% | 847.64 |
| q6_0-turbo3_tcq | 30.5% | 0.005379 | 99.50% | 0.154680 | 87.68% | 819.23 |
| q5_0-turbo4 | 30.1% | 0.003812 | 99.66% | 0.112249 | 91.49% | 837.52 |
| q5_1-turbo3_tcq | 28.9% | 0.005594 | 99.48% | 0.144591 | 88.57% | 816.05 |
| q4_0 | 28.1% | 0.004711 | 99.57% | 0.130419 | 89.84% | 855.08 |
| q5_0-turbo3_tcq | 27.3% | 0.005471 | 99.49% | 0.158514 | 87.35% | 815.80 |
| q5_0-turbo3 | 27.0% | 0.007097 | 99.33% | 0.192428 | 84.44% | 837.90 |
| q4_1-turbo3_tcq | 25.8% | 0.006184 | 99.42% | 0.174831 | 85.94% | 816.95 |
| turbo4 | 25.8% | 0.004760 | 99.55% | 0.138370 | 89.13% | 705.32 |
| q4_0-turbo3_tcq | 24.2% | 0.006269 | 99.41% | 0.186572 | 84.93% | 821.89 |
| q4_0-turbo3 | 23.8% | 0.008235 | 99.22% | 0.222154 | 81.96% | 839.29 |
| q4_0-turbo2_tcq | 21.1% | 0.015168 | 98.53% | 0.395244 | 68.94% | 826.07 |
| turbo3_tcq | 20.3% | 0.007978 | 99.24% | 0.227104 | 81.56% | 795.20 |
| turbo3 | 19.5% | 0.011181 | 98.93% | 0.296060 | 76.12% | 836.75 |
| turbo3_tcq-turbo2_tcq | 17.2% | 0.016386 | 98.41% | 0.437043 | 66.11% | 796.16 |
| turbo3-turbo2 | 16.4% | 0.023985 | 97.67% | 0.605087 | 55.89% | 831.88 |
| turbo2_tcq | 14.1% | 0.023073 | 97.76% | 0.632401 | 54.38% | 807.25 |
| turbo2 | 13.3% | 0.036230 | 96.48% | 0.903576 | 41.47% | 842.29 |
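
A note on reading the precision columns: the values appear consistent with exp(-(KLD - KLD_bf16)), i.e. the KLD in excess of the bf16 baseline mapped through an exponential. That formula is inferred from the numbers in the table, not quoted from the article, so treat this Python sketch as a way to sanity-check rows rather than as the article's definition. `kld_precision` is a hypothetical helper name.

```python
import math

# bf16 baseline values copied from the first row of the table.
BF16_MEAN_KLD = 0.000375
BF16_P999_KLD = 0.023258


def kld_precision(kld, baseline_kld):
    """Precision in percent, assuming precision = exp(-(kld - baseline)).

    This relation is an inference from the table, not a documented formula.
    """
    return math.exp(-(kld - baseline_kld)) * 100.0


if __name__ == "__main__":
    # q8_0 row: mean KLD 0.002328, 99.9% KLD 0.078709
    print(f"{kld_precision(0.002328, BF16_MEAN_KLD):.2f}%")
    print(f"{kld_precision(0.078709, BF16_P999_KLD):.2f}%")
```

Under this reading, a row's precision drops because of the gap to bf16, not because of its absolute KLD, which is why bf16 shows 100.00% despite a nonzero KLD.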
can999999999@reddit
I hate this "removed by Reddit's filters" bs so much, I wanted to save this for later. Can't post half the stuff I want to for the same reason.
bobaburger@reddit
Nice work! Thank you so much for doing this.
Also rendered to a diagram for easier grasp (bf16 omitted)
Miserable-Dare5090@reddit
I can't believe they removed this post, it was very useful. Lmk if you have this diagram anywhere else!
bobaburger@reddit
I know, sometimes the auto-mod in this sub is very questionable. Maybe u/Anbeeld should message the mods to get this up.
As for this diagram, it's basically just an interpretation of OP's data so I'll see if I have the chance to share it somewhere else.
PulseVector@reddit
Thanks for putting together all of this relevant information! I've been struggling with Qwen3.6 27B and usable KV quant values. It was my impression that there was a much bigger difference in accuracy between bf16 and q8_0, and am glad to see that's not true.
I'm especially interested in giving q8_0-turbo4 a try, since that ~17% space savings over Q8_0 is very appealing.
One question, do you know if this also applies to the new MTP settings such as:
--cache-type-k-draft q8_0
--cache-type-v-draft q8_0
Appreciate it!
ixdx@reddit
Can you also test with kv=f16? bf16 is not always suitable, as performance drops significantly with large contexts on at least the RTX5060Ti/RTX5070Ti.
No_Lingonberry1201@reddit
Am I reading it correctly that q6_0 and q5_1 are barely worse than q8_0?
Anbeeld@reddit (OP)
Yes, BUT in KLD benchmark. This BUT may become BUTT if you are working with something like data where you absolutely want the context to be the most precise possible. But for other usages where model quantization has more effect and cache quantization has less, you can opt for q6_0s and q5_1s of the world and raise your context size and/or model quant instead, gaining stuff like better coding output.
No_Lingonberry1201@reddit
I see, thanks for the clarification. Most of my use cases are for coding, but I do use it for data as well.
Anbeeld@reddit (OP)
You can create multiple different launch configs for various use cases using different model and cache quants (and context size), so you can use whatever is best for the current task.
No_Lingonberry1201@reddit
So if I have, say a model preset with a Qwen3.6-27B-coder and Qwen3.6-27B-data, one with q6_0 and the other with q8_0, then it won't load the model files twice in the memory? Because so far I haven't experimented with it too much.
Anbeeld@reddit (OP)
What I meant is you can restart the server to use whatever config you need at the moment, which is of course more maintenance but doesn't take too much time.
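A minimal Python sketch of that per-task launch config idea, under stated assumptions: the preset names, model file names and context sizes are hypothetical placeholders, and the cache pairs are just examples picked from this thread. The flags used (`-m`, `-c`, `--cache-type-k`, `--cache-type-v`) are mainline llama.cpp server options; quantized V cache has generally also needed flash attention enabled, so check `--help` on your build.

```python
import shlex

# Hypothetical presets: file names and context sizes are placeholders.
PRESETS = {
    # Precision-sensitive data work: spend VRAM on the cache.
    "data": {"model": "model-IQ4_XS.gguf", "ctx": 65536, "k": "q8_0", "v": "q8_0"},
    # Coding: cheaper V cache, VRAM goes to a higher model quant instead.
    "coding": {"model": "model-Q5_K_S.gguf", "ctx": 65536, "k": "q8_0", "v": "q5_1"},
}


def build_args(preset_name):
    """Return the llama-server argument list for one preset."""
    p = PRESETS[preset_name]
    return [
        "llama-server",
        "-m", p["model"],
        "-c", str(p["ctx"]),
        "--cache-type-k", p["k"],
        "--cache-type-v", p["v"],
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line for each preset.
    for name in PRESETS:
        print(f"# {name}")
        print(shlex.join(build_args(name)))
```

Only one of these runs at a time; switching tasks means stopping the server and starting it again with the other command.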
No_Lingonberry1201@reddit
Ah ok, thanks for the clarification.
laul_pogan@reddit
The VRAM-pool point hits hard in vLLM too: --gpu-memory-utilization reserves that fraction for KV cache after weights load, and vLLM allocates cache slots from that pool at startup. On a 27B in bf16 weights, dropping KV from bf16 to q6/q5 doesn't just save VRAM in the abstract, it directly multiplies the number of live cache slots vLLM can pre-allocate. Running q6_0/q5_0 KV instead of bf16 on the same 60% utilization budget roughly doubles concurrent context capacity before any swapping kicks in. So OP's ladder isn't just a quality tradeoff, it's also a concurrency tradeoff for any server-mode runtime that pre-allocates the cache at init.
BitGreen1270@reddit
Thanks for doing this - so q8/q8 is less practical than q8/q6?
Anbeeld@reddit (OP)
The benchmarks are targeted at us VRAM-constrained folks, the people. And with our petty 16 GB and 24 GB VRAM the q8 is often just impractical all around, at least with dense models. But even with more VRAM you can opt for q8/q5_1 or q8/q6 if it's available (which it isn't in mainline llama.cpp) and gain larger context size and/or higher model quant by using freed up VRAM.
BitGreen1270@reddit
I've got a 5090 and I'm also hitting constraints. Just bumped down from Gemma 31b Q6_K_L to Q5_K_L so I could have more layers on GPU at higher context.
I haven't used beellama.cpp but might try it out if it gives me stability at higher model quant.
Anbeeld@reddit (OP)
Try q8_0/q5_1 in your existing setup first, the difference to q8_0/q6_0 is minor and BeeLlama version that supports q6_0 is still WIP anyways. :P
BitGreen1270@reddit
Oh I didn't know q5_1 is available in main llama as well. Amazing, I can at least start Gemma 4 31B with 64k context and 99 ngl. Couldn't even do that before. Thanks 👍, I'll play around with this tomorrow and see how it goes
taking_bullet@reddit
Q8_0 is always my primary choice, but when it comes to Qwen 27B I'm forced to use Q6_0 (because of 32GB VRAM).
soyalemujica@reddit
I've sat with Q5_1/Q4_1 in 120k context in C++ agentic coding, have not experienced a single hallucination.
Anbeeld@reddit (OP)
It's really solid for the price! If you go any lower, it really falls off a cliff, but up until this point it holds up fine. At least with Qwen 3.6 27B.
soyalemujica@reddit
Yeah, got to mention that, I'm running Qwen 3.6 27B at Q5KM with 24gb vram and it's amazing!
Anbeeld@reddit (OP)
Q5 is kind of a sweet spot there, ain't it? At Q4 the math just gets dire with how squashed the values become, but Q5 is a 2x step up so the degradation is much less obvious.
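To make the "2x step" concrete, here is a simplified Python sketch: one extra bit doubles the number of representable levels, which roughly halves the rounding error. It uses plain symmetric absmax quantization over blocks of 32 values on synthetic Gaussian data. That is an assumption made for illustration, not llama.cpp's actual Q4/Q5 block formats, and the function names are hypothetical.

```python
import math
import random


def quantize_block(values, bits):
    """Round-trip one block through symmetric absmax quantization."""
    levels = 2 ** (bits - 1) - 1  # 7 for 4 bits, 15 for 5 bits
    amax = max(abs(v) for v in values)
    if amax == 0:
        return list(values)
    scale = amax / levels
    return [round(v / scale) * scale for v in values]


def rms_error(values, bits, block_size=32):
    """RMS round-trip error when quantizing in independent blocks."""
    squared = 0.0
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        restored = quantize_block(block, bits)
        squared += sum((a - b) ** 2 for a, b in zip(block, restored))
    return math.sqrt(squared / len(values))


random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic stand-in data
err4 = rms_error(data, 4)
err5 = rms_error(data, 5)

if __name__ == "__main__":
    print(f"4-bit RMS error: {err4:.4f}")
    print(f"5-bit RMS error: {err5:.4f}")
    print(f"ratio: {err4 / err5:.2f}")
```

Real formats add offsets, sub-block scales or rotation, so the exact ratio differs, but the step size argument is the same.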
jeekp@reddit
Thanks for sharing. The 99.9% precision metric better matches my anecdotal experience with KV cache quant. That is to say, I've found the accuracy hit does not justify its use for data normalization or coding.
Anbeeld@reddit (OP)
For data, the KV cache is indeed held to higher expectations, so that it doesn't mess up the input specifically. But for coding it's the same question of balancing quants of the model vs cache, as lower model quant = worse code.
ggyurov@reddit
turbo4 worse than q4 ??? Lol, and TurboQuant is advertised as quality saver.
Anbeeld@reddit (OP)
As described in the article, it's because q4 is not some old ass dinosaur, but basically TurboQuant-lite since llama.cpp added rotation to the base types too after TQ got traction. In the end it ends up being even better than turbo4 except some edge cases, also honestly both are quite shit the moment you want to do some tool calls at high context and everything falls apart.
Mountain_Patience231@reddit
but your chart shows Q8 is the best of the rest
Anbeeld@reddit (OP)
Yeah, and bf16 is even better than q8! But neither fit into my RTX 3090 at 100k with decent model quant and speculative decoding. And even if they would, I could do 200k with cheaper quants instead.
And that's in 24 GB. If you have 16 GB, like many many many people here do, q8 is just a luxury, at least with dense models where you can't offload MoE into RAM.
Mountain_Patience231@reddit
I think I will try q6_0, it's looking good on your data
Anbeeld@reddit (OP)
Keep in mind it's not in mainline llama.cpp, you need ik_llama.cpp (they also have a ton of other quants)... or BeeLlama v0.3.0 which is WIP at the moment. :P