KV cache quant benchmarks: q5 & q6 are underrated, q8/q4 is bad, TCQ has a niche

Posted by Anbeeld@reddit | LocalLLaMA | View on Reddit | 32 comments

Here's my article with 38 quant pairs thoroughly benchmarked in KLD across 3 different Qwen 3.6 27B configs: Q5_K_S + 64k context, IQ4_XS + 64k context, and IQ4_XS + 128k context. This lets us track not only how cache quantization affects precision in a vacuum, but also how it interacts with noise from the model quantization itself.

All benchmarks were run on my BeeLlama.cpp fork, which let me include a number of quant types that are not present in mainline llama.cpp: vanilla TurboQuant, TCQ 3-bit/2-bit, and q6_0.

https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context
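
For readers unfamiliar with the metric, here is a minimal sketch of how a per-token KL divergence and its summary statistics can be computed. It is a plain-Python illustration, not the code used for the article; the function names and the nearest-rank percentile are assumptions made for the example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) in nats for a single token position."""
    ref_lp = log_softmax(ref_logits)
    test_lp = log_softmax(test_logits)
    return sum(math.exp(r) * (r - t) for r, t in zip(ref_lp, test_lp))

def summarize(per_token_kld, tail=0.999):
    """Return (mean, tail percentile) using a nearest-rank percentile."""
    ordered = sorted(per_token_kld)
    rank = math.ceil(tail * len(ordered))      # 1-based nearest rank
    idx = min(len(ordered), max(1, rank)) - 1
    return sum(ordered) / len(ordered), ordered[idx]

# Toy example: reference logits vs. logits from a run with a quantized cache.
ref = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
test = [[1.9, 0.6, -1.0], [0.1, 0.2, 0.3]]
klds = [kl_divergence(r, t) for r, t in zip(ref, test)]
mean_kld, tail_kld = summarize(klds)
print(mean_kld, tail_kld)
```

The mean shows the average drift, while the 99.9% value shows how bad the worst token positions get, which is why both appear in the table.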

TL;DR

q5_0 KV is underrated, and the same goes for q5_1 as V cache. Neither gets the attention it deserves. The data shows they provide solid mid-range performance without being as heavy as q8_0 or as shitty as q4_0.
q8_0 / q4_* is overrated. Strong K does not fully rescue weak V, and those pairs are too unbalanced and perform worse than the community reputation suggests.
Prefer sane KV quants over wasting VRAM on bf16 cache for heavily quantized weights. A Q4/IQ4 model with full bf16 KV looks like the wrong trade to me, and both draw from the same VRAM pool so you might want to balance them better.
Practical ladder: q8_0 / q6_0 or q8_0 / q5_1 for high-end, q6_0 / q5_0 for extra headroom, q5_0 / q5_0 or q5_0 / q4_1 when VRAM is tight, q4_0 / q4_0 only if nothing else fits the desired context.
TurboQuant is confirmed to be useful only as extreme compression. turbo3_tcq is the only type with decent quality per size, turbo4 is basically useless while also being slow.
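
The q8_0 / q4_* point can be checked against the Q5_K_S + 64k table. A minimal sketch: a pair counts as dominated if some other pair is no larger and no worse in mean KLD, and strictly better on at least one of the two. The rows are copied from the table; the helper name is hypothetical, and the check only looks at mean KLD, not the 99.9% tail.

```python
# (cache pair, size % of bf16, mean KLD) copied from the Q5_K_S + 64k table
rows = [
    ("q8_0-q5_1", 45.3, 0.002529),
    ("q8_0-q4_1", 42.2, 0.003080),
    ("q8_0-q4_0", 40.6, 0.003316),
    ("q6_0",      40.6, 0.002614),
    ("q6_0-q5_0", 37.5, 0.002820),
    ("q5_0",      34.4, 0.003206),
    ("q5_0-q4_1", 32.8, 0.003471),
    ("q4_0",      28.1, 0.004711),
]

def dominated_by(row, others):
    """Names of rows that are no larger, no worse in mean KLD, and strictly better in one."""
    _, size, kld = row
    return [n for n, s, k in others
            if s <= size and k <= kld and (s < size or k < kld)]

for row in rows:
    better = dominated_by(row, rows)
    if better:
        print(f"{row[0]} is dominated by: {', '.join(better)}")
```

On this subset only the two q8_0 / q4_* pairs get flagged: q6_0 / q6_0 and q6_0 / q5_0 are both smaller or equal in size and lower in mean KLD.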

KLD results on Q5_K_S + 64k context

The rest of the benchmark data and the in-depth analysis are available in the article.

Cache	Size	Mean KLD	Mean precision	99.9% KLD	99.9% precision	Tok/s
bf16	100.0%	0.000375	100.00%	0.023258	100.00%	850.81
q8_0	53.1%	0.002328	99.80%	0.078709	94.61%	851.11
q8_0-q6_0	46.9%	0.002499	99.79%	0.081616	94.33%	848.78
q8_0-q5_1	45.3%	0.002529	99.78%	0.082880	94.21%	828.63
q8_0-q5_0	43.8%	0.002656	99.77%	0.088486	93.69%	847.33
q8_0-q4_1	42.2%	0.003080	99.73%	0.099080	92.70%	786.54
q8_0-q4_0	40.6%	0.003316	99.71%	0.104680	92.18%	849.37
q6_0	40.6%	0.002614	99.78%	0.090800	93.47%	845.96
q8_0-turbo4	39.5%	0.003561	99.68%	0.103041	92.33%	838.90
q6_0-q5_1	39.1%	0.002781	99.76%	0.090447	93.50%	846.24
q5_1	37.5%	0.002911	99.75%	0.098354	92.77%	841.65
q6_0-q5_0	37.5%	0.002820	99.76%	0.092682	93.29%	846.86
q8_0-turbo3_tcq	36.7%	0.005090	99.53%	0.149387	88.15%	817.57
q6_0-q4_1	35.9%	0.003312	99.71%	0.104582	92.19%	848.42
q5_0	34.4%	0.003206	99.72%	0.099073	92.70%	849.79
q5_1-q4_1	34.4%	0.003380	99.70%	0.095011	93.08%	846.27
q6_0-q4_0	34.4%	0.003288	99.71%	0.111566	91.55%	848.24
q6_0-turbo4	33.2%	0.003748	99.66%	0.107377	91.93%	837.77
q5_0-q4_1	32.8%	0.003471	99.69%	0.099618	92.65%	847.59
q5_1-q4_0	32.8%	0.003626	99.68%	0.108649	91.82%	846.91
q4_1	31.3%	0.004476	99.59%	0.141813	88.82%	854.33
q5_0-q4_0	31.3%	0.003581	99.68%	0.113332	91.39%	847.64
q6_0-turbo3_tcq	30.5%	0.005379	99.50%	0.154680	87.68%	819.23
q5_0-turbo4	30.1%	0.003812	99.66%	0.112249	91.49%	837.52
q5_1-turbo3_tcq	28.9%	0.005594	99.48%	0.144591	88.57%	816.05
q4_0	28.1%	0.004711	99.57%	0.130419	89.84%	855.08
q5_0-turbo3_tcq	27.3%	0.005471	99.49%	0.158514	87.35%	815.80
q5_0-turbo3	27.0%	0.007097	99.33%	0.192428	84.44%	837.90
q4_1-turbo3_tcq	25.8%	0.006184	99.42%	0.174831	85.94%	816.95
turbo4	25.8%	0.004760	99.55%	0.138370	89.13%	705.32
q4_0-turbo3_tcq	24.2%	0.006269	99.41%	0.186572	84.93%	821.89
q4_0-turbo3	23.8%	0.008235	99.22%	0.222154	81.96%	839.29
q4_0-turbo2_tcq	21.1%	0.015168	98.53%	0.395244	68.94%	826.07
turbo3_tcq	20.3%	0.007978	99.24%	0.227104	81.56%	795.20
turbo3	19.5%	0.011181	98.93%	0.296060	76.12%	836.75
turbo3_tcq-turbo2_tcq	17.2%	0.016386	98.41%	0.437043	66.11%	796.16
turbo3-turbo2	16.4%	0.023985	97.67%	0.605087	55.89%	831.88
turbo2_tcq	14.1%	0.023073	97.76%	0.632401	54.38%	807.25
turbo2	13.3%	0.036230	96.48%	0.903576	41.47%	842.29
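
The Size and precision columns can be reproduced from the raw numbers. The sketch below is reverse-engineered from the table rather than taken from the article: Size matches the average bits per element of the K and V types divided by 16 (bf16), and precision matches exp(-(KLD - KLD of bf16)). The bits-per-element values for q4_0 through q8_0 follow the ggml block layouts; the values for q6_0 and the turbo types are inferred from the Size column and should be treated as assumptions.

```python
import math

# Bits per cached element. q4_0..q8_0 follow the ggml block layouts;
# q6_0 and the turbo* values are inferred from the Size column (assumption).
BITS = {
    "bf16": 16.0, "q8_0": 8.5, "q6_0": 6.5, "q5_1": 6.0, "q5_0": 5.5,
    "q4_1": 5.0, "q4_0": 4.5, "turbo4": 4.125, "turbo3_tcq": 3.25,
    "turbo3": 3.125, "turbo2_tcq": 2.25, "turbo2": 2.125,
}

def size_pct(k_type, v_type=None):
    """Cache size relative to bf16/bf16, in percent. One name means K and V share a type."""
    v_type = v_type or k_type
    return (BITS[k_type] + BITS[v_type]) / (2 * BITS["bf16"]) * 100

def precision_pct(kld, kld_bf16):
    """Precision as it appears in the table: exp of the negative KLD excess over bf16."""
    return math.exp(-(kld - kld_bf16)) * 100

print(f"{size_pct('q8_0', 'q6_0'):.1f}%")            # table: 46.9%
print(f"{precision_pct(0.002328, 0.000375):.2f}%")   # table: 99.80% (q8_0, mean)
print(f"{precision_pct(0.078709, 0.023258):.2f}%")   # table: 94.61% (q8_0, 99.9%)
```

Read this way, the precision columns measure loss relative to the bf16 cache run, which is why bf16 sits at 100.00% even though its own KLD is not zero.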

can999999999@reddit

I hate this "removed by Reddit's filters" bs so much, I wanted to save this for later. Can't post half the stuff I want to for the same reason.

bobaburger@reddit

Nice work! Thank you so much for doing this.

Also rendered it as a diagram for easier grasp (bf16 omitted)

Miserable-Dare5090@reddit

I can't believe they removed this post, it was very useful. Lmk if you have this diagram anywhere else!

bobaburger@reddit

I know, sometimes the auto-mod in this sub is very questionable. Maybe u/Anbeeld should message the mods to get this up.

As for this diagram, it's basically just an interpretation of OP's data, so I'll see if I get the chance to share it somewhere else.

PulseVector@reddit

Thanks for putting together all of this relevant information! I've been struggling with Qwen3.6 27B and usable KV quant values. It was my impression that there was a much bigger difference in accuracy between bf16 and q8_0, and am glad to see that's not true.

I'm especially interested in giving q8_0-turbo4 a try, since that ~17% space savings over Q8_0 is very appealing.

One question, do you know if this also applies to the new MTP settings such as:

--cache-type-k-draft q8_0

--cache-type-v-draft q8_0

Appreciate it!

ixdx@reddit

Can you also test with kv=f16? bf16 is not always suitable, as performance drops significantly with large contexts on at least the RTX5060Ti/RTX5070Ti.

No_Lingonberry1201@reddit

Am I reading it correctly that q6_0 and q5_1 are barely worse than q8_0?

Anbeeld@reddit (OP)

Yes, BUT in a KLD benchmark. This BUT may become BUTT if you are working with something like data, where you absolutely want the context to be as precise as possible. But for other uses, where model quantization has more effect and cache quantization has less, you can opt for the q6_0s and q5_1s of the world and raise your context size and/or model quant instead, gaining stuff like better coding output.

No_Lingonberry1201@reddit

I see, thanks for the clarification. Most of my use cases are for coding, but I do use it for data as well.

Anbeeld@reddit (OP)

You can create multiple different launch configs for various use cases using different model and cache quants (and context size), so you can use whatever is best for the current task.

No_Lingonberry1201@reddit

So if I have, say, model presets for Qwen3.6-27B-coder and Qwen3.6-27B-data, one with q6_0 and the other with q8_0, it won't load the model files into memory twice? I haven't experimented with it much so far.

Anbeeld@reddit (OP)

What I meant is you can restart the server to use whatever config you need at the moment, which is of course more maintenance but doesn't take too much time.
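
A minimal sketch of that workflow: keep a few named presets and build the llama-server command line from whichever one you need, then restart with it. The model paths, preset names and context sizes are hypothetical placeholders; `-m`, `-c`, `-ngl`, `--cache-type-k` and `--cache-type-v` are mainline llama-server options.

```python
import shlex

# Hypothetical presets: file names and context sizes are placeholders.
PRESETS = {
    "coder": {"model": "models/Qwen3.6-27B-Q5_K_S.gguf", "ctx": 65536,
              "cache_k": "q5_0", "cache_v": "q5_0"},
    "data":  {"model": "models/Qwen3.6-27B-IQ4_XS.gguf", "ctx": 32768,
              "cache_k": "q8_0", "cache_v": "q5_1"},
}

def server_argv(name):
    """Build the llama-server argument list for one named preset."""
    p = PRESETS[name]
    return ["llama-server", "-m", p["model"], "-c", str(p["ctx"]), "-ngl", "99",
            "--cache-type-k", p["cache_k"], "--cache-type-v", p["cache_v"]]

# Print the command for the preset you want, then restart the server with it.
print(shlex.join(server_argv("coder")))
```

Depending on the build, a quantized V cache may also require flash attention to be enabled, so check your fork's options before relying on this.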

No_Lingonberry1201@reddit

Ah ok, thanks for the clarification.

laul_pogan@reddit

The VRAM-pool point hits hard in vLLM too: --gpu-memory-utilization caps the fraction of GPU memory the engine may use, and whatever is left of that budget after the weights load is pre-allocated as KV cache at startup. On a 27B with bf16 weights, dropping KV from bf16 to q6/q5 doesn't just save VRAM in the abstract, it directly multiplies the number of live cache slots vLLM can pre-allocate. Running q6_0/q5_0 KV instead of bf16 on the same 60% utilization budget roughly doubles concurrent context capacity before any swapping kicks in. So OP's ladder isn't just a quality tradeoff, it's also a concurrency tradeoff for any server-mode runtime that pre-allocates the cache at init.

BitGreen1270@reddit

Thanks for doing this - so q8/q8 is less practical than q8/q6?

Anbeeld@reddit (OP)

The benchmarks are targeted at us VRAM-constrained folks, the people. And with our petty 16 GB and 24 GB of VRAM, q8 is often just impractical all around, at least with dense models. But even with more VRAM you can opt for q8/q5_1, or q8/q6 if it's available (which it isn't in mainline llama.cpp), and gain a larger context size and/or a higher model quant by using the freed-up VRAM.

BitGreen1270@reddit

I've got a 5090 and I'm also hitting constraints. Just bumped down from Gemma 31b Q6_K_L to Q5_K_L so I could have more layers on GPU at higher context.

I haven't used beellama.cpp but might try it out if it gives me stability at higher model quant.

Anbeeld@reddit (OP)

Try q8_0/q5_1 in your existing setup first, the difference to q8_0/q6_0 is minor and BeeLlama version that supports q6_0 is still WIP anyways. :P

BitGreen1270@reddit

Oh I didn't know q5_1 is available in main llama as well. Amazing, I can at least start Gemma 4 31B with 64k context and 99 ngl. Couldn't even do that before. Thanks 👍, I'll play around with this tomorrow and see how it goes

taking_bullet@reddit

Q8_0 is always my primary choice, but when it comes to Qwen 27B I'm forced to use Q6_0 (because of 32GB VRAM).

soyalemujica@reddit

I've sat with Q5_1/Q4_1 in 120k context in C++ agentic coding, have not experienced a single hallucination.

Anbeeld@reddit (OP)

It's really solid for the price! If you go any lower, it really falls off a cliff, but up until this point it holds up fine. At least with Qwen 3.6 27B.

soyalemujica@reddit

Yeah, got to mention that, I'm running Qwen 3.6 27B at Q5KM with 24gb vram and it's amazing!

Anbeeld@reddit (OP)

Q5 is kind of the sweet spot there, ain't it? At Q4 the math just gets dire with how squashed the values become, but Q5 has twice as many levels (32 vs 16), so the degradation is much less obvious.
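
To put rough numbers on that step, here is a toy sketch with a plain symmetric round-to-nearest quantizer over one 32-element block. It is a simplification for illustration only, not ggml's actual q4_0/q5_0 code, and the random block is made-up data.

```python
import math
import random

def quantize_block(values, bits):
    """Symmetric absmax quantization to signed `bits`-bit integers, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit, 15 for 5-bit
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return list(values)
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def rms_error(values, bits):
    """Root-mean-square error between the original block and its dequantized copy."""
    deq = quantize_block(values, bits)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(values, deq)) / len(values))

random.seed(0)
block = [random.gauss(0.0, 1.0) for _ in range(32)]   # one 32-element block
for bits in (4, 5, 6, 8):
    print(bits, "bits -> RMS error", round(rms_error(block, bits), 4))
```

Each extra bit roughly halves the rounding step, so the jump from 4 to 5 bits removes about half of the error that 4 bits leaves behind.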

jeekp@reddit

Thanks for sharing. The 99.9% precision metric matches my anecdotal experience with KV cache quant better. That is to say, I've found the accuracy hit does not justify its use for data normalization or coding.

Anbeeld@reddit (OP)

For data, the KV cache indeed has to meet higher expectations so it doesn't mess up the input specifically. But for coding it's the same question of balancing the quants of the model vs the cache, as lower model quant = worse code.

ggyurov@reddit

turbo4 worse than q4 ??? Lol, and TurboQuant is advertised as quality saver.

Anbeeld@reddit (OP)

As described in the article, it's because q4 is not some old-ass dinosaur but basically TurboQuant-lite, since llama.cpp added rotation to the base types too after TQ got traction. It ends up being even better than turbo4 except in some edge cases. Also, honestly, both are quite shit the moment you want to do some tool calls at high context and everything falls apart.

Mountain_Patience231@reddit

but your chart shows Q8 is the best of the rest

Anbeeld@reddit (OP)

Yeah, and bf16 is even better than q8! But neither fits into my RTX 3090 at 100k with a decent model quant and speculative decoding. And even if they did, I could do 200k with cheaper quants instead.

And that's in 24 GB. If you have 16 GB, like many many many people here do, q8 is just a luxury, at least with dense models where you can't offload MoE into RAM.
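
The trade here is simple arithmetic: for a fixed cache budget, the context you can fit scales inversely with the Size column. A rough sketch with a hypothetical budget that holds 64k tokens at q8_0 / q8_0; it ignores compute buffers and anything else that grows with context.

```python
# Size column (% of bf16) for a few pairs, copied from the table in the post.
SIZE_PCT = {"q8_0": 53.1, "q8_0-q5_1": 45.3, "q6_0-q5_0": 37.5,
            "q5_0": 34.4, "q5_0-q4_1": 32.8, "q4_0": 28.1}

def context_for(pair, base_pair="q8_0", base_ctx=65536):
    """Tokens that fit in the cache budget that holds `base_ctx` tokens at `base_pair`."""
    return round(base_ctx * SIZE_PCT[base_pair] / SIZE_PCT[pair])

for pair in SIZE_PCT:
    print(f"{pair:>10}: ~{context_for(pair) // 1024}k tokens")
```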

Mountain_Patience231@reddit

I think I will try q6_0, it's looking good on your data

Anbeeld@reddit (OP)

Keep in mind it's not in mainline llama.cpp, you need ik_llama.cpp (they also have a ton of other quants)... or BeeLlama v0.3.0 which is WIP at the moment. :P