KV cache quantization: ignorance, or malice?
Posted by wombweed@reddit | LocalLLaMA | 92 comments
I run Qwen-3.6 27B FP8 on vllm for long-horizon agentic coding harness workloads with a high context window and concurrent sub-agents. On two 3090s that aren’t used for anything else, it seems reasonable to expect a good balance between speed and reliability. I want to bring up a particular point of contention regarding this optimization process. I have an extensive software engineering background but am relatively new to this, so feel free to correct me if I’m not on the right track.
It seems like conventional wisdom is that you shouldn’t quantize the KV cache. In my experience, with my specific workloads, that remains true: at q8, I see many subtle mistakes, tool-calling issues, and just plain bad reasoning. Performance is dramatically better when I pin it at 16-bit.
So with that in mind, why do I keep seeing people gesturing at this like it’s a serious solution? I guess I can see it if it’s just low-stakes chatbot stuff. But why would anyone run anything serious at anything less than full-size KV? I keep seeing stuff about turboquant as well and haven’t tried it, but from what I understood, it seems like it comes with an intelligence hit too.
So am I understanding correctly?
Gesha24@reddit
I am convinced the majority of people are not running local AI for any kind of serious work; it's mostly for fun, so accuracy is irrelevant to them. Once you realize accuracy matters, you have to set up the system differently and you aren't getting those fancy tokens per second anymore.
Another problem: accuracy is hard to measure. Unlike tokens per second, it requires some kind of smarter benchmark. And many if not all of them aren't doing a good job capturing reality.
BannedGoNext@reddit
On 35b that means I could have an agent determine a gap in testing and build a plan, another agent code a test and test it, then an agent do a dialectical pass, and another agent resolve the problems found on the dialectical pass.
All that in the time 27b would have found a gap.
Maybe the gap would have been a better gap, but wall time matters too!
havnar-@reddit
Waiting 15 minutes for qwen to build something vs waiting 2 minutes and then going back and forth for the whole afternoon to get anywhere
That’s my benchmark
Gesha24@reddit
The sad state of affairs is - you can't do this even with Opus anymore. Just yesterday I was having it write me some MCP server. Overall design - fine, no problems. Code is written, tests are passing. But lots of small details were hallucinated. I had Qwen test real use cases and fix problems.
My personal preference: I choose the model based on the harness I am using. If I am working on a project I started with claude code, I will take the 35B MoE qwen; it has the context and the performance to keep going. Yes, it will forget a few things here and there, but mostly it's totally workable. If I started a project with pi, I am definitely using the higher-quant version of 27B, otherwise the damn thing forgets my instructions and there are problems.
crantob@reddit
Greetings Gesha24. I have a medical condition that makes me highly allergic to bloated library dependencies. Do you have any comments as to whether *pi* is a usable harness for a coder model to do testing and iterate (with vision)?
Gesha24@reddit
Most of the harnesses are useable. Whether you like it or not - that's a whole different story. Personally - yes, I do like it
a_beautiful_rhind@reddit
It depends on the model and backend too. Dense is less affected as a general rule.
New models are much more cooked than ones from a year ago, when hardly anyone talked about KV quanting being super terrible.
Look at how poorly ones like GPT-OSS, or even Gemma, work out of distribution. Minor inaccuracies are bowling them over. We're also taking their claimed context lengths at face value, while in the past people questioned/tested performance at 16-32-64k. At the same time, coding and agentic work require that context.
brahh85@reddit
talking about models and how they resist cache quantization https://www.reddit.com/r/LocalLLaMA/comments/1suh3sz/gemma_4_and_qwen_36_with_q8_0_and_q4_0_kv_cache/
a_beautiful_rhind@reddit
It's actually a good example. Qwen, decent performance even on Q4. Gemma, super cooked, 4k PPL on the IT version, even sees some degradation at Q8.
Gesha24@reddit
I have actually not tested it. It's considered that Qwen's 35B MoE model is about as smart as the 9B model. If we compare those 2 - is MoE really affected more? Or did people compare 27B and 35B and come to this conclusion (which is quite far removed from reality, because you need very different gear to get solid performance from 27B compared to 35B)?
a_beautiful_rhind@reddit
You are supposed to compare the same model with quantized cache and without.
Ok-Measurement-1575@reddit
I've burned so much electricity testing for accuracy. It's sort of a compulsion at this point.
rpkarma@reddit
Most here “measure” accuracy based on vibes lol
ilintar@reddit
Yeah and then when you point them to actual benchmarks they still know better.
rpkarma@reddit
To be fair that goes both ways. VLLM’s paper on FP8 KV cache linked below is quite interesting.
sine120@reddit
Many of us don't have a point of comparison in the same model. For me, IQ3 is the best I can use, so it's the best that I've seen. Any other models at higher quantization I can compare to will be fewer parameters.
draconic_tongue@reddit
Georgi said to run this because it shows different performance results compared to perplexity/KLD comparisons https://github.com/ggml-org/llama.cpp/pull/21152 and he was right: despite KLD and perplexity being lower on q4 KV compared to q8 or fp16, they all got the same results other than some token count differences.
Other than that, idk what benchmarks there could be; the best benchmark is your actual use case.
Gesha24@reddit
Yes, it's hard. One of the things I have noticed when working more with pi - my MoE models tend to forget strict early instructions, while dense tend to remember them better. Meaning that if in the beginning of sessions I say "ask me before any file edit", it is more likely that by the end of session it will do an edit without explicitly asking for permission. But on the other hand - it could be something wrong with the particular MoE model I am using.
Maybe I should write some code to be able to test it and see how it performs...
pkmxtw@reddit
I always see those posts about tightly fitting an IQ3 quant of a 30% REAP model with Q4 KV cache on a 16GB/24GB card to squeeze out some 36K context so their tg speed goes brrr, and it's more of an exercise in futility than actually being practical. At that point you are barely running the same model as it is intended to be.
-dysangel-@reddit
I think it's because most setups don't have a lot of VRAM, so people are just constantly looking for ways to squeeze in the biggest model and context that they can. Like the rest of us! But I agree - quantising weights seems to be far less destructive than quantising the KV cache. If I were to draw a silly analogy, I feel like moderately quantising the weights is probably like giving the model a lack of sleep or a bad headache, while quantising the KV cache will be more like giving them a degenerative brain disease.
droptableadventures@reddit
That's definitely a factor - if you can't run it, that's a 0% success rate on your task.
CalligrapherFar7833@reddit
It's also a 0% success rate if it produces garbage nonsense due to a quantized KV cache
droptableadventures@reddit
Not exactly 0%, it could happen to randomly output the right answer.
CalligrapherFar7833@reddit
If I give you a pizza and tell you it's not exactly 0% filled with shit, it could randomly be filled with shit, will you eat it?
-dysangel-@reddit
Have you ever trained a neural net? From the way you're thinking/talking about this, I've got a guess.
CalligrapherFar7833@reddit
Dude, I get that models are nondeterministic. I'm saying that your reply didn't make sense.
-dysangel-@reddit
it wasn't my reply
ilintar@reddit
On llama.cpp Q8 KV quant is almost lossless, as shown by multiple benchmarks (Gemma 4, by comparison, due to its iSWA architecture, is apparently much more sensitive to KV cache quantization).
VotZeFuk@reddit
(stating the obvious) Almost lossless isn't lossless, isn't that the issue here?
Nyghtbynger@reddit
If your context is bigger you'll also see less context degradation near the end of the window.
hellomistershifty@reddit
It would make sense to me if the AI was actually some perfectly accurate oracle. If the real difference is like 64% vs 65% accurate, then what's the point?
SteppenAxolotl@reddit
it's a trade off, some set their bar higher than others
darkwalker247@reddit
perhaps there's a way to efficiently identify which KV pairs are safest to quantize and quantize only those for a mixed precision KV cache?
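Something like this toy numpy sketch, maybe - purely illustrative, the ranking heuristic and the function are made up, and no existing backend works this way:

```python
# Toy sketch of a mixed-precision KV idea: keep the channels with the widest
# dynamic range (the ones that quantize worst) in fp16 and quantize the rest
# to int8. Hypothetical heuristic, not how any inference engine actually does it.
import numpy as np

def mixed_precision_k(k_cache: np.ndarray, keep_frac: float = 0.05):
    """k_cache: [tokens, head_dim] slice of the K cache for one head."""
    # Rank channels by absolute range; wide-range channels are the risky ones.
    chan_range = k_cache.max(axis=0) - k_cache.min(axis=0)
    n_keep = max(1, int(keep_frac * k_cache.shape[1]))
    keep_idx = np.sort(np.argsort(chan_range)[-n_keep:])        # stays fp16
    quant_idx = np.setdiff1d(np.arange(k_cache.shape[1]), keep_idx)

    # Per-channel symmetric int8 quantization for the "safe" channels.
    safe = k_cache[:, quant_idx]
    scale = np.abs(safe).max(axis=0) / 127.0 + 1e-12
    q_int8 = np.round(safe / scale).astype(np.int8)
    return keep_idx, k_cache[:, keep_idx].astype(np.float16), quant_idx, q_int8, scale

# Quick check on synthetic data with a couple of fake outlier channels:
rng = np.random.default_rng(0)
k = rng.normal(size=(1024, 128)).astype(np.float32)
k[:, [3, 17]] *= 20.0
keep_idx, *_ = mixed_precision_k(k)
print("channels kept in fp16:", keep_idx.tolist())
```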
Gesha24@reddit
Are those the same benchmarks that put Qwen3.6 on par with Opus4.6? Because there are some like that.
ilintar@reddit
No, there were benchmarks even here on Localllama where people ran both against some common benchmarks like HumanEval.
wbulot@reddit
I have been running the model with Q8 KV for a few days now. I have done a lot of work with it, and I cannot notice any quality difference.
qudat@reddit
Ya I run 26B Q4 with Q8 kv and it works great
dinerburgeryum@reddit
Yeah, Gemma is the fly in the ointment here. Wish they had kept the option to disable quantization on the SWA cache; feels like it was a bit of a miss to remove it.
ikkiho@reddit
A few things in this thread are getting blurred together and I think that explains the conflicting results.
First, fp8 in vllm and Q8 in current llama.cpp are not the same operation. fp8 (E4M3) is a per-tensor float format with one global calibration and no rotation. Q8 in llama.cpp now applies a per-group integer scale plus a Hadamard rotation on K (and as of recently, V too), which makes the quantized distribution roughly Gaussian and bounds the worst-case per-element error. So when ilintar says "almost lossless" on llama.cpp Q8 and you say fp8 on vllm collapses your agent, you can both be right. Different operations, different error tails. The thread keeps comparing them as if "kv quant" is one thing.
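To make the per-tensor vs per-group distinction concrete, here's a rough numpy sketch. It's illustrative only: neither the real fp8 kernel nor the real Q8_0 kernel, and I'm using int8 rounding on both sides to keep it simple.

```python
# Illustrative only: one global (per-tensor) scale vs per-group scales on a
# vector with a few large outliers. Not the actual fp8/Q8_0 kernels, just the
# shape of the problem: a global scale is dominated by the outliers.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4096).astype(np.float32)
x[::512] *= 50.0                          # a handful of outlier values

def int8_roundtrip(v, scale):
    return np.clip(np.round(v / scale), -127, 127) * scale

# Per-tensor: a single scale must cover the outliers, so the normal values
# land on only a few quantization levels.
per_tensor = int8_roundtrip(x, np.abs(x).max() / 127.0)

# Per-group (groups of 32): each group gets its own scale, so the error is
# bounded by the local maximum, not the global one.
groups = x.reshape(-1, 32)
scales = np.abs(groups).max(axis=1, keepdims=True) / 127.0
per_group = int8_roundtrip(groups, scales).reshape(-1)

print("per-tensor RMSE:", np.sqrt(np.mean((x - per_tensor) ** 2)))
print("per-group  RMSE:", np.sqrt(np.mean((x - per_group) ** 2)))
```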
Second, K and V are asymmetric. K has heavy per-channel outliers, especially in the RoPE channels, because the rotary frequencies pile up energy on a small number of dimensions. V is much better behaved. Naive fp8 squashes every channel into one scale, which clips those K outliers, and that is exactly the failure mode you describe: tool-call tokens are precision-critical (one wrong character breaks JSON), and attention scores propagate through softmax where small score deltas blow up after exp. Long-horizon agentic loops are the worst case because the error accumulates over thousands of steps and cross-step KV stays in the cache.
Third, this is also why TurboQuant and the QuaRot / SpinQuant line of work look dramatic. They rotate first (Walsh-Hadamard or learned orthogonal), which provably flattens the per-channel max, then quantize. The quant noise becomes near-uniform instead of having catastrophic tails. Naive fp8 has neither rotation nor per-group calibration.
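And a toy demo of the rotate-then-quantize effect. Again just a sketch, not the actual QuaRot/SpinQuant/TurboQuant code:

```python
# Toy demo: an orthonormal Walsh-Hadamard rotation spreads per-channel outliers
# across all channels, so a simple symmetric int8 quantizer sees a much flatter
# distribution and the worst-case error shrinks. Illustrative only.
import numpy as np

def hadamard(n):
    """Sylvester construction; n must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)                 # orthonormal, so exactly invertible

rng = np.random.default_rng(0)
k = rng.normal(size=(1024, 128)).astype(np.float32)
k[:, [3, 17]] *= 20.0                     # fake RoPE-style outlier channels

H = hadamard(128)

def int8_roundtrip(v):
    scale = np.abs(v).max() / 127.0
    return np.round(v / scale) * scale

err_plain = np.abs(k - int8_roundtrip(k)).max()
err_rot   = np.abs(k - int8_roundtrip(k @ H) @ H.T).max()   # rotate, quantize, un-rotate

print("worst-case abs error, quantize directly:", err_plain)
print("worst-case abs error, rotate first     :", err_rot)
```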
Practical read: for your workload (vllm, agentic coding, long context), fp8 KV is the wrong knob to test. Q8 with rotation in llama.cpp, or a TurboQuant-style implementation when vllm picks it up, should land much closer to bf16. Worth A/B before declaring all KV quant broken.
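If you want to A/B it, something as crude as the sketch below against two OpenAI-compatible endpoints is enough to catch the tool-call failure mode. The URLs, model name, and the single example tool are placeholders; assume one server runs bf16 KV and the other fp8.

```python
# Crude A/B sketch: the same tool-calling prompt against two local
# OpenAI-compatible servers, scoring only "do the tool-call arguments parse
# as JSON". Endpoints, model name, and the example tool are placeholders.
import json
import requests

ENDPOINTS = {
    "kv_bf16": "http://localhost:8000/v1/chat/completions",
    "kv_fp8": "http://localhost:8001/v1/chat/completions",
}
PAYLOAD = {
    "model": "qwen",  # placeholder model name
    "messages": [{"role": "user", "content": "Read the file src/main.py using the tool."}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "read_file",
            "parameters": {"type": "object",
                           "properties": {"path": {"type": "string"}},
                           "required": ["path"]},
        },
    }],
}

def tool_call_parses(url: str) -> bool:
    msg = requests.post(url, json=PAYLOAD, timeout=600).json()["choices"][0]["message"]
    for call in msg.get("tool_calls") or []:
        try:
            json.loads(call["function"]["arguments"])
            return True
        except (json.JSONDecodeError, KeyError):
            return False
    return False  # no tool call made at all

for name, url in ENDPOINTS.items():
    print(name, "tool call OK:", tool_call_parses(url))
```

In practice you'd loop this over a few hundred agentic prompts and compare the parse rates, not a single shot.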
Important_Quote_1180@reddit
That is why turbo quant got so much early play. KV kills the consumer hardware for most users. I’m actually really impressed with Autoround quantizations and how well TQ3 works. Just a single 3090 and 256k context with 40 toks is exactly what I needed to create spec work for CC.
Ok-Measurement-1575@reddit
Is that actually working at full context? What version of vllm etc? I was seeing oom @ 65k.
ambient_temp_xeno@reddit
Does vllm even have q8 kv cache quantization? If it's fp8 then that's way worse.
McSendo@reddit
They recently published a blog about FP8. https://vllm.ai/blog/fp8-kvcache
rpkarma@reddit
Which matches what I’ve seen. Hopper, like OP, has a hardware bug. Blackwell with a dense model is fine.
ambient_temp_xeno@reddit
This graph was quite interesting for qwen's context in general, regardless of kv quants. 128k seems fine and 256k quite usable.
McSendo@reddit
Yea, definitely something to keep in mind when doing any agentic work.
wombweed@reddit (OP)
Hey, my mistake, I think you’re right. The recommendation I had in mind is from this page https://github.com/noonghunna/qwen36-dual-3090 which says fp8 not q8. I didn’t realize fp8 is worse.
One-Replacement-37@reddit
This repository is based on the Genesis repository. Genesis does add support for TQ + MTP + Qwen3.6.
wombweed@reddit (OP)
Do you think I’d have better intelligence if I tried it with tq? I was disappointed with fp8 here and felt nervous about dropping from fp8 model to int4.
One-Replacement-37@reddit
Like I said. FP8 divides intelligence by 2. TQ only drops 1-2%.
ambient_temp_xeno@reddit
Yes fp8 is significantly worse than q8, especially in terms of being used in kv cache.
wombweed@reddit (OP)
Forgive my ignorance, I am interested in learning about the implementation difference between both and how significant the end result difference is. Do you know where I can find out more about that?
ambient_temp_xeno@reddit
This is the k quants pr: https://github.com/ggml-org/llama.cpp/pull/1684
k part of kv cache quanting pr:
https://github.com/ggml-org/llama.cpp/pull/4312
This image comparison shows the kind of difference visually compared to fp16
thirteen-bit@reddit
Q8_0 cache is rotated by default now, here are the PPL / KLD plots vs f16:
https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4146397570
wombweed@reddit (OP)
Really appreciate the links and visualization, thanks!
One-Replacement-37@reddit
Without TQ: the numbers are simply truncated, and we say “good luck” to the model. With TQ, the data is first encoded such that there’s virtually no information left in the bits that are to be truncated. That’s why TQ KV quant doesn’t impact quality much.
MmmmMorphine@reddit
Not anything good, that's for sure. Probably one of the biggest problems with it IMO. Which is a shame, as it's got truly fantastic other types of kv cache management orthogonal to basic quantization via LMCache.
Combining the two (quant and LMCache) would be fantastic.
I suppose sglang has more options with kv cache quantization and works with LMCache
Feel like Walsh-Hadamard (mainline ik_llama) quantization is the best, as currently implemented if not in theory. At least in my opinion.
One-Replacement-37@reddit
Yes it does, https://github.com/noonghunna/club-3090
a_beautiful_rhind@reddit
Nope. A half-assed implementation of FP8 is all it has.
shammyh@reddit
Uhhhh... Haven't all empirical benchmarks confirmed that fp8 KV quant is near identical to full bfloat16? At least for the Qwen 3.5/3.6 27b dense models.
So is there some data here? Or we all just replying based on vibes?
wbulot@reddit
I'm wondering the same thing. I do run the model with fp8 kv quant, and everything is working perfectly. Coding, tool calling, etc. I can't see any difference with the full version.
ThisGonBHard@reddit
What benchmarks?
In my tests, even Q8 was bad for normal use. For images, any KV quantisation just straight up lobotomized the model.
draconic_tongue@reddit
can you show tests/results?
Awwtifishal@reddit
Q8 KV cache is quite workable nowadays when used in conjunction with vector rotation (which is enabled by default in llama.cpp). People say there's barely any difference.
Anbeeld@reddit
TurboQuant is the answer.
dampflokfreund@reddit
Well according to current data, it is worse than current q4_0 in llama.cpp.
draconic_tongue@reddit
Not much worse, but not much better either. It's a sidegrade, within ±1-2%.
tenebreoscure@reddit
Maybe they are neither ignorant nor malicious; they know their use case doesn't need maximum precision and they are ok with the limitations. Or the models they use are less sensitive to KV cache quantization. Not everyone is a coder, and even coders do not always need a huge context, where errors accumulate and make the whole conversation collapse.
Agentic coding on long context is probably the most demanding task for an llm, where even two tabs instead of one can lead to collapse. And it only works thanks to compilers by the way, without them even fp16 KV cache wouldn't be enough.
Also every work can be serious, it depends on the use case and the context. Coding is not the only serious use case for AI.
wombweed@reddit (OP)
Not meant as an attack. It's my main use case, which I use as my gold standard, but I have a variety of applications for it; I'm well aware coding is not the only task it can be useful for.
n4pst3r3r@reddit
I am using Qwen3.6 27B Q4 something with q8 kv quant (because it fits in my 3090) for C++ programming in a reasonably well structured but fairly large (some 7k translation units) proprietary codebase, so not something easy like one-shotting python scripts. Harness is Mistral Vibe. The way I'm using it is not "Give it a vague description and then YOLO", but rather specify what changes I want in which file and it gives me a good approximation, often even something that works out of the box. But due diligence requires that I review every single line it wrote and clean up the code. No way around that, even if I'd be using frontier models. Then I request the next change.
This human in the loop approach makes it very important to have fast generation, otherwise I'd be spending more time waiting on it, and my time is expensive. If the quant makes it only 85% correct instead of 90%, it hardly makes a difference, because I have to touch it up anyway. And not even opus gets it right 100% of the time.
segmond@reddit
It's okay for chat. I never quant my KV, ever. I first noticed this 2 years ago while using an image model, and it dawned on me that logical and very fine-grained actions need every bit possible. As I often mention, quality of tokens beats quantity of tokens.
draconic_tongue@reddit
this is why I run my models at fp128
One-Replacement-37@reddit
Turboquant completely changed the equation.
def_not_jose@reddit
Turboquant didn't convince llama.cpp nerds enough to include it though
draconic_tongue@reddit
Ultimately it depends on what you're doing, and I doubt people spend enough time comparing and noting down results in any objective manner. For what it's worth, numbers don't really mean much, and the most testing I've done is AIME2025, on which Qwen 3.6 35a3b got the same results and took about the same time regardless of KV cache, which goes against any benchmark-number difference.
UncleRedz@reddit
I think there are too many moving parts to give any definitive answer. Are you using vLLM, llama.cpp, or ik_llama, and what versions of those? What model and what quant of that model? Llama.cpp is so fast moving that things can change in weeks.
Also what harness/frontend? Some are just worse with tool calls in general. Also how many tools have you made available to the model? Keeping it to a minimum has worked best for me.
I normally run with Q4-Q6 or Q8 for the model, depending on how it fits, and for kv cache either fp16 or Q8 if I need to squeeze in a bigger context. Mostly doing data processing and tool calls are normally not an issue with my use cases.
However I also have a practice of keeping the context length low; processing documents might temporarily grow to 60-100k tokens, but then I compact when that part of the process is done, or start a new session altogether. Avoiding having old noise in the context has had a bigger impact than any KV cache quant. For coding this would probably be similar to sizing tasks for a sub-agent and scoping them to keep the agent focused.
SnooPaintings8639@reddit
With my default setup, using Qwen 27b, I get around 60 tps at empty context window, there is very little difference between bf16 or Q8 for KV cache, when it comes to this value.
When I reach 200k context, I get over 30 tps with Q8 and sub 10 tps with BF16.
Long context tasks are important for such overthinking models as qwen, especially for any agentic usage. A single coding task, in non-interactive mode using pi, often crosses 100k tokens. If I add review iterations, it's hitting another 50k. I can't be doing that at 10 tps. I tried; it does not make sense.
The quality hit I have read about here is still something I have yet to notice. This model is still good enough for most tasks.
wombweed@reddit (OP)
Yeah someone else pointed out for me my negative experiences are with fp8 not q8, I hadn’t realized they are distinct
GreenPastures2845@reddit
Timeline:
Cache quanting is old functionality by now but it was always ill advised because of known accuracy degradation.
IK llamacpp has had Hadamard quantization for the K cache (the most sensitive of K and V) for a long time too, but it's an incremental improvement and not a night and day difference like Turboquant promised.
Since the Turboquant paper release (which is way more complex than simple K/V cache quanting or even Hadamard), there's been a lot of talk about cache quanting.
Mainline llamacpp then implemented Hadamard for both K/V, and IK llamacpp extended it for V as well; as of today, both only support up to Hadamard but NOT the full Turboquant yet. Apparently integrating it is non trivial.
The disconnect is that people rave about the Turboquant promised results, not existing implementations.
No_Hunter_7786@reddit
Fully agree. KV cache quantization is fine for casual chat but the moment you have tool calling or multi-step reasoning it falls apart fast. 16 bit KV is non-negotiable for anything agentic in my experience too.
Due-Function-4877@reddit
I think vibe coders running their agent on a potato feel significant pain from a mistake or a failed tool call. They don't know how to fix the errors (have no degree or experience) and retries from the agent take a long time for them.
If you have experience and a 5090, you'll be able to tolerate those things better than most users in the sub.
Daniel_H212@reddit
It depends on the model and depends on the task. Some models are more sensitive to it, and some tasks are precise enough that they don't tolerate it, like coding.
SteppenAxolotl@reddit
You have 24GB x 2 vram to play with. Your trade off mix will be very different than someone with 24GB or less.
Blues520@reddit
This is an interesting discussion. People have been chasing t/s lately but for coding especially, kv cache quantization decreases accuracy.
The tradeoff is lower context but the agentic workflows have been driving the higher context workload requirements. This is a good reminder to keep kv cache in check.
Lucerys1Velaryon@reddit
Interesting. I'm testing the exact same scenario for a VRAM constrained system (1x 7700 XT) but for a different Qwen model (3.6 35b-a3b). My sentiments mirror yours but I have no evidence to back them up. For long horizon tasks I feel like the model starts to degrade if the KV cache is quantized (failed tool calls, specifically making tool calls inside the reasoning block which causes it to return an invalid response, and getting stuck in reasoning loops), but I still have to do more testing.
Will be interesting to see what other people think about this.
comanderxv@reddit
I observed the same behaviour with quantized cache, e.g. f16/turbo3, q8/q8. With turboquant it got worse. Not measurable but noticeable. Sometimes it just stopped in the middle of a sentence, and when I told it to continue it just stopped again.
Since then, I do not quantize the KV cache any more. I tested a lot of models and they showed similar behaviour. But my daily driver, where this happened a lot, is cured now (Qwen3.6 35B A3B).
In the meanwhile I use the Q2 model for development and Q6 for feature planning. Works good so far without issues.
nickm_27@reddit
Depends on your exact constraints. For long context or very tight use cases like coding it matters a lot.
Using Gemma4 26B for voice assistant which involves tool calling and multi-step decision making, Q8 cache in llama.cpp has no penalty in actual usage.
One-Replacement-37@reddit
2x 3090 in TP=2 mode doesn't make it meaningfully faster. You'd want DP=2 instead for 2x inference speed. That means you need an INT4 model, and TQ enabled. This repo has recipes for Qwen 27B max context on single 3090s using 50+ vLLM patches: https://github.com/noonghunna/club-3090
stoppableDissolution@reddit
People toy around with oneshot benchmarks and yea, it does not matter for that. I don't know anyone using KV quantization for any kind of actual work.
Tiny_Arugula_5648@reddit
Anytime accuracy is necessary, like with coding, tool calling, and real-world business use cases, you need to avoid quantization. TL;DR: every token predicted will have a lower accuracy, which compounds with each new token generated. Chatbot users will barely notice that deviation, but a compiler or parsing engine absolutely will.
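Back-of-the-envelope, with made-up per-token accuracies (not measured numbers), the compounding looks like this:

```python
# Made-up per-token accuracies, just to show how a tiny per-token drop compounds
# over a long agentic generation. Real effects of KV quantization will differ.
for per_token_acc in (0.9995, 0.999, 0.995):
    for n_tokens in (500, 2000, 8000):
        p_clean = per_token_acc ** n_tokens
        print(f"acc/token {per_token_acc}: P(no bad token in {n_tokens}) = {p_clean:.3f}")
```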
StupidScaredSquirrel@reddit
Quantisation of the model itself hurts performance too. Ultimately, it's always a tradeoff problem. But if you already need to quantise to 4-bit to run a model, you won't mind quantising the KV to q8 for extra context and having, say, 64k instead of a 32k context that is way too limited.
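Rough arithmetic on why that matters. The layer/head numbers below are made-up example values, not the actual Qwen config, and q8 scale overhead is ignored:

```python
# Back-of-the-envelope KV cache sizing. Model dimensions are made-up example
# values -- substitute your model's config. Scale overhead of q8 is ignored.
def kv_cache_gib(context_len, n_layers=48, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # K and V, per layer, per KV head, per position
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1024**3

for ctx in (32_768, 65_536):
    print(f"{ctx:>6} ctx: fp16 ~{kv_cache_gib(ctx):.1f} GiB, "
          f"q8 ~{kv_cache_gib(ctx, bytes_per_elem=1):.1f} GiB")
```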
Prudent-Ad4509@reddit
Kv quantization works fine for things like creative writing and stuff. It breaks in edge cases, and it breaks when you need something exact like with coding.
I would still use it for analysis of a very large codebase if it allows me to have more context, but not for code changes.