Qwen 3.5 27B - quantize KV cache or not?

AppealSame4367@reddit

Rather not, or only slightly. The Qwen3.5 architecture is very sensitive to KV cache quantization. You should stay at bf16 or at most go down to q8\_0. Also, at least in llama.cpp with CUDA on Linux, mixed KV cache quantizations aren't allowed -> seg fault.

voyager256@reddit

Any reliable tests that show Q8 or FP8 cache causing significant degradation? From my limited experience, and even more from what I've seen, it seems worth it.

AppealSame4367@reddit

There are some newer tests that show that even Qwen3.6 doesn't reach the same intelligence in complex benchmarks and real usage tests anymore when going to Q4 for KV cache or certain combinations of turboquants. q8 seems to be fine for Qwen3.6.

voyager256@reddit

Yeah, that's been commonly known for below Q8. I believe no one argues that something like Q4 for KV cache is not significantly degrading quality, as it's obvious even with simple tests. Turbocache implementations are not much better either (at least so far). My question was specifically regarding Q8 or FP8 for the cache.

AppealSame4367@reddit

Things have changed since Qwen3.5

voyager256@reddit

I know things are moving very fast in the LLM space. But I think regarding cache quantization not much has changed: Q8 and FP8 seem really worth it if you are VRAM constrained and are perfectly fine for most practical purposes, and Q4 is shit (including TurboQuant variants).

Delicious_Box_9823@reddit

turboderp directly told me that qwen 3.5 is less affected by cache quantization in terms of quality because, quoting: "3/4 of the layers are linear attention and don't use a cache at all". So i'm pretty sure you can go to as low as q4 and be fine.

Adventurous-Gold6413@reddit

For me with the 27B, I either go 12k context with bf16 or 20k context with q8_0 cache, but the problem is that it's the q3_km unsloth quant. Do you personally think Q3s are still usable?
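
A back-of-the-envelope sketch of the trade-off described here (12k context at bf16 versus roughly 20k at q8_0), in Python. The per-element sizes follow ggml's block layouts (q8_0 packs 32 values into 34 bytes, q4_0 into 18 bytes). The layer, head, and dimension numbers in the example call are made-up placeholders, not the real Qwen3.5 27B config, and only full-attention layers should be counted, since the linear-attention layers keep no per-token cache. Weights and compute buffers are ignored, so treat the result as an upper bound on what you gain.

```python
# Approximate bytes per cached element for some llama.cpp cache types.
BYTES_PER_ELEMENT = {
    "f32": 4.0,
    "f16": 2.0,
    "bf16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one fp16 scale per block
    "q4_1": 20 / 32,  # 32 4-bit values + fp16 scale + fp16 min per block
    "q4_0": 18 / 32,  # 32 4-bit values + one fp16 scale per block
}


def kv_cache_bytes(n_ctx, n_attn_layers, n_kv_heads, head_dim, cache_type):
    """Size of the K and V caches for the full-attention layers only."""
    elements_per_token = 2 * n_attn_layers * n_kv_heads * head_dim  # K and V
    return n_ctx * elements_per_token * BYTES_PER_ELEMENT[cache_type]


def context_for_same_budget(n_ctx, old_type, new_type):
    """Context length that fits in the same KV memory after switching types."""
    ratio = BYTES_PER_ELEMENT[old_type] / BYTES_PER_ELEMENT[new_type]
    return int(n_ctx * ratio)


# Placeholder shape (hypothetical, NOT the real Qwen3.5 27B values).
size = kv_cache_bytes(20_000, n_attn_layers=16, n_kv_heads=4, head_dim=256,
                      cache_type="q8_0")
print(f"KV cache at 20k ctx, q8_0: {size / 2**20:.0f} MiB")
print("12k at bf16 ->", context_for_same_budget(12_000, "bf16", "q8_0"),
      "tokens at q8_0")
```

The same KV budget that holds 12k tokens at bf16 holds about 22.5k at q8_0 on paper, which is in the same ballpark as the 20k reported above once other buffers take their share.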

Prudent-Ad4509@reddit

UD-IQ3\_XXS works great for me in opencode. Leaps and bounds over 35B. I run it with the default cache quant in llama-server (f16). I've tried BF16 as others have recommended but ran into issues. Could be me, could be llama-server, I'll get back to investigating when I have a reason to.

Adventurous-Gold6413@reddit

Do you think UD_IQ3xxs is better than q3_km? I only got 16gb vram

DragonfruitIll660@reddit

At 16GB of VRAM try running an IQ4XS. I think it should be able to fit 20k context at BF16. Had good luck with it so far.

Adventurous-Gold6413@reddit

Hmm.. I might try it, but iq4xs is already 15gb without vision. So idk how the hell 20k context should fit within the 16gb too.

Prudent-Ad4509@reddit

At such a low quant level any UD should be better than a comparable non-UD. But depending on the speed you need, you might want to use a higher quant, since you are offloading a lot into RAM anyway. This depends on your RAM+VRAM limit.

Mart-McUH@reddit

Not really. That holds somewhat for MoE, though other people like AesSedai also make smart dynamic MoE quants. For dense, there is no special magic with UD compared to, say, bartowski quants, which many people even find better/more stable. IMO it is just a matter of taste, except for some special cases where 4-bit quants from Unsloth were bad, I think due to adding some FP4 layers or something. But I think UD3 did not have this problem.

Prudent-Ad4509@reddit

Hand-crafted quants made by people who optimize and test them specifically on a case-by-case basis are playing in the same category as UD quants. You win some, you lose some by choosing between them, depending on what they were optimized for. Also, unsloth quants had plenty of issues with Qwen3.5 themselves, same as with Qwen3/Next, but they seem to be sorted out by now. So, UD is a safe bet, while a default generic auto quant (as well as old UD quants) is a losing bet. Everyone else's quants can be better or worse for a particular purpose.

grumd@reddit

From my testing with Aider benchmark, 27B IQ3_XXS scored a bit lower than 35B. But 27B IQ4_XS scored higher than 35B. Those benchmarks have variance though, so idk

Prudent-Ad4509@reddit

I was in a bit of a hurry between two threads. That quant I mentioned was for 122b. I would not go lower than any flavor of 4 for 27b. So, some versions of q3 or even lower are usable for larger models. The answer remains Q8 for cache first and then to look for ways to increase VRAM (or get different hardware). 96-128gb seems to be a sweet spot for a small local LLM right now.

AppealSame4367@reddit

I have to do the same and I think the results for such short context are still quite good, even at q3. Then again, it depends on what you do. Agentic needs 60k-90k+ context, so I assume you just chat with it, and in that case you could be better off with 4B and a better quant, plus a better KV quant at around 20k context. Would be faster, too. Sometimes, for fun, I run 27b or 35b on my laptop and watch it crawl at 1-3 tps, but it's still nice to know such a thing could run on it. (laptop has 32gb ram, 6gb vram)

heislera763@reddit

I think I ran into this before but if you build with GGML_CUDA_FA_ALL_QUANTS=1 you can do mixed quants, it makes build times a bit longer though

mp3m4k3r@reddit

I found that adding this, and also limiting the build to whatever CUDA compute capability your card actually supports with CMAKE_CUDA_ARCHITECTURES, still saved time for me overall, since before that it was compiling all of the CUDA architectures and all the KV quants.
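
As a sketch of the two build options mentioned in this exchange, here is a small Python helper that assembles the CMake commands rather than running them. `GGML_CUDA`, `GGML_CUDA_FA_ALL_QUANTS`, and `CMAKE_CUDA_ARCHITECTURES` are real options; the helper names, the `build` directory, and the architecture value `86` (compute capability 8.6, e.g. RTX 30-series cards) are illustrative assumptions, so substitute your own card's value.

```python
import shlex


def cmake_configure_args(cuda_arch, all_fa_quants=True, build_dir="build"):
    """Build the argv for configuring a CUDA build of llama.cpp."""
    args = [
        "cmake", "-B", build_dir,
        "-DGGML_CUDA=ON",
        # Compile kernels only for the architecture you actually own.
        f"-DCMAKE_CUDA_ARCHITECTURES={cuda_arch}",
    ]
    if all_fa_quants:
        # Compile flash-attention kernels for all K/V cache type combinations.
        args.append("-DGGML_CUDA_FA_ALL_QUANTS=ON")
    return args


def cmake_build_args(build_dir="build"):
    """Build the argv for compiling the configured tree."""
    return ["cmake", "--build", build_dir, "--config", "Release", "-j"]


# Print the commands; run them from the llama.cpp source directory.
print(shlex.join(cmake_configure_args("86")))
print(shlex.join(cmake_build_args()))
```

CMake treats `=1` and `=ON` the same way, so this matches the `GGML_CUDA_FA_ALL_QUANTS=1` spelling used above.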

AppealSame4367@reddit

Thx for the hint, will test soon!

dinerburgeryum@reddit

Yep, beat me to it. The hybrid architecture really matters during these kinds of decisions. Don’t touch K-cache. V-cache no less than 8-bits.
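
One way to express that rule of thumb (K left at 16-bit, V at 8-bit) is through llama-server's `--cache-type-k` and `--cache-type-v` options. The Python sketch below only assembles the argument list; the helper name and `model.gguf` path are hypothetical, and per the comments above a mixed K/V combination on CUDA may need a build with `GGML_CUDA_FA_ALL_QUANTS` enabled.

```python
def llama_server_args(model_path, ctx_size, cache_type_k="f16", cache_type_v="q8_0"):
    """Build the argv for llama-server with explicit K and V cache types."""
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        "--cache-type-k", cache_type_k,  # keys: left unquantized here
        "--cache-type-v", cache_type_v,  # values: 8-bit at the lowest
    ]


# "model.gguf" is a placeholder path.
print(" ".join(llama_server_args("model.gguf", 20_000)))
```

Passing the same type for both arguments gives the matched setup (for example `q8_0`/`q8_0`) that others in the thread run.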

Lissanro@reddit

Q8 cache may cause it to go into thinking loops more often, or to make mistakes it otherwise rarely makes. You can still try it and see if it works for your use case, but you will most likely have a better experience going with a Q5 quant with 16-bit cache instead of a Q6 quant with Q8 cache. Q4 cache is obvious brain damage, but again, you can test it yourself in your specific use cases. I recommend testing against a lower quant with 16-bit cache so you can see the difference and decide what is better based on your actual experience.

voyager256@reddit

Are there any measurements for this? I mean Q8 cache causing more thinking loops?

Lissanro@reddit

Please note that my comment was written months ago, and things have changed drastically since then. If you are using a recent llama.cpp, I am assuming you are rotating your activations to reduce quality loss from quantizing the cache, so you will be getting better results than at the time I wrote my comment.

It is important to understand that quantization of the model itself and of its cache are completely separate from each other. Also, not all models have quants higher than Q4; for example, for Kimi K2.6, Q4\_X is the highest possible precision since it is the one that most accurately preserves the original INT4 quality in the GGUF format.

The best possible quality for cache is F32, followed by BF16, and then F16 and lower. F16 works very well in most cases, even though there were rare exceptions where BF16 or F32 worked better. Generally there is no further improvement beyond F16 with most models, and F16 is also well supported on most GPUs (unlike BF16, for example); I guess this is why F16 is the default. I am still using Kimi K2.6 with F16 cache because not only does it reduce the possibility of quantization errors, it also gives me a bit more performance than if I put more layers on GPUs.

As for measurements with different cache quantization levels, the latest up-to-date measurements can be found in the patch that introduced the rotation: [https://github.com/ggml-org/llama.cpp/pull/21038](https://github.com/ggml-org/llama.cpp/pull/21038) (which means that any tests for different quantization levels that predate the merge of this patch are out of date).

voyager256@reddit

Yeah, I knew about most of that, except specifically the slight improvement for Q8 thanks to the PR you linked that was recently merged into llama.cpp. Plus I don't run Kimi 2.6 no matter the quant :) I know there are some slight differences between Q8 and FP16/BF16, but my question was specifically regarding the thinking loops and such. Because if confirmed, that would probably be enough to convince many people not to use Qwen3.5 or 3.6 27B with Q8 or FP8 cache. Again, from what I saw in practice (but also in simple tests), it really seems like it's better to go with higher than Q4 for the model and use Q8 for the cache.

Lissanro@reddit

In the past, before rotation of activations was merged, I saw multiple people reporting issues with thinking loops even with f16 cache in Qwen3.5 models, with Q8_0 cache being worse, while confirming that bf16 was working fine. One of the most detailed reports that I saw at the time, with multiple cache quantizations tested, was this one: [https://www.reddit.com/r/LocalLLaMA/comments/1rii2pd/comment/o865qxw/](https://www.reddit.com/r/LocalLLaMA/comments/1rii2pd/comment/o865qxw/)

>With the Qwen3.5 models its extremely important to use bf16 for the kv cache.... (especially in thinking mode) i strugled in the start too... but after changeing the k cache to bf16 and the v cache to bf16 and using the unsloth dynamic q4\_k\_xl quants they are absolutely amazing....
>
>update: kv cache settings i tested where
>
>f16 == falls into a loop very very very often
>
>bf16 == works pretty well 99% of the time
>
>q8\_0 == nearly always loops in long thinking tasks
>
>q4\_1 == always loops
>
>q4\_0 == not useable, model gets dumb

Also, when using cache quantization, context length matters: with a small context length you are less likely to encounter issues than if you go to a 200K+ token context.
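
For anyone who wants to put numbers on "falls into a loop" instead of eyeballing it, here is a minimal heuristic in Python. It is a hypothetical checker, not something from the linked report or from llama.cpp: it flags an output whose tail is one chunk of tokens repeated several times in a row. The period and repeat thresholds are arbitrary assumptions you would need to tune.

```python
def has_repetition_loop(tokens, max_period=50, min_repeats=4):
    """Return True if the sequence ends with one chunk repeated min_repeats times."""
    n = len(tokens)
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if needed > n:
            break  # longer periods cannot fit either
        tail = tokens[n - needed:]
        chunk = tail[:period]
        if all(tail[i] == chunk[i % period] for i in range(needed)):
            return True
    return False


# Whitespace-split words stand in for real tokens in this toy example.
looping = ("wait let me check " * 6).split()
normal = "the quick brown fox jumps over the lazy dog".split()
print(has_repetition_loop(looping), has_repetition_loop(normal))
```

Running the same prompts against each cache type and counting how many outputs trip the check would give a crude loop rate to compare, keeping in mind that sampling settings affect looping too.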

Spicy_mch4ggis@reddit (OP)

Cheers, yea I thought kv cache quantization was bad but gemini kept trying to gaslight me lol

My_Unbiased_Opinion@reddit

Q8 all day! I am using IQ4XS with Q8 KVcache with like 190k context. It's insanely good.

ambient_temp_xeno@reddit

Was the "use bf16 instead of fp16 KV cache" thing for Qwen 3.5 real?

mp3m4k3r@reddit

Llama.cpp will default to f16 if not told otherwise, bf16 on my ampere card performs worse than f16

ambient_temp_xeno@reddit

It might turn out that bf16 is better for the mmproj. I guess I will just have to get both and test.

mp3m4k3r@reddit

Does your GPU support bf16? I've been running just f16 as the quant for the mmproj itself, though I haven't attempted to mess with the KV cache for it since it's fairly secondary for me.

ambient_temp_xeno@reddit

I don't believe so. I have 3060s. I'm led to believe that for CUDA, llamacpp doesn't support flash attention with bf16 at all, regardless of card.

ambient_temp_xeno@reddit

As far as I can work out it was someone's incorrect testing that made it appear to work better, but of course in 2026 people spread headlines at the speed of clickbait and they persist in search results.

mp3m4k3r@reddit

I run almost all of my models at q8_0 and have played with those values a bit. I have seen 27B do repetition more than 9B or 35B, but this was resolved by making sure to use the right settings for the rest of the model from the model card. The only time I move back to f16 (bf16 is slower on my Ampere cards) is for embeddings. I have also tried mixing values, q8_0 (K) and q4_0 (V) for example, and it definitely seemed to degrade the output much further than locking them to the same quant, for whatever reason, if you do want to experiment.

ClearApartment2627@reddit

A previous comment by u/dinerburgeryum sums up the relevant info very well: [https://www.reddit.com/r/LocalLLaMA/comments/1q97081/comment/nyt7vc8/?utm\_source=share&utm\_medium=web3x&utm\_name=web3xcss&utm\_term=1&utm\_content=share\_button](https://www.reddit.com/r/LocalLLaMA/comments/1q97081/comment/nyt7vc8/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button) In short, you would want a server that applies hadamard rotation to k-values at least, and you can get that from ik\_llama.cpp or exllama3. That reduces the loss from quantization and makes the cache useable in q8.
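
To illustrate why a Hadamard rotation can reduce quantization loss, here is a toy Python sketch. It is not the ik\_llama.cpp, exllama3, or llama.cpp implementation: the block quantizer is a simplified symmetric 8-bit scheme with 32-element blocks, and the key vector is random data with one injected outlier (an assumption made to show the effect). The rotation spreads the outlier's energy across all channels, so no single block is forced onto a coarse scale.

```python
import math
import random

BLOCK = 32  # block size used by ggml's q8_0 / q4_0 formats


def hadamard(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = math.sqrt(n)
    return [v / scale for v in out]  # normalized, so it is its own inverse


def fake_quantize(vec, bits=8):
    """Round-trip through simplified symmetric per-block integer quantization."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for s in range(0, len(vec), BLOCK):
        block = vec[s:s + BLOCK]
        amax = max(abs(v) for v in block)
        step = amax / qmax if amax > 0 else 1.0
        out.extend(round(v / step) * step for v in block)
    return out


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


rng = random.Random(0)
key = [rng.gauss(0.0, 1.0) for _ in range(128)]
key[5] = 100.0  # one injected outlier channel (assumption for the demo)

plain_err = squared_error(key, fake_quantize(key))
# Rotate, quantize, then rotate back to compare in the original space.
rotated_err = squared_error(key, hadamard(fake_quantize(hadamard(key))))
print(f"plain: {plain_err:.4f}  rotated: {rotated_err:.4f}")
```

Because the rotation is orthogonal, applying the same one to queries and keys leaves their dot products unchanged, which is why attention scores survive the transform.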

ambient_temp_xeno@reddit

I think they only recommend such a high context window to avoid running out. I can't see any mechanism where it would affect the quality of the responses as long as they fit in whatever lower context you give it.

Spicy_mch4ggis@reddit (OP)

Thanks! I took their information at face value, but through use 80k context seems fine. I would optimize if I had a use case like a large code repo with more multi-file work, but as of now I don't need a larger context window unless the model's performance is being limited without me knowing.

TKristof@reddit

I've been using it at q8 KV cache for a while now and I don't really see any degradation compared to bf16 tbh. I don't really use it for code generation much though. I mostly use it to review my commits before pushing (in opencode) or for chatting (in openweb ui). Never seen any tool call fails so far, even at 80-100k context.
