Developers who use local AI - Q4_0 vs Q8_0 KV quant?
Posted by Jorlen@reddit | LocalLLaMA | 78 comments
I'd love to hear from developers who use big context windows: do you notice a difference?
Obviously I would love to cut the KV cache VRAM requirement in half, but I'm worried about quality especially when we enter into 50k+ context territory.
I don't really need a full study, just wondering, anecdotally, what people have experienced.
My current setup: Llama.cpp server, Vulkan, 32GB VRAM, using mostly Qwen 3.6 models for development. I go back and forth between the 27b dense and 35b MoE. With a dash of the lil guy (3.5 9B omnicoder variant) for smaller stuff since it's so zippy and uses a shite-ton less vram.
Stepfunction@reddit
The quality loss at Q4 is pretty severe. I'd recommend the Q5_1 option instead.
No-Setting8461@reddit
Have you tried q4 with attn-rot though? I never even considered anything below q8 before it was merged, but now I'm using q5_1/q4_0 for gemma 4 cus cache size bloats so much and it's been pretty OK.
Icy_Butterscotch6661@reddit
These really slow down generation
DinoAmino@reddit
Every optimization choice is a give and take. Given how much talk there is about tps speed benchmarks it would seem many here don't care as much about accurate responses.
Vusiwe@reddit
God save us all
sonicnerd14@reddit
From what I understand it depends. It will speed up generation speeds in some cases. So it's not like it's an exact science.
Stepfunction@reddit
Good call out. Quantized cache impacting generation speed is definitely another consideration. I mostly use LLMs for creative generations, so it's not something I pay much attention to.
suicidaleggroll@reddit
I use zero KV quantization. Even Q8 is too much for coding tasks IMO.
soyalemujica@reddit
This is odd. I have been rocking Q8 for complex C++ tasks in backend + frontend development, editor and game-server, and I have not once experienced any hallucinations, or even issues with numbers across a wide range of X-Y-Z coordinates in script files, even at over 100k context.
graypasser@reddit
It just depends on the harness itself: if the harness forces proper testing, it's unlikely to have problems in coding tasks.
MutantEggroll@reddit
Assuming you use llama.cpp - do you still find this to be true after `attn-rot` got merged? I used to be a hardline unquantized KV guy too, but I tried `q8_0` with `attn-rot` and I can't tell the difference in the coding tasks that I tend to give it (Python, PowerShell, Ansible).
suicidaleggroll@reddit
I haven’t tried since attn-rot was added, but my understanding of benchmarks at the time is that it only really helped for Q4 and had basically no effect at Q8.
MutantEggroll@reddit
It's definitely worth checking out. Though it did have a larger impact on lower-quant KV like `q4_0`, it did also improve `q8_0`. Check out the KLD charts that AesSedai posted in the original PR.
It's certainly model and use-case specific, but in my experience, I haven't noticed any capability loss - no reasoning loops, no tool call failures, etc. And I was able to double my context window, which really broadened the tasks I could hand to my models (mostly Qwen3.6-27B and Gemma4-31B).
sonicnerd14@reddit
It depends on the model. Some models can handle q8 kv quant rather well like qwen 27b. Q4 is definitely too much though.
Jorlen@reddit (OP)
Can you explain? I'm new to this (few weeks ago) so I don't quite get it. Does this mean it defaults to full FP16 quant for KV?
suicidaleggroll@reddit
Yes, the default is F16; anything less than that is quantized, to varying effect. While model weights can typically be quantized to Q8 without any appreciable loss in accuracy, and Q4 with only minimal loss, the same is not true for the KV cache.
arbv@reddit
FP16 is, technically, not a quant for KV. You may also consider BF16 for models trained in this precision (most recent ones are) if your hardware supports it (modern hardware does). KV cache quantisation is trading precision for VRAM space, even Q8_0.
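To put rough numbers on that trade, here is a minimal sketch that estimates KV cache size from the GGML block layouts (blocks of 32 elements: `q8_0` is 34 bytes per block, `q4_0` is 18, `q5_1` is 24, F16/BF16 are 2 bytes per element). The model dimensions in the example call are hypothetical placeholders, not the real shape of any Qwen or Gemma model, and the estimate assumes every layer keeps a full-attention KV cache, which hybrid architectures may not.

```python
# Rough KV cache size estimate. Block sizes follow the GGML formats;
# the model dimensions used below are illustrative assumptions only.
BLOCK_ELEMS = 32
BYTES_PER_BLOCK = {
    "f16": 64,   # 32 elements x 2 bytes, no scale
    "bf16": 64,  # 32 elements x 2 bytes, no scale
    "q8_0": 34,  # 32 int8 codes + one fp16 scale
    "q5_1": 24,  # 5-bit codes + fp16 scale and fp16 minimum
    "q4_0": 18,  # 16 bytes of packed 4-bit codes + one fp16 scale
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim,
                   type_k="f16", type_v="f16"):
    """Bytes needed to hold K and V for n_ctx tokens."""
    if head_dim % BLOCK_ELEMS != 0:
        raise ValueError("head_dim must be a multiple of the block size")
    # Element count of the K cache; the V cache is assumed to be the same shape.
    elems = n_ctx * n_layers * n_kv_heads * head_dim
    k_bytes = elems // BLOCK_ELEMS * BYTES_PER_BLOCK[type_k]
    v_bytes = elems // BLOCK_ELEMS * BYTES_PER_BLOCK[type_v]
    return k_bytes + v_bytes


if __name__ == "__main__":
    # Hypothetical model: 48 layers, 8 KV heads, head_dim 128, 64k context.
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(65536, 48, 8, 128, cache_type, cache_type)
        print(f"{cache_type}: {size / 2**30:.3f} GiB")
```

Swap in the layer count, KV head count and head dimension from your model's metadata to get a number that is closer to your own setup.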
Valuable_Touch5670@reddit
I second this. For me, it’s more about overhead. For some reason my TG drops quite a bit whenever KV quantization is enabled.
May be anecdotal to just my HW setup or llama-server settings…
jessez05@reddit
This
fasti-au@reddit
Turbo quant and dflash. Beellama
unjustifiably_angry@reddit
Q4 is unusable and Q8 is supposed to be nearly perfect but in my testing I find F16 more reliable. Might be placebo. Q4 is unusable though.
NigaTroubles@reddit
For me it's kinda usable up to 64k. That's my limit. Qwen3.6 35B A3B Q8, MTP.
Jorlen@reddit (OP)
What KV quantization do you have in your setup? Q4_0 vs Q8_0. Not model quant, but KV cache quant. For example:
From my docker yaml llama snippet:
NigaTroubles@reddit
k q8_0
v q4_0
Jorlen@reddit (OP)
Interesting, didn't know you could mix the two. I'll try this setup. I found this:
Icy_Butterscotch6661@reddit
It'll slow down generation
Jorlen@reddit (OP)
Not noticing much change. I'm on Q8 for K and Q4 for V, 250k context. I downloaded a smaller quant of the model I was using and it seems fine so far, but I haven't delved into the 150k context range yet; we'll see. I ballooned my current project on purpose with a huge number of files and .MD architecture documents, so a local agent would hit the context boundaries and I could find where things break down.
Model used now: Qwen3.6-35B-A3B-UD-Q4_K_XL. Q8K--Q4V quant at 250k context fills my VRAM to 25/32gb. Speed is excellent, but would likely suffer when using the 27b dense model of 3.6
Icy_Butterscotch6661@reddit
Interesting, that was based on me comparing things like 3 weeks ago on one of the qwen3.6 models. Using q8 and q4 together, or using q6(?) was resulting in a significant slowdown. Q8+q8 or q4+q4 was faster.
do011@reddit
I confirm, q8+q8 41t/s, becomes 6.5t/s for q8+q4 in agent (in webui it still shows as 41t/s).
Jorlen@reddit (OP)
Negligible change for me. Maybe 5 tok/sec difference. I'm a little over 100 tok/sec but it does slow down once I enter the 100k+ context realm, which I don't plan to do often outside of just testing / fucking around.
ea_man@reddit
You may have a broken llama pull, happened often in recent weeks.
jrodder@reddit
I fought this all day yesterday. At least with MTP, mixing the cache quant had me reverting to CPU. I ended up having to keep them both at Q8_0.
Alternative-Cat-1347@reddit
Why are you limiting it to 64k? in my case I often pass 200k and things are still fine. KV q8_0 for both
Operation_Neither@reddit
Whatever fits in VRAM
rpkarma@reddit
BF16 KV cache. Everything else has notable degradation of accuracy in all of my evals
FoxiPanda@reddit
Models at Q5/Q6/Q8 and KV cache at bf16 where I can keep a reasonable context size, Q8_0 where I can't.
superdariom@reddit
I have 24gb and run qwen 3.6 35b Q8 with full 256k context with no quantisation. You can run even faster than me I expect. I offload MoE to CPU until it fits and also use ubatch 4096 batch 3072
Rikers88@reddit
This is my go-to:
Beellama
Qwen3.6 27b UD q4 K xl
350k context
KV cache: K turbo4, V turbo3
DFlash: drafter model from spiritbuun Q8
It's working good for me on coding. If you want I can share the complete command I use to spawn the server.
To increase quality I would suggest to go Q8 on the K of the kv cache.
When I was running Q8 on the K of the cache, I had almost zero errors on tool usage with Cline as coding agent, while with this new setup it happens more often. Not a big deal, since Cline then retries.
5090 here
gazzamc@reddit
Would you mind sharing the command? I'm interested. Thanks.
Jorlen@reddit (OP)
Holy smokes. Talk about bleeding edge of a bleeding edge! I didn't know about this Beellama fork, it seems interesting!
Are you using the docker image or local install?
Rikers88@reddit
Nope, I'm on Linux and I compiled it from source. If you're on Windows I suggest compiling it in a Docker image instead.
Beellama is really good as it has speculative decoding, which is faster than the MTP they just merged in the main repo, and it also has turboquant, which is really good compression, better than the standard Q4. If you give the paper a read, you'll see that the algorithm is really smart.
If you can go fp16, no question. Q8 is OK, especially if you have native hardware for that, but if you have to go q4 or lower then turboquant is a must in my opinion.
PS if you use context longer than 256k, then use RoPE scaling for context extrapolation up to 1M.
RevolutionaryLime758@reddit
If you quant your cache you might be an idiot
laul_pogan@reddit
Running Qwen 27B agentic daily at long context: the split `-ctk q8_0 -ctv q4_0` is the practical sweet spot. K cache holds attention patterns and drives recall precision; V cache holds value projections and tolerates lower quant better. Pure Q4_0 on both degrades noticeably above 50k, especially on structured output and tool call fidelity (as diffore noted). K8/V4 gets you roughly 37% VRAM savings vs pure Q8 with almost no measurable quality hit in my testing. Q5_1 on both is also solid if you want the simpler config. What I avoid is Q4 on K specifically; that's where long-context recall breaks down first.
ea_man@reddit
Agreed, I run `-ctk q8_0 -ctv q5_1` too, up to ~75k context usually.
Jorlen@reddit (OP)
I'm ramping up the context right now on my setup, currently q8/q4 mix and so far it's very promising. I don't want to get ahead of myself (like usual..) so I'll try to reserve my excitement but so far it's looking good. Need to eat into more context! I'm really putting it to the test lol. I'm on the 35B (moe) version as I prefer its speed.
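For a back-of-envelope view of what a K/V mix buys, here is a small sketch that compares combinations by storage bits per cached element, using the GGML block layouts (`q8_0` = 8.5 bits, `q5_1` = 6 bits, `q4_0` = 4.5 bits, F16 = 16 bits). It assumes K and V hold the same number of elements and ignores every other buffer, so treat the ratios as format-level estimates and not as measured VRAM. The function names are made up for illustration.

```python
# Storage cost per cached element, in bits, from the GGML block layouts
# (block bytes * 8 / 32 elements). Assumes K and V are the same size.
BITS_PER_ELEM = {"f16": 16.0, "q8_0": 8.5, "q5_1": 6.0, "q4_0": 4.5}


def kv_bits(type_k, type_v):
    """Combined bits for one K element plus one V element."""
    return BITS_PER_ELEM[type_k] + BITS_PER_ELEM[type_v]


def saving_vs(baseline, candidate):
    """Fraction of KV cache storage saved by candidate relative to baseline."""
    return 1.0 - kv_bits(*candidate) / kv_bits(*baseline)


if __name__ == "__main__":
    mixes = [("q8_0", "q4_0"), ("q8_0", "q5_1"), ("q5_1", "q5_1"), ("q4_0", "q4_0")]
    for mix in mixes:
        vs_q8 = saving_vs(("q8_0", "q8_0"), mix)
        vs_f16 = saving_vs(("f16", "f16"), mix)
        print(f"K={mix[0]} V={mix[1]}: {vs_q8:.1%} vs q8_0/q8_0, {vs_f16:.1%} vs f16/f16")
```

Under these assumptions the q8_0/q4_0 split comes out around 24% smaller than q8_0 on both and around 59% smaller than F16 on both; real-world numbers depend on the model and backend.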
noctrex@reddit
I'm using my Qwopus3.6-27B variant with MTP added, and use Q4 KV 128k.
It works surprisingly well.
I've tested this across multiple sessions and seems very capable, and does not seem to forget easily.
kar200@reddit
Can you share your command please? I have the same card and tested a few different options but happy to use whats been tested already.
Also what coding tool do you use with it please?
noctrex@reddit
philmarcracken@reddit
I don't have as much VRAM as you, only a 12gb 3080ti. Using your Qwopus3.5-9B-Coder-MTP-Q8_0, so curious if I can use similar flags? Running ubuntu server, no monitor. I'm guessing I have to reduce to q6 for that much ctx
noctrex@reddit
I'm using it with pi agent, and nanocoder
Potential-Leg-639@reddit
Q4 is enough for Qwen3.6
kapteinpyn@reddit
I run Qwen3.6 27B (Qwen3.6-27B-UD-Q6_K_XL (MTP)) at 40tps tg with 131072 context at Q8 kv, one session. This has the best speed vs quality outcomes for me.
Reasonable_Flower_72@reddit
I’m cooking stuff with “hybrid” unsloth dynamic UD_Q5_K_XL, KV cache Q8. Doing Q4 KV cache turns qwen 3.6 mental.
3090+3060 , 180k ctx
Cmd:
phobeus@ai:~$ cat llama_qwen3.6
GGML_OP_OFFLOAD_MIN_BATCH=256 llama.cpp/build/bin/llama-server --host 0.0.0.0 -c 180000 -np 1 -ctk q8_0 -ctv q8_0 --no-mmap -fa on --hf-repo unsloth/Qwen3.6-27B-GGUF --hf-file Qwen3.6-27B-UD-Q5_K_XL.gguf --no-mmproj
aguspiza@reddit
Unless you really really need the VRAM/RAM go for q8_0... it is much better quality and for some reason you get slightly better performance, at least in pre-RTX CUDA cards.
Last_Mastod0n@reddit
Q4 loses too much quality for me. I usually choose the middle ground with an unsloth Q6 UD quant
Jorlen@reddit (OP)
Flipping to model quant - what have you observed from the quality loss of a really high quality 4-bit quant, such as UD Q4-K-XL? Vs a 6-bit quant? I am learning, so please tell me what your experience has been so far. Might save me some time.
Last_Mastod0n@reddit
So I haven't actually tried the UD Q4 quant. I just moved straight from the base Q4 to UD Q6. I'm sure a large part of the increase was coming from Unsloth's UD model.
I noticed a measurable increase in vision capabilities in particular. I also noticed it was better at matching a string X to the most similar string Y from a list of string Y options. I have not explicitly used it for coding yet other than some basic tests.
DeepVegetable@reddit
We're talking about the KV cache quantization, not the quantization of the model
Last_Mastod0n@reddit
Oh my bad. I need to learn to read 😂
tmvr@reddit
Stick to q8_0 for both K and V if you need space for more context.
hulk14@reddit
Q4_0 KV is usually fine until really long contexts, but once you push into 50k+ I start noticing more confusion, repetition, and weaker recall compared to Q8_0.
WPO42@reddit
I was wondering why... thx !
2Norn@reddit
depending on the model q8 is almost indistinguishable or terrible
Prudent-Ad4509@reddit
You do know that the correct answer is 16, right? as well as >64gb vram and at least Q8 model itself. Until then... it is passable, but you will stumble into the limitations pretty often.
ttkciar@reddit
It depends on the model, to a degree. Some are more sensitive to K/V cache quantization than others. Gemma 4 is particularly sensitive to it, for example.
Most models work fine with Q8_0 K/V cache quantization with little or no degradation. Gemma 4 shows noticeable degradation, but it's not too bad. If you really need to eke out a little more context space from your limited VRAM, it's a reasonable trade-off.
Q4_0 K/V cache quantization is a no-go. Significant competence degradation is evident for all models, and Gemma 4 acts like it's been lobotomized.
audioen@reddit
fp16 for KV, Q8_0 for model, and the 27b only because it is the only one that I think is good enough for largely unsupervised coding. I have not detected obvious degradation with the rotated q8_0 KV cache that llama.cpp has these days, but I've not been interested in using it either because it confers no speed benefit and I have the VRAM on a Strix Halo either way.
fragment_me@reddit
If you're using F16 KV cache might as well try BF16 (if your hardware can handle it). See here for some KLD benchmarks for Qwen3.5. https://techstat.net/qwen3-5-27b-q8-kv-cache-benchmarks-bf16-vs-f16-vs-q8_0/
fragment_me@reddit
It depends on the model quite a bit, I'm learning, based on various benchmarks. Q8 usually is pretty good but degrades at long context. I wouldn't go lower. I personally stick to the model's native KV cache precision now. For Qwen, that's actually BF16, not F16, which is the default in llama.cpp. If you really want to go lower, reduce V but keep K higher. E.g. K as BF16 and V as Q8_0.
eelkir@reddit
It seems to depend heavily on model, Gemma doesn't perform nearly as well with KV cache quantization as Qwen apparently: https://localbench.substack.com/p/kv-cache-quantization-benchmark
Karyo_Ten@reddit
Fp16/BF16
Quantized KV cache takes a hit in accuracy and also in performance, since it needs to be dequantized.
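To make the accuracy side concrete, here is a minimal sketch of block quantization with a shared per-block scale, round-tripped at 8 and 4 bits. It is a simplified symmetric round-to-nearest scheme in the spirit of `q8_0`/`q4_0`, not llama.cpp's exact kernels (the real `q4_0` picks its scale and offset differently), and the function names are made up for illustration.

```python
import random

BLOCK = 32  # q8_0 and q4_0 both group values in blocks of 32


def quantize_block(values, bits):
    """Simplified symmetric quantization: one shared scale per block."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """The extra step a quantized cache pays for whenever it is read."""
    return [scale * c for c in codes]


def max_abs_error(values, bits):
    """Largest round-trip error within one block."""
    scale, codes = quantize_block(values, bits)
    restored = dequantize_block(scale, codes)
    return max(abs(a - b) for a, b in zip(values, restored))


random.seed(0)
block = [random.gauss(0.0, 1.0) for _ in range(BLOCK)]
err8 = max_abs_error(block, 8)
err4 = max_abs_error(block, 4)
print(f"8-bit max round-trip error: {err8:.5f}")
print(f"4-bit max round-trip error: {err4:.5f}")
```

The 4-bit grid has 15 levels per block against 255 for 8-bit, so its worst-case rounding error is roughly 18 times larger for the same block, which is one way to see why Q4 cache is the setting people report trouble with.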
jacek2023@reddit
pmttyji@reddit
After last month's PR merge, Q8 is giving almost F16 quality. The PR has numbers for Q5 & Q4 too.
Mordimer86@reddit
K: q5_1, V: q4_1
hurdurdur7@reddit
Model q6 and up, context cache fp16
Adventurous-Gold6413@reddit
Q8
diffore@reddit
Lesser quant == more tool call errors. So it depends on the harness and the model, and how good both of them are at recovering from errors. If I can, I don't quantize the cache.
shaonline@reddit
Q8_0 is fine for the most part (I think several comparisons have been posted for Qwen 27B on this subreddit), Q4_0 introduces a small quality loss (Qwen is fairly resilient to quantizations it seems), generally this small of a model isn't really worth using with long contexts anyway so I'd stick to Q8 in your case.
Great_Guidance_8448@reddit
I haven't seen any degradation with KV Q8_0. Running Qwen 3.6 27B in Cline with 105k context on a mobile RTX 5090 24 gig VRAM.