Developers who use local AI - Q4_0 vs Q8_0 KV quant?

[-]

Stepfunction@reddit

The quality loss at Q4 is pretty severe. I'd recommend the Q5_1 option instead.

[-]

No-Setting8461@reddit

Have you tried q4 with attn-rot though? I never even considered anything below q8 before it was merged, but now I'm using q5_1/q4_0 for Gemma 4 because the cache size bloats so much, and it's been pretty OK.

[-]

DinoAmino@reddit

Every optimization choice is a give and take. Given how much talk there is about tps speed benchmarks it would seem many here don't care as much about accurate responses.

[-]

sonicnerd14@reddit

From what I understand it depends. It will speed up generation speeds in some cases. So it's not like it's an exact science.

[-]

Stepfunction@reddit

Good callout. Quantized cache impacting generation speed is definitely another consideration. I mostly use LLMs for creative generations, so it's not something I pay much attention to.

[-]

suicidaleggroll@reddit

I use zero KV quantization. Even Q8 is too much for coding tasks IMO.

[-]

This is odd. I have been rocking Q8 for complex C++ tasks in backend + frontend development, editor and game-server work, and I have not once experienced any hallucinations, or even issues with numbers across a wide range of X-Y-Z coordinates in script files, even at over 100k context.

[-]

graypasser@reddit

It just depends on the harness itself. If the harness forces proper testing, you're unlikely to have problems in coding tasks.

[-]

MutantEggroll@reddit

Assuming you use llama.cpp - do you still find this to be true after `attn-rot` got merged? I used to be a hardline unquantized KV guy too, but I tried `q8_0` with `attn-rot` and I can't tell the difference in the coding tasks that I tend to give it (Python, PowerShell, Ansible).

[-]

suicidaleggroll@reddit

I haven’t tried since attn-rot was added, but my understanding of benchmarks at the time is that it only really helped for Q4 and had basically no effect at Q8.

[-]

MutantEggroll@reddit

It's definitely worth checking out. Though it did have a larger impact on lower-quant KV like `q4_0`, it did also improve `q8_0`. Check out the KLD charts that AesSedai posted in the original PR.

It's certainly model and use-case specific, but in my experience, I haven't noticed any capability loss - no reasoning loops, no tool call failures, etc. And I was able to double my context window, which really broadened the tasks I could hand to my models (mostly Qwen3.6-27B and Gemma4-31B).

[-]

sonicnerd14@reddit

It depends on the model. Some models can handle q8 kv quant rather well like qwen 27b. Q4 is definitely too much though.

[-]

Jorlen@reddit (OP)

Can you explain? I'm new to this (few weeks ago) so I don't quite get it. Does this mean it defaults to full FP16 quant for KV?

[-]

suicidaleggroll@reddit

Yes default is F16, anything less than that is quantized, to varying effect. While model weights can typically be quantized to Q8 without any appreciable loss in accuracy, and Q4 with only minimal loss, the same is not true for the KV cache.

[-]

arbv@reddit

FP16 is, technically, not a quant for KV. You may also consider BF16 for models trained in this precision (most recent ones are) if your hardware supports it (modern hardware does). KV cache quantisation is trading precision for VRAM space, even Q8_0.
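
To put rough numbers on that precision-for-VRAM trade, here is a minimal Python sketch that estimates KV cache size for a given cache type. The block byte counts are the ggml block layouts as I understand them (32 values per block plus scale metadata). The layer count, KV head count and head dimension in the usage lines are made-up placeholders, not the dimensions of any model named in this thread, and the formula assumes every layer keeps a full-attention KV cache, which is not true for hybrid or sliding-window models.

```python
# Bytes used by one block of 32 cached values for each cache type.
# f16/bf16 are plain 2-byte values, so 32 of them take 64 bytes.
BLOCK_SIZE = 32
BLOCK_BYTES = {
    "f16": 64,
    "bf16": 64,
    "q8_0": 34,  # 32 int8 values + one fp16 scale
    "q5_1": 24,
    "q5_0": 22,
    "q4_1": 20,
    "q4_0": 18,  # 32 4-bit values + one fp16 scale
}


def bytes_per_value(cache_type):
    """Average storage cost of one cached value, in bytes."""
    return BLOCK_BYTES[cache_type] / BLOCK_SIZE


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, type_k, type_v):
    """Estimate total KV cache size, assuming every layer caches full K and V."""
    values_per_token = n_layers * n_kv_heads * head_dim
    k_bytes = values_per_token * bytes_per_value(type_k)
    v_bytes = values_per_token * bytes_per_value(type_v)
    return int(n_ctx * (k_bytes + v_bytes))


# Hypothetical model shape, chosen only for illustration.
SHAPE = dict(n_ctx=131072, n_layers=48, n_kv_heads=8, head_dim=128)

for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(type_k=cache_type, type_v=cache_type, **SHAPE)
    print(f"{cache_type}: {size / 2**30:.2f} GiB")
```

Swap in the real shape of your model (it is printed in the llama.cpp load log) before trusting the totals.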

[-]

Valuable_Touch5670@reddit

I second this. For me, it’s more about overhead. For some reason my TG drops quite a bit whenever KV quantization is enabled.

May be anecdotal to just my HW setup or llama-server settings…

[-]

jessez05@reddit

This

[-]

fasti-au@reddit

Turbo quant and dflash. Beellama

[-]

unjustifiably_angry@reddit

Q4 is unusable and Q8 is supposed to be nearly perfect but in my testing I find F16 more reliable. Might be placebo. Q4 is unusable though.

[-]

NigaTroubles@reddit

For me it's kinda usable up to 64k. That's my limit. Qwen3.6 35B A3B Q8, MTP.

[-]

Jorlen@reddit (OP)

What KV quantization do you have in your setup? Q4_0 vs Q8_0. Not model quant, but KV cache quant. For example:

From my docker yaml llama snippet:

  - "--cache-type-k"
  - "q8_0"                        # KV cache quantization: 8-bit = high precision, 32GB headroom
  - "--cache-type-v"
  - "q8_0"
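
If you generate that argument list from a script rather than hand-editing YAML, a small helper can catch typos in the cache type before the server starts. This is only a sketch: `llama_server_args` and `THREAD_CACHE_TYPES` are hypothetical names, the flags are the ones that appear in this thread, and the allowed set is just the cache types people mention here, so the set your llama.cpp build actually accepts may differ.

```python
# Cache types named in this thread; an assumption, not the full list
# that any particular llama.cpp build accepts.
THREAD_CACHE_TYPES = {"f16", "bf16", "q8_0", "q5_1", "q4_1", "q4_0"}


def llama_server_args(model_path, ctx_size, cache_k="f16", cache_v="f16"):
    """Build the llama-server argument list for a given K/V cache type pair."""
    for name, value in (("cache_k", cache_k), ("cache_v", cache_v)):
        if value not in THREAD_CACHE_TYPES:
            raise ValueError(
                f"{name}={value!r} is not one of {sorted(THREAD_CACHE_TYPES)}"
            )
    return [
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        "--cache-type-k", cache_k,
        "--cache-type-v", cache_v,
    ]


# "model.gguf" is a placeholder path.
args = llama_server_args("model.gguf", 131072, cache_k="q8_0", cache_v="q4_0")
print(" ".join(args))
```

Each element of the returned list maps onto one `- "..."` entry in a compose file like the one above.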

[-]

NigaTroubles@reddit

k q8_0

v q4_0

[-]

Jorlen@reddit (OP)

Interesting, didn't know you could mix the two. I'll try this setup. I found this:

Configuring an asymmetric local LLM KV cache, where Key (K) uses q8_0 and Value (V) uses q4_0, is an excellent way to balance VRAM savings with generation quality. The K cache is far more sensitive to quantization noise, while the V cache can be heavily compressed with little to no impact on accuracy.
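
For a feel of how much space each K/V pairing saves, here is a small Python sketch that compares combinations by bits per cached value. The bit widths follow from the ggml block layouts as I understand them (for example, q8_0 stores 32 int8 values plus a 16-bit scale, so 8.5 bits per value). It assumes K and V hold the same number of values per token, and it only covers the cache itself, not weights or compute buffers.

```python
# Effective bits per cached value, including per-block scale metadata.
BITS_PER_VALUE = {
    "f16": 16.0,
    "q8_0": 8.5,
    "q5_1": 6.0,
    "q5_0": 5.5,
    "q4_1": 5.0,
    "q4_0": 4.5,
}


def relative_size(type_k, type_v, base_k="f16", base_v="f16"):
    """Cache size of (type_k, type_v) as a fraction of the baseline pair."""
    used = BITS_PER_VALUE[type_k] + BITS_PER_VALUE[type_v]
    base = BITS_PER_VALUE[base_k] + BITS_PER_VALUE[base_v]
    return used / base


COMBOS = [
    ("q8_0", "q8_0"),
    ("q8_0", "q5_1"),
    ("q8_0", "q4_0"),
    ("q5_1", "q4_1"),
    ("q4_0", "q4_0"),
]

for k, v in COMBOS:
    print(f"K={k:5s} V={v:5s} -> {relative_size(k, v):.1%} of f16/f16")
```

Passing `base_k="q8_0", base_v="q8_0"` gives the same comparison against a pure q8_0 cache instead of f16.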

[-]

Icy_Butterscotch6661@reddit

It'll slow down generation

[-]

Jorlen@reddit (OP)

Not noticing much change. I'm on Q8 for K and Q4 for V, 250k context. I downloaded a smaller quant of the model I was using and it seems fine so far, but I haven't delved into the 150k context range yet; we'll see. I ballooned my current project on purpose, feeding a local agent a huge number of files and .MD architecture documents to test context boundaries and find where things break down.

Model used now: Qwen3.6-35B-A3B-UD-Q4_K_XL. The Q8 K / Q4 V cache at 250k context fills my VRAM to 25/32 GB. Speed is excellent, but would likely suffer with the 27B dense model of 3.6.

[-]

Icy_Butterscotch6661@reddit

Interesting, that was based on me comparing things like 3 weeks ago on one of the qwen3.6 models. Using q8 and q4 together, or using q6(?) was resulting in a significant slowdown. Q8+q8 or q4+q4 was faster.

[-]

do011@reddit

I can confirm: q8+q8 gives 41 t/s, which drops to 6.5 t/s for q8+q4 in the agent (the web UI still shows 41 t/s).

[-]

Jorlen@reddit (OP)

Negligible change for me. Maybe 5 tok/sec difference. I'm a little over 100 tok/sec but it does slow down once I enter the 100k+ context realm, which I don't plan to do often outside of just testing / fucking around.

[-]

ea_man@reddit

You may have a broken llama pull, happened often in recent weeks.

[-]

jrodder@reddit

I fought this all day yesterday. At least with MTP, mixing the cache quant had me reverting to CPU. I ended up having to keep them both at Q8_0.

[-]

Alternative-Cat-1347@reddit

Why are you limiting it to 64k? In my case I often pass 200k and things are still fine. KV q8_0 for both.

[-]

Operation_Neither@reddit

Whatever fits in VRAM

[-]

rpkarma@reddit

BF16 KV cache. Everything else has notable degradation of accuracy in all of my evals

[-]

FoxiPanda@reddit

Models at Q5/Q6/Q8 and KV cache at bf16 where I can keep a reasonable context size, Q8_0 where I can't.

[-]

superdariom@reddit

I have 24 GB and run Qwen 3.6 35B Q8 with the full 256k context and no KV quantisation. You can run even faster than me, I expect. I offload MoE to CPU until it fits and also use ubatch 4096, batch 3072.

[-]

Rikers88@reddit

This is my go to

Beellama
Qwen3.6 27b UD q4 K xl
350k context
KV cache: K turbo4, V turbo3
DFlash: drafter model from spiritbuun Q8

It's working well for me on coding. If you want, I can share the complete command I use to spawn the server.

To increase quality, I would suggest going Q8 on the K of the KV cache.

When I was running Q8 on the K of the cache, I had almost zero tool-use errors with Cline as the coding agent, while with this new setup they happen more often. Not a big deal, since Cline then retries.

5090 here

[-]

gazzamc@reddit

Would you mind sharing the command? I'm interested. Thanks.

[-]

Jorlen@reddit (OP)

Holy smokes. Talk about bleeding edge of a bleeding edge! I didn't know about this Beellama fork, it seems interesting!

Are you using the docker image or local install?

[-]

Rikers88@reddit

Nope, I'm on Linux and I compiled it from source. If you're on Windows, I suggest compiling it in a docker image instead.

Beellama is really good as it has speculative decoding, which is faster than the MTP they just merged in the main repo, and it also has turboquant, which is really good compression, better than the standard Q4. If you give the paper a read you'll see that the algorithm is really smart.

If you can go fp16, no question. Q8 is OK, especially if you have native hardware for it, but if you have to go q4 or lower then turboquant is a must in my opinion.

PS if you use context longer than 256k, then use RoPE for context extrapolation up to 1M.

[-]

RevolutionaryLime758@reddit

If you quant your cache you might be an idiot

[-]

laul_pogan@reddit

Running Qwen 27B agentic daily at long context: the split -ctk q8_0 -ctv q4_0 is the practical sweet spot. K cache holds attention patterns and drives recall precision; V cache holds value projections and tolerates lower quant better. Pure Q4_0 on both degrades noticeably above 50k, especially on structured output and tool call fidelity (as diffore noted). K8/V4 gets you roughly 37% VRAM savings vs pure Q8 with almost no measurable quality hit in my testing. Q5_1 on both is also solid if you want the simpler config. What I avoid is Q4 on K specifically; that's where long-context recall breaks down first.

[-]

ea_man@reddit

Agreed, I run -ctk q8_0 -ctv q5_1 too, up to ~75k context usually.

[-]

Jorlen@reddit (OP)

I'm ramping up the context right now on my setup, currently q8/q4 mix and so far it's very promising. I don't want to get ahead of myself (like usual..) so I'll try to reserve my excitement but so far it's looking good. Need to eat into more context! I'm really putting it to the test lol. I'm on the 35B (moe) version as I prefer its speed.

[-]

noctrex@reddit

I'm using my Qwopus3.6-27B variant with MTP added, and use Q4 KV 128k.
It works surprisingly well.
I've tested this across multiple sessions and seems very capable, and does not seem to forget easily.

[-]

kar200@reddit

Can you share your command please? I have the same card and tested a few different options, but I'm happy to use what's been tested already.

Also what coding tool do you use with it please?

[-]

noctrex@reddit

C:/Programs/AI/llamacpp-vulkan/llama-server.exe --port 8080 --metrics --jinja --ctx-checkpoints 256 --webui-mcp-proxy --parallel 1 
--model Q:/Models/LLM/Qwopus3.6-27B-v1-preview-MTP-IQ4_XS.gguf 
--cache-type-k q4_0 
--cache-type-v q4_0 
--ctx-size 131072 
--fit-ctx 131072 
--temp 0.6 --top-p 0.95
--top-k 20 --min_p 0.00 --presence-penalty 0.0 --repeat-penalty 1.0
--chat-template-kwargs "{\"preserve_thinking\": true}" 
--spec-type draft-mtp
--spec-draft-p-min 0.75
--spec-draft-n-max 3

[-]

philmarcracken@reddit

I don't have as much VRAM as you, only a 12 GB 3080 Ti. I'm using your Qwopus3.5-9B-Coder-MTP-Q8_0, so I'm curious whether I can use similar flags. Running Ubuntu Server, no monitor. I'm guessing I'd have to drop to q6 for that much ctx.

[-]

noctrex@reddit

I'm using it with pi agent, and nanocoder

[-]

Potential-Leg-639@reddit

Q4 is enough for Qwen3.6

[-]

kapteinpyn@reddit

I run Qwen3.6 27B (Qwen3.6-27B-UD-Q6_K_XL, MTP) at 40 tps tg with 131072 context at Q8 KV, one session. This has the best speed vs quality outcome for me.

[-]

Reasonable_Flower_72@reddit

I’m cooking stuff with the “hybrid” unsloth dynamic UD_Q5_K_XL, KV cache Q8. Going to Q4 KV cache turns Qwen 3.6 mental.

3090+3060 , 180k ctx

Cmd:

phobeus@ai:~$ cat llama_qwen3.6
GGML_OP_OFFLOAD_MIN_BATCH=256 llama.cpp/build/bin/llama-server --host 0.0.0.0 -c 180000 -np 1 -ctk q8_0 -ctv q8_0 --no-mmap -fa on --hf-repo unsloth/Qwen3.6-27B-GGUF --hf-file Qwen3.6-27B-UD-Q5_K_XL.gguf --no-mmproj

[-]

aguspiza@reddit

Unless you really really need the VRAM/RAM go for q8_0... it is much better quality and for some reason you get slightly better performance, at least in pre-RTX CUDA cards.

[-]

Last_Mastod0n@reddit

Q4 loses too much quality for me. I usually choose the middle ground with an unsloth Q6 UD quant

[-]

Jorlen@reddit (OP)

Flipping to model quant - what have you observed from the quality loss of a really high quality 4-bit quant, such as UD Q4-K-XL? Vs a 6-bit quant? I am learning, so please tell me what your experience has been so far. Might save me some time.

[-]

Last_Mastod0n@reddit

So I haven't actually tried the UD Q4 quant. I just moved straight from the base Q4 to UD Q6. I'm sure a large part of the increase was coming from unsloth's UD model.

I noticed a measurable increase in vision capabilities in particular. I also noticed it was better at matching a string X to the most similar string Y from a list of string Y options. I have not explicitly used it for coding yet other than some basic tests.

[-]

DeepVegetable@reddit

We're talking about the KV cache quantization, not the quantization of the model

[-]

Last_Mastod0n@reddit

Oh my bad. I need to learn to read 😂

[-]

tmvr@reddit

Stick to q8_0 for both K and V if you need space for more context.

[-]

hulk14@reddit

Q4_0 KV is usually fine until really long contexts, but once you push into 50k+ I start noticing more confusion, repetition, and weaker recall compared to Q8_0.

[-]

WPO42@reddit

I was wondering why... thx !

[-]

2Norn@reddit

depending on the model q8 is almost indistinguishable or terrible

[-]

Prudent-Ad4509@reddit

You do know that the correct answer is 16, right? As well as >64 GB VRAM and at least a Q8 model itself. Until then... it is passable, but you will stumble into the limitations pretty often.

[-]

ttkciar@reddit

It depends on the model, to a degree. Some are more sensitive to K/V cache quantization than others. Gemma 4 is particularly sensitive to it, for example.

Most models work fine with Q8_0 K/V cache quantization with little or no degradation. Gemma 4 shows noticeable degradation, but it's not too bad. If you really need to eke out a little more context space from your limited VRAM, it's a reasonable trade-off.

Q4_0 K/V cache quantization is a no-go. Significant competence degradation is evident for all models, and Gemma 4 acts like it's been lobotomized.

[-]

audioen@reddit

fp16 for KV, Q8_0 for model, and the 27b only because it is the only one that I think is good enough for largely unsupervised coding. I have not detected obvious degradation with the rotated q8_0 KV cache that llama.cpp has these days, but I've not been interested in using it either because it confers no speed benefit and I have the VRAM on a Strix Halo either way.

[-]

fragment_me@reddit

If you're using F16 KV cache might as well try BF16 (if your hardware can handle it). See here for some KLD benchmarks for Qwen3.5. https://techstat.net/qwen3-5-27b-q8-kv-cache-benchmarks-bf16-vs-f16-vs-q8_0/

[-]

fragment_me@reddit

It depends on the model quite a bit, I'm learning, based on various benchmarks. Q8 is usually pretty good but degrades at long context. I wouldn't go lower. I personally stick to the model's native KV cache type now. For Qwen, that's actually BF16, not the F16 that llama.cpp defaults to. If you really want to go lower, reduce V but keep K higher, e.g. K as BF16 and V as Q8_0.

[-]

eelkir@reddit

It seems to depend heavily on the model. Gemma apparently doesn't perform nearly as well with KV cache quantization as Qwen: https://localbench.substack.com/p/kv-cache-quantization-benchmark

[-]

Karyo_Ten@reddit

Fp16/BF16

A quantized KV cache takes a hit in accuracy and also in performance, since it needs to be dequantized.
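
To see where the accuracy hit comes from, here is a toy Python sketch of a quantize/dequantize round trip on blocks of 32 values. It is a simplified symmetric scheme (one scale per block, signed integers), not llama.cpp's exact q8_0 or q4_0 code, and it runs on random Gaussian numbers rather than real K/V activations, so it shows rounding error only and says nothing about model quality.

```python
import math
import random

BLOCK = 32  # values that share one scale


def roundtrip(values, bits):
    """Quantize each block to signed `bits`-bit integers, then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    out = []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        amax = max(abs(x) for x in block)
        scale = amax / qmax if amax > 0 else 1.0
        for x in block:
            q = max(-qmax, min(qmax, round(x / scale)))
            out.append(q * scale)  # the dequantize step paid on every read
    return out


def rms_error(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(4096)]
err8 = rms_error(data, roundtrip(data, 8))
err4 = rms_error(data, roundtrip(data, 4))
print(f"8-bit round-trip RMS error: {err8:.5f}")
print(f"4-bit round-trip RMS error: {err4:.5f}")
```

The 4-bit error comes out much larger because each block has only 15 levels to cover the same range that the 8-bit version covers with 255.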

[-]

jacek2023@reddit

./bin/llama-server -c 200000 -m /mnt/models2/Qwen/3.6/Qwen3.6-27B-Q8_0.gguf --host 0.0.0.0 --jinja -fa on --keep 4096 -b 8192 --spec-type ngram-mod --parallel 1 --ctx-checkpoints 24 --checkpoint-every-n-tokens 8192 --cache-ram 65536 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 --presence-penalty 0 --repeat-penalty 1.0 --spec-type draft-mtp --spec-draft-n-max 3

[-]

pmttyji@reddit

After last month's PR merge, Q8 gives almost F16 quality. The PR has numbers for Q5 & Q4 too.

[-]

Mordimer86@reddit

K: q5_1, V: q4_1

[-]

hurdurdur7@reddit

Model q6 and up, context cache fp16

[-]

Adventurous-Gold6413@reddit

Q8

[-]

diffore@reddit

A lower quant means more tool call errors, so it depends on the harness and the model and how good both are at recovering from errors. If I can, I don't quantize the cache.

[-]

shaonline@reddit

Q8_0 is fine for the most part (I think several comparisons have been posted for Qwen 27B on this subreddit). Q4_0 introduces a small quality loss (Qwen seems fairly resilient to quantization). Generally a model this small isn't really worth using with long contexts anyway, so I'd stick to Q8 in your case.

[-]

Great_Guidance_8448@reddit

I haven't seen any degradation with KV Q8_0. Running Qwen 3.6 27B in Cline with 105k context on a mobile RTX 5090 with 24 GB VRAM.