Are you quanting your memory?
Posted by Plastic-Stress-6468@reddit | LocalLLaMA | 46 comments
Title.
Curious how people are generally dealing with the KV cache. BF16? Q8? Q4? Turboquant or some other secret sauce?
I run bf16 everything, hoping I'd get fewer hallucinations and because that's what the g4 and q3.6 are natively trained on anyway. But I'm very interested to hear whether people are having good results running q8 or q4, or using turbo3/4 or similar.
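For reference, the stock llama.cpp knobs for this are the cache-type flags; a minimal example (model path and context size are placeholders):

llama-server -m model.gguf -c 65536 -ctk q8_0 -ctv q8_0

Valid types include f16, bf16, q8_0, q5_1, q5_0, q4_1 and q4_0.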
GoodTip7897@reddit
At about 70k-ish context I was getting an occasional failed tool call or other hallucination from Qwen 3.6 27B UD-Q5_K_XL at Q8_0 K/V cache with llama.cpp (rotated).
I switched to bf16 so I no longer have to worry about whether I'm lobotomizing my model. I don't like the idea of the q5 weight error compounding with q8_0 KV over tens of thousands of tokens.
I notice bf16 almost never fails tool calls.
florinandrei@reddit
What hardware?
GoodTip7897@reddit
7900 XTX. Both Vulkan and ROCm backends tested.
LirGames@reddit
Try removing the Preserve Thinking option; there are known issues with it that break tool calling. Since I removed it, I've had a single tool call fail out of... thousands, I guess. UD-Q4_K_XL + 100K ctx at Q8 with rotation (and I often end up using all the context).
soyalemujica@reddit
I have not experienced any tool call errors with q8 on a Q5_K_M AutoRound quant. I get as far as 90k context.
Evanisnotmyname@reddit
Also dependent on hardware capabilities
LirGames@reddit
Q8 with Hadamard rotations (without rotations, BF16 is needed). Only minor issues above 100K context with Qwen3.6 27B UD-Q4_K_XL.
Plastic-Stress-6468@reddit (OP)
By issue do you mean tool call inconsistency or general incompetency?
LirGames@reddit
I mean forgetting some initial details while reasoning, or slightly increased verbosity in a piece of code that could easily be written better. Nothing too serious that you wouldn't catch while code reviewing.
As for tool calling instability, I found that removing the Preserve Thinking option makes Qwen3.6 27B 99.9% accurate. There are a few open issues on the topic in llama.cpp and elsewhere, but imho it's something to be solved at the template level (essentially, with the option on, Qwen ends up answering within thinking tags, and that breaks whatever harness or tool was using it).
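If your harness talks to llama-server directly, it may also be worth experimenting with the reasoning-format option, which controls how thinking content is separated from the actual reply (assuming a reasonably recent build; accepted values vary by version):

llama-server -m model.gguf --jinja --reasoning-format deepseek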
a_beautiful_rhind@reddit
I did my own testing like GG. For the models I use, both Q8 and Q4 are fine with rotations. I try to keep Q8 and maybe go down to Q6 in exchange for extra context that I'm probably not going to use most of the time.
Far_Course2496@reddit
Has anyone tried f32? It's an option in llama-server. I tried it on Qwen 3.6 30B A3B Q6_K and it gave me weird output right out of the gate.
Ardalok@reddit
Does anyone have experience with FP8 vs Q8 cache? Both in llama.cpp and other programs.
ThisGonBHard@reddit
Absolutely not.
Even with Q8, I see MAJOR degradation.
Plastic-Stress-6468@reddit (OP)
Very interesting dissenting opinion. Can you share more?
ThisGonBHard@reddit
There is not much to share: using ANY context quantization absolutely lobotomizes the models. Huge degradation that is very noticeable in code.
At Q4, the model outputs straight-up gibberish if any remotely complex task is sent. It's like using a model 10x smaller, at a Q1 quant.
Both also destroy image understanding.
jacek2023@reddit
q8 is slower than default on the models I use right now, so no
FatheredPuma81@reddit
Just ran a brief test with two 47,000-token inputs running in parallel. The only difference is I changed the context length from 409,600 to 204,800 for the default run so it would fit in VRAM.
The speed difference is so incredibly negligible on an RTX 4090. It's around 5000 t/s prefill and 94 t/s with the default, all the way down to a mind-numbingly slow 4800 t/s prefill and 89 t/s at Q4_0, with Q8_0 landing in between the two.
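If anyone wants to reproduce this, llama-bench takes the same cache-type flags, and comma-separated values will sweep multiple types in one run; a sketch with placeholder values:

llama-bench -m model.gguf -p 47000 -n 128 -fa 1 -ctk q8_0 -ctv q8_0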
Plastic-Stress-6468@reddit (OP)
I don't see people bring up Q5_1 much. How is it?
FatheredPuma81@reddit
Oh, and you also need to build llama.cpp yourself with some flag I can't remember (ask Gemini or Claude, they can tell you) if you want hardware acceleration for the prefill.
Plastic-Stress-6468@reddit (OP)
Do you mean -DGGML_CUDA=ON for GPU acceleration or another flag?
Asked a few models and then went to the build readme and didn't find any specific flag needed for hw accelerated prefill.
FatheredPuma81@reddit
I'm kind of surprised they weren't able to give it to you, actually. I opened my auto-update script and found it: it's -DGGML_CUDA_FA_ALL_QUANTS=ON
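So a from-scratch build with that plus CUDA looks roughly like this (per the llama.cpp build docs):

cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build --config Release -j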
FatheredPuma81@reddit
Performance-wise it seems fine. _1 should be better than _0 quality-wise. https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4150413357
Plastic-Stress-6468@reddit (OP)
Q4 loses too much information and Q8 is too slow, so take the context hit and stick with bf16?
No_Algae1753@reddit
Yes, I was experiencing the same. Q8 is just too slow at longer context.
apeapebanana@reddit
Exactly what I found out 2 days ago. Yesterday I tried vLLM, which runs Q8 with good speed. Qwen3.6 27B works way better with it, to my surprise. The config is a pain, but you can use a cloud LLM for support. Worth a try.
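One note if you try it: vLLM's KV cache quantization option is fp8 rather than an int8 q8, so the rough equivalent is something like (model name is a placeholder):

vllm serve your-org/your-model --kv-cache-dtype fp8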
jacek2023@reddit
I use 200000
KURD_1_STAN@reddit
With 12GB VRAM I really only have 2 options, Qwen 35B and Gemma 26B, and both run at an acceptable speed with fp16 while offloading some MoE layers to CPU. If the 27B could fit at q3 then I would have used KV q8 (probably). So I'm at fp16, but not by choice.
fredandlunchbox@reddit
TQ4/TQ4, but I want to switch to Saw. Qwen3.6 27B Unsloth Q5_K_XL as a coding agent, full 260k context. I use it as a tool call from my Claude to save tokens. Claude plans, Qwen implements.
Pentium95@reddit
I use Q8, except when I run both text gen and image gen, when I use Q5_1 KV cache quantization. But I never use more than 40k context.
kevin_1994@reddit
I never used memory quants until the attention rotation feature was merged into llama.cpp. Now I run with -ctv q8_0 -ctk q8_0 for Qwen 3.6 models and it works great. I don't notice any degradation.
Glum-Atmosphere9248@reddit
Is it the attn-rot flag?
Pentium95@reddit
There is no flag, it is always active.
homak666@reddit
I use q8 with Qwen 3.6 35B. I don't notice performance degrading from it, and I can fit way more context in my limited VRAM.
superdariom@reddit
I've run turbo3 with Qwen 3.5 and 3.6, both MoE and dense, with triattention as well, at 256k context, and the biggest problem I had overall was speed. Now I'm using the 3.6 MoE at 256k context with the stock q4_0 KV. I haven't really seen any errors at all, except maybe it might use the PyCharm MCP to exec a shell command when that isn't the preferred tool, but I can be more specific in my prompt and that resolves it.
ayylmaonade@reddit
Nope. I remember people saying Qwen 3.5/3.6 at q8 KV was "basically free" and using KLD numbers to back it up, but at long contexts (60K+) I find that q8 struggles while FP16 does just fine.
It's not worth it unless it's the only way to fit into your system, imo.
Sufficient_Sir_5414@reddit
Curious if anyone has benchmarked hallucination rates vs KV precision directly; feels like that data is still missing.
suprjami@reddit
Ooba did KLD tests and found Qwen 27B is basically unaffected except for long document tasks:
https://localbench.substack.com/p/kv-cache-quantization-benchmark
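For anyone who wants to run the same kind of measurement, iirc the llama-perplexity workflow is: save reference logits once with the unquantized cache, then compare the quantized cache against them (file names are placeholders):

llama-perplexity -m model.gguf -f test.txt --kl-divergence-base base-logits.bin
llama-perplexity -m model.gguf -f test.txt -ctk q8_0 -ctv q8_0 --kl-divergence-base base-logits.bin --kl-divergence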
SSOMGDSJD@reddit
Ooba's benches are rad. I wish he included some examples of bf16 vs quant responses for notable KLD findings so we could see what the actual differences were, e.g. just semantics or a completely wrong answer.
Klutzy-Snow8016@reddit
I use llama.cpp's default of fp16. I tried bf16, but it's multiple times slower on my hardware.
OneSlash137@reddit
Get ready for a shock. The lobotomized versions of the models aren't smarter than the baseline models.
Kahvana@reddit
For most models I use whatever gets me the context length I need, within reason.
As an example: if I need 128k context and it fits in BF16, great! If it doesn't, I drop to Q8_0, test first whether it's good enough for my use case, commit to it if so, and so on.
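For rough sizing, per-token KV cache is about 2 × n_layers × n_kv_heads × head_dim × bytes_per_element. With made-up example numbers (48 layers, 8 KV heads, head dim 128, f16), that's 2 × 48 × 8 × 128 × 2 bytes ≈ 0.2 MB per token, so 128k of context costs about 25 GB at f16 and roughly half that at q8_0.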
anomaly256@reddit
Do you even quant bro
dontbeeadick@reddit
Need good solutions, having tons of memory problems w/ my agents. Great question.
getstackfax@reddit
Following this. I’m more familiar with the high-level local vs cloud / hardware-fit side, but KV cache quantization seems like one of those details where the “right” answer depends heavily on model, context length, hardware, and whether you’re optimizing for speed, memory, or output quality.
tvall_@reddit
I use q8_0 because I'm poor and just have a couple of Radeon Pro V340Ls for a total of 32GB VRAM, and I want really long context even though I don't actually use much of it very often.
I previously did q4_0 when I had just one of the cards and was running qwen3-vl-24b-reap, and I didn't notice any issues, but I wasn't doing as much with it back then.
PattF@reddit
I use q8, pretty much the same output as f16 but half the memory.