[llama.cpp] Asymmetric KV q8/q4 cache: current caveats and discussion in GGML repo

Posted by Ueberlord@reddit | LocalLLaMA | 27 comments

Most of you are probably aware that starting llama.cpp with any quantized KV cache combination other than `-ctk q8_0 -ctv q8_0` or `-ctk q4_0 -ctv q4_0` leads to prompt processing on the CPU instead of the GPU, at least for CUDA. E.g. when we use the frequently suggested mix of `-ctk q8_0 -ctv q4_0`, prompt processing speed tanks. I discussed this with a proprietary LLM, and it suggested either making some slight modifications to the CUDA source code of llama.cpp or building with `cmake -DGGML_CUDA_FA_ALL_QUANTS=ON ..`, which makes compilation take very long.

Coincidentally, user sanmai on GitHub did a small eval and suggested including this KV cache quant combo during compilation even without FA_ALL_QUANTS, which would be great. The discussion is worth a read, as the eval confirms that the asymmetric 8/4-bit KV quant costs only 1.3% precision while saving more than half of the memory compared to f16/f16: https://github.com/ggml-org/llama.cpp/discussions/23470
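
The "more than half of the memory" figure can be sanity-checked from ggml's block layouts, where every quantized type packs 32 values into a fixed number of bytes. The following is a back-of-the-envelope sketch, not llama.cpp code: the helper names are made up for illustration, and it assumes K and V hold the same number of elements (true for standard attention, not necessarily for other attention layouts) and counts only the raw cache data.

```python
# Bytes used by one 32-value block of each cache type (f16 listed for reference).
BLOCK_BYTES = {
    "f16": 64,   # 32 values x 2 bytes
    "q8_0": 34,  # 2-byte scale + 32 x 8-bit values
    "q5_1": 24,  # 2-byte scale + 2-byte min + 4 bytes of high bits + 16 bytes of low nibbles
    "q5_0": 22,  # 2-byte scale + 4 bytes of high bits + 16 bytes of low nibbles
    "q4_1": 20,  # 2-byte scale + 2-byte min + 16 bytes of nibbles
    "q4_0": 18,  # 2-byte scale + 16 bytes of nibbles
}


def bits_per_value(cache_type):
    """Average storage cost of one cached value, in bits."""
    return BLOCK_BYTES[cache_type] * 8 / 32


def pair_size_vs_f16(k_type, v_type):
    """Size of a K/V type pair relative to f16/f16 (assumes equal K and V element counts)."""
    return (bits_per_value(k_type) + bits_per_value(v_type)) / (2 * bits_per_value("f16"))


if __name__ == "__main__":
    for k_type, v_type in [("q8_0", "q8_0"), ("q8_0", "q5_1"), ("q8_0", "q4_0"), ("q4_0", "q4_0")]:
        ratio = pair_size_vs_f16(k_type, v_type)
        print(f"{k_type}/{v_type}: {ratio:.1%} of f16/f16, saves {1 - ratio:.1%}")
```

Under these assumptions q8_0/q4_0 comes out at 13/32 (about 41%) of the f16/f16 size and q8_0/q5_1 at 14.5/32 (about 45%), so both pairs save more than half.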

Anbeeld@reddit

Don't use q8_0 / q4_0, please. It's too unbalanced and can wreck your tool calls and other precise data that gets squashed into lossy q4_0. There are better alternatives; q8_0 / q5_1 is a premier one. Benchmark data: https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context

Pristine-Woodpecker@reddit

This table shows `q5_1-q4_1` beating `q8_0-q4_1`.

Anbeeld@reddit

> You deleted the thread where you first posted this because the reaction was so negative

No, I deleted it because people ignored benchmarks, and were shitposting about the summary.

> This table shows q5_1-q4_1 beating q8_0-q4_1.

No, it does not show what you stated.

> What's the source for this "unbalanced, can wreck your tool calls"? Isn't it just noise in your measurement?

I actually use local LLMs and do a lot of testing, instead of just LARPing.

Pristine-Woodpecker@reddit

> I actually use local LLMs and do a lot of testing, instead of just LARPing.

Ok, that being the argument tells me all I need to know now.

Anbeeld@reddit

You missed the point, which is that if you stalk people over various Reddit threads and continuously try to devalue their work, this might negatively affect their desire to have a scientific debate with you.

skullfuckr42@reddit

q5_1 seems to be bugged, it uses 100% CPU for no reason

Anbeeld@reddit

`cmake -DGGML_CUDA_FA_ALL_QUANTS=ON`

skullfuckr42@reddit

oh man, I spent hours fiddling with params without being able to figure out what was happening, tnx

chimpera@reddit

What might be interesting is to try to map different model quants and KV quants in terms of combined VRAM budget. In other words, combined VRAM budget vs. 99.9% loss.

Anbeeld@reddit

Well it might be but I'm not that big into fancy graphs, you can still make this comparison by reading the full tables so there's that.

chimpera@reddit

I meant more normalizing the data against the top model and KV quant and sorting it by VRAM budget. I appreciate the data.
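
A rough sketch of what such a combined-budget table could look like is below. Everything model-specific in it is a hypothetical placeholder (the layer count, KV head count, head dimension and weight file sizes are made up, not taken from any benchmark), it assumes a plain transformer where every layer caches full-context K and V, and it ignores compute buffers. The quality column is deliberately left out: it would have to be joined in from measured data such as the linked benchmark tables.

```python
from itertools import product

# Bits per cached value, derived from ggml's 32-value block layouts.
BITS_PER_VALUE = {"f16": 16.0, "q8_0": 8.5, "q5_1": 6.0, "q4_0": 4.5}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, k_type, v_type):
    """Raw K+V cache size in GiB, assuming every layer caches the full context."""
    values_per_tensor = n_layers * n_kv_heads * head_dim * n_ctx  # same count for K and for V
    total_bits = values_per_tensor * (BITS_PER_VALUE[k_type] + BITS_PER_VALUE[v_type])
    return total_bits / 8 / 2**30


# HYPOTHETICAL inputs for illustration only; replace with your model's real numbers.
MODEL_SHAPE = {"n_layers": 48, "n_kv_heads": 8, "head_dim": 128}
WEIGHT_FILE_GIB = {"Q8_0": 30.0, "Q4_K_M": 18.0}  # made-up GGUF file sizes
KV_PAIRS = [("f16", "f16"), ("q8_0", "q8_0"), ("q8_0", "q5_1"), ("q8_0", "q4_0")]


def budget_table(n_ctx):
    """All weight-quant x KV-pair combinations, sorted by combined size in GiB."""
    rows = []
    for (weights, weights_gib), (k_type, v_type) in product(WEIGHT_FILE_GIB.items(), KV_PAIRS):
        cache_gib = kv_cache_gib(n_ctx=n_ctx, k_type=k_type, v_type=v_type, **MODEL_SHAPE)
        rows.append((weights_gib + cache_gib, weights, f"{k_type}/{v_type}"))
    return sorted(rows)


if __name__ == "__main__":
    for total_gib, weights, kv_pair in budget_table(n_ctx=65536):
        print(f"{total_gib:7.2f} GiB  weights={weights:<7} kv={kv_pair}")
```

Sorting by the combined figure makes it easy to see which weight quant and KV pair fit a given card before comparing their measured precision.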

Anbeeld@reddit

There are already columns that compare precision and size vs bf16, and pairs are sorted by size. Is that not what you mean?

ea_man@reddit

Also worth noting is the relation between context length and quantization: you may make do with, say, q4_0 when you have 20k context, yet as you go up to 80k - 120k you're gonna need q8_0 to keep it worthwhile.

draconic_tongue@reddit

https://github.com/ggml-org/llama.cpp/pull/21152

Every combination I tested won and lost on the same exact questions, and this is not short context: for some questions qwen 3.6 thinks for 90k tokens.

Anbeeld@reddit

How many AIME runs per 1 quant pair?

Look_0ver_There@reddit

That resource is fantastic. Truly well done. I have a question. F16 vs BF16 for KV Cache. A fair number of cards run BF16 more slowly than F16. Do you have any fidelity data regarding choosing F16 vs BF16?

Anbeeld@reddit

For now the linked benchmark is all I have, but I'll probably do some more in the future. The F16 vs BF16 is not the problem I face with my single 3090 tho, for me personally the main question was about q5 and q4.

Look_0ver_There@reddit

I understand. It just occurred to me when I saw your inclusion of performance beside the KLD results. The inclusion of FP16 would make the investigation more complete. I also poked around online for other references regarding FP16 vs BF16, and there seem to be some resources that say BF16's benefits are overrated and that you should just use FP16 and be happy, while some say BF16 is better for very long contexts. It'd be nice to have some better information on that one though, since FP16 is usually more performant.
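
As background for the FP16 vs BF16 question, the trade-off between the two formats themselves can be shown with the standard library alone: FP16 has 10 mantissa bits but a maximum of 65504, while BF16 has only 7 mantissa bits but the same exponent range as float32. The sketch below uses made-up helper names, skips NaN handling, and says nothing about which format gives better KV cache fidelity in practice; that still needs measurement.

```python
import struct


def to_fp16(x):
    """Round x to IEEE half precision (5 exponent bits, 10 mantissa bits).

    Raises OverflowError when x is too large for FP16.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    """Round x to bfloat16 (8 exponent bits, 7 mantissa bits), round-to-nearest-even.

    Emulated by keeping the top 16 bits of the float32 encoding; NaN is not special-cased.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to the even mantissa
    bits &= 0xFFFF0000                   # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    for x in (1.001, 70000.0):
        try:
            fp16 = to_fp16(x)
        except OverflowError:
            fp16 = "overflow"
        print(f"x={x}: fp16={fp16}, bf16={to_bf16(x)}")
```

For values near 1, FP16 resolves steps of 2^-10 while BF16 only resolves steps of 2^-7; for large magnitudes BF16 still has a representation where FP16 overflows.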

Ueberlord@reddit (OP)

Thanks for the link. I would not generally dismiss q8/q4; I think the 5% space savings compared to q8/q5_1 in memory can pay off depending on the available VRAM. If you have 5% to spare, q8/q5_1 seems like a great choice, so I would definitely use that in case it becomes available for GPU.

JGeek00@reddit

What’s the difference between q5_0 and q5_1?

Anbeeld@reddit

Quants with _1 store one additional value per block (a minimum next to the scale) for asymmetry, which helps with precision in many cases. That makes them larger too, but still smaller than the next full quant, so they are basically a good middle ground with more gains for the V cache specifically. Personally, q4_0 wrecks my tool calls A LOT every time I try it.
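
To make the _0 vs _1 difference concrete, here is a simplified sketch of the two ideas on a single 32-value block. It is not ggml's actual kernel (the real formats store the scale and minimum as fp16 and map levels slightly differently); the function names are invented for illustration. The point it shows is that a block whose values do not straddle zero wastes levels with a scale-only scheme, while a scale-plus-minimum scheme spends all of its levels on the actual value range.

```python
def quant_scale_only(block, bits=4):
    """_0-style idea: one scale per block, signed integer levels centred on zero."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    amax = max(abs(x) for x in block)
    d = amax / qmax if amax else 1.0
    return [max(-qmax - 1, min(qmax, round(x / d))) * d for x in block]


def quant_scale_and_min(block, bits=4):
    """_1-style idea: scale plus per-block minimum, levels span [min, max] of the block."""
    levels = 2 ** bits - 1  # 15 steps for 4 bits
    lo, hi = min(block), max(block)
    d = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((x - lo) / d) * d for x in block]


def max_error(block, dequantized):
    """Largest absolute reconstruction error over the block."""
    return max(abs(a - b) for a, b in zip(block, dequantized))


if __name__ == "__main__":
    # An all-positive block: 32 evenly spaced values between 1.0 and 2.0.
    block = [1.0 + i / 31 for i in range(32)]
    print("scale only     :", max_error(block, quant_scale_only(block)))
    print("scale + minimum:", max_error(block, quant_scale_and_min(block)))
```

On data that is roughly symmetric around zero the two schemes end up much closer, which is why the extra value pays off only "in many cases" rather than always.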

Anbeeld@reddit

Yeah I'm just saying it's not the best option for the price, because q4 drags down this pairing quite a lot.

jrodder@reddit

I'm pretty sure my llama reverted back to CPU with that combo too. I ended up having to keep 8

Anbeeld@reddit

Well yes, you need the compile flag in any case. I was talking about quantization quality only.

tmvr@reddit

Forget about q4 altogether. If you really need to quantize the KV cache because you need space for more context, then stick to q8/q8 and that's it.

hurdurdur7@reddit

You forgot to describe what you are even using the model for. For coding you shouldn't go under q8 anyway, and preferably stay at fp16 (once you get past hello worlds, the time you save with q8 you will pay for in debugging and rerunning tests). If you do creative work, you're probably fine with q8 or q4 or even q5_1 as suggested here in the comments.

ParaboloidalCrest@reddit

I'd downgrade model weights to an abysmal iq2_xxs before quantizing cache. Change my mind.
