Same 4 bits. Very different quality. (quant.cpp vs llama.cpp KV compression)
Posted by Suitable-Song-302@reddit | LocalLLaMA | 11 comments

Both use 4-bit KV quantization. One breaks the model, the other doesn't.
The difference is how you quantize. llama.cpp applies the same Q4_0 scheme to both keys and values. quant.cpp quantizes them independently — per-block min-max (128 elements) for keys, Q4 with per-block scales for values. Outliers stay local instead of corrupting the whole tensor.
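The per-block min-max scheme described above can be sketched in C. This is an illustrative reconstruction of the idea, not the quant.cpp API: each 128-element block stores its own min and scale, so an outlier only widens the quantization range of its own block instead of the whole tensor.

```c
#include <assert.h>
#include <stdint.h>

#define QBLOCK 128  /* block size described in the post; llama.cpp Q4_0 uses 32 */

/* One quantized block: per-block min (zero-point) + scale, 4 bits per element.
 * Struct and function names are hypothetical, for illustration only. */
typedef struct {
    float   min;            /* block minimum */
    float   scale;          /* (max - min) / 15 */
    uint8_t q[QBLOCK / 2];  /* two 4-bit codes per byte */
} block_q4_minmax;

static void quantize_block(const float *x, block_q4_minmax *b) {
    float mn = x[0], mx = x[0];
    for (int i = 1; i < QBLOCK; i++) {
        if (x[i] < mn) mn = x[i];
        if (x[i] > mx) mx = x[i];
    }
    b->min   = mn;
    b->scale = (mx - mn) / 15.0f;
    float inv = b->scale > 0.0f ? 1.0f / b->scale : 0.0f;
    for (int i = 0; i < QBLOCK; i += 2) {
        int lo = (int)((x[i]     - mn) * inv + 0.5f);  /* round to 0..15 */
        int hi = (int)((x[i + 1] - mn) * inv + 0.5f);
        b->q[i / 2] = (uint8_t)(lo | (hi << 4));
    }
}

static void dequantize_block(const block_q4_minmax *b, float *y) {
    for (int i = 0; i < QBLOCK; i += 2) {
        y[i]     = b->min + (float)(b->q[i / 2] & 0x0F) * b->scale;
        y[i + 1] = b->min + (float)(b->q[i / 2] >> 4)   * b->scale;
    }
}
```

Storage per block is 128 × 4 bits + two floats of metadata; with f16 metadata that amortizes to 4.25 bits per element.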
Result on WikiText-2 (SmolLM2 1.7B):
- llama.cpp Q4_0 KV: PPL +10.6% (noticeable degradation)
- quant.cpp 4-bit: PPL +0.0% (within measurement noise)
- quant.cpp 3-bit delta: PPL +1.3% (stores key differences like video P-frames)
What this means in practice: on a 16GB Mac with Llama 3.2 3B, llama.cpp runs out of KV memory around 50K tokens. quant.cpp compresses KV 6.9x and extends that to ~350K tokens — with zero quality loss.
Not trying to replace llama.cpp; llama.cpp is faster. But if context length is your bottleneck, this is the only engine that compresses KV without destroying it.
72K LOC of pure C, zero dependencies. Also ships as a single 15K-line header file you can drop into any C project.
Source: github.com/quantumaikr/quant.cpp
ttkciar@reddit
Violates Rule Four: Self-promotion
chimpera@reddit
What's the relationship to https://github.com/cenconq25/delta-compress-llm?
Suitable-Song-302@reddit (OP)
No relationship. I'm not familiar with that project — just looked at the repo and it appears to be a different approach (applying delta compression to model weights rather than KV cache).
quant.cpp compresses the KV cache at runtime — the key and value vectors that accumulate during inference. The model weights themselves are loaded from standard GGUF files and used as-is. Delta compression in our case means storing `key[t] - key[t-1]` between adjacent tokens in the same attention head, not compressing the weight tensors.
The underlying idea (delta encoding of correlated vectors) is the same, but applied to completely different data.
chimpera@reddit
It is used on the KV cache, not the model.
audioen@reddit
This is not even correct. llama.cpp can apply separate quantization types to the K and V caches. llama.cpp Q4_0 is also a per-block method: it applies a single f16 scale factor to a small group of weights. If memory serves, that group is 32 values, which yields fp16/32 = 0.5 additional bits per weight. A 4-bit quantization of a min-max range is similar to Q4_1, which is also supported within the engine and likely can be enabled with some compile option if it isn't already provided. This uses 5 bits per weight on average. A larger block size could bring that down, e.g. 128 values with f16+f16 metadata likely quantizes to 4.25 bits per weight.
For now, llama.cpp users should probably use q8_0 when under memory pressure, and maybe dip to q4_0 for the V cache, which is generally tested as being less critical. It isn't as good 4-bit as this claims to be, but it is available here and now. Comparing against a 16-bit baseline is not really honest, but I chalk this up to ignorance. After all, you don't seem to know how Q4_0 works ("outliers stay local instead of corrupting the whole tensor"), nor are you aware that the K and V caches have separate quantization types that can be set with --cache-type-k and --cache-type-v.
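The bits-per-weight figures above follow from one formula: payload bits plus per-block metadata amortized over the block size. A quick check of the arithmetic (storage accounting only, not llama.cpp code):

```c
#include <assert.h>

/* Effective storage cost of a block-quantized format:
 * payload bits per element + metadata bits divided across the block. */
static float bits_per_element(int payload_bits, int meta_bits, int block_size) {
    return (float)payload_bits + (float)meta_bits / (float)block_size;
}
```

Q4_0 (one f16 scale per 32 elements) comes to 4.5 bits, Q4_1 (f16 scale + f16 min) to 5.0, and a 128-element block with f16 min and max to 4.25.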
Look_0ver_There@reddit
With the recent KV cache rotation changes, Q8_0 for K, and Q5_0 for V was looking to be about the best tradeoff for quality. Not sure about speed though.
Suitable-Song-302@reddit (OP)
That makes sense — keeping K at higher precision is exactly the right call, since attention scores are more sensitive to key quantization error than value quantization error. Q8_0 K + Q5_0 V gives you ~1.6x compression with minimal quality loss.
quant.cpp's pitch at that point becomes: if 1.6x is enough, use llama.cpp — it's faster. If you need 4-7x (extending 50K context to 200K+), that's where 4-bit K + Q4 V and delta compression come in. Different operating points on the compression-quality curve.
I should add this nuance to the comparison. Thanks for bringing up the KV rotation work — haven't benchmarked against it yet.
Look_0ver_There@reddit
The results I'm referring to are here: https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4146397570
The KLD does suffer a bit at K=Q8_0/V=Q5_0, but PPL is almost the same as F16/F16. Obviously stick with Q8_0 on both for the best quality, but if you need to penny-pinch that last GB, then it looks best not to drop V below Q5_0.
Suitable-Song-302@reddit (OP)
You're right on several points and I should correct the post.
What I got wrong: llama.cpp Q4_0 *is* per-block (32 elements per block, 1 FP16 scale), not per-tensor. And llama.cpp can apply separate quant types to K and V — that's not a quant.cpp-only feature. The original wording overstated the difference. I'll fix it.
What is different:
- Block size: Q4_0 uses 32-element blocks. quant.cpp uses 128-element blocks with both min and max (effectively Q4_1-style at wider blocks). The larger block amortizes scale overhead better (4.25 bits/element vs Q4_0's 4.5 or Q4_1's 5.0), but the quality difference comes more from the min-max vs zero-point approach on key distributions specifically.
- Delta compression: This is the part llama.cpp genuinely doesn't have. Storing `key[t] - key[t-1]` instead of absolute keys reduces the dynamic range by ~70%, which is why 3-bit works at +1.3% PPL where absolute 3-bit gives +62%. This is the novel contribution from the TurboQuant paper, not the 4-bit uniform quantization itself.
- The PPL +10.6% number: This was measured with Q4_0 on both K and V using the default llama.cpp KV quant path. You're right that Q8_0 K + Q4_0 V (or Q5_0 V) would be significantly better. I should benchmark that specific config and update the comparison to be fair.
Fair criticism. The honest comparison is: at the same total bit budget, quant.cpp's approach preserves more quality. But the original post made it sound like llama.cpp's quantization is fundamentally broken, which isn't true — it's just a different tradeoff with coarser granularity.
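The delta step discussed in this thread reduces to a residual along the time axis of one attention head: encode `key[t] - key[t-1]`, quantize the (narrower-range) residual, and reconstruct by adding it back. A minimal sketch of just the encode/decode step — function names are illustrative, and the residual quantization itself is omitted:

```c
#include <assert.h>
#include <stddef.h>

/* Encode the current key vector as a residual against the previous token's
 * key in the same head. Adjacent keys are correlated, so the residual has a
 * much smaller dynamic range than the absolute values — which is what lets
 * a 3-bit quantizer survive where absolute 3-bit fails. */
static void delta_encode(const float *prev, const float *cur,
                         float *residual, size_t dim) {
    for (size_t i = 0; i < dim; i++)
        residual[i] = cur[i] - prev[i];
}

/* Reconstruct the current key from the previous key plus the residual
 * (in the real pipeline, the residual would be dequantized first). */
static void delta_decode(const float *prev, const float *residual,
                         float *cur, size_t dim) {
    for (size_t i = 0; i < dim; i++)
        cur[i] = prev[i] + residual[i];
}
```

Like video P-frames, decoding token t's key requires token t-1's key, so reconstruction is sequential within a head.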
hauhau901@reddit
When you use an LLM to write even your comments and replies, you lose any and all credibility that this isn't vibe-coded slop.
Suitable-Song-302@reddit (OP)
Fair enough. I do use Claude Code for development and I don't hide that. But the Reddit comments are mine - just not a native English speaker, so they probably come out sounding weirdly polished.
The code compiles, the PPL numbers are reproducible, and I just corrected the comparison after u/audioen pointed out it was unfair. Judge by that, not by how my comments read.