llama.cpp k-quants
Posted by thomas999999@reddit | LocalLLaMA | 13 comments
Hello friends,
I'm currently reading about the k-quants in llama.cpp.
I always thought they used zero-point quantization, as discussed for example here:
https://arxiv.org/pdf/2103.13630
But it seems like they only do absmax and store the block minimum instead.
Can anyone elaborate on why this is done? I assume it's because it makes inference more efficient, but why is that the case?
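For illustration, here is a toy NumPy sketch of the two block-quantization styles being compared: an asymmetric zero-point quantizer versus a scale-plus-stored-minimum quantizer (roughly the Q4_1 idea). The function names, block size, and bit width are made up for the example; this is not the actual llama.cpp code.

```python
# Toy sketch (not the actual llama.cpp code): zero-point quantization
# versus storing a float scale and a float block minimum.
import numpy as np

def quant_zero_point(x, bits=4):
    """Asymmetric quantization with an integer zero-point per block."""
    qmax = 2**bits - 1
    scale = (x.max() - x.min()) / qmax
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return scale * (q - zero_point)              # dequantized values

def quant_scale_min(x, bits=4):
    """Scale plus stored minimum (roughly the Q4_1-style block layout)."""
    qmax = 2**bits - 1
    mn = x.min()
    scale = (x.max() - mn) / qmax
    q = np.clip(np.round((x - mn) / scale), 0, qmax)
    return scale * q + mn                        # dequant is one fused multiply-add

x = np.random.randn(32).astype(np.float32)       # one block of 32 weights
print(np.abs(quant_zero_point(x) - x).mean(),
      np.abs(quant_scale_min(x) - x).mean())
```

Both are mathematically very close; the practical difference is that with a stored float minimum, dequantization is a single multiply-add per value with no zero-point handling on the integer side.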
mojojojo_24@reddit
I know I'm a year late, but I got really frustrated by the lack of proper documentation around the various quants and the importance matrix. So I spent a week reading the code and made an up-to-date YT explainer: https://youtu.be/vW30o4U9BFE?si=OIN0zVPyz5raKxUi. Also, here's a write-up (contributions are welcome!): https://github.com/iuliaturc/gguf-docs
SomeOddCodeGuy@reddit
There are two GitHub issues that may give an explanation for this:
https://github.com/ggerganov/llama.cpp/pull/1684 (scroll down to the "how" section)
and
https://github.com/ggerganov/llama.cpp/discussions/5063
A lot of this is over my head, but if I'm understanding them correctly, they chose to go with blocks for efficiency: it allows for faster inference and reduces the amount of computation needed at inference time.
But chances are someone more versed in the tech could likely get more out of it than me lol
DeltaSqueezer@reddit
Thanks. Do you know if there's a similar write-up of the IQ quants?
Vegetable_Sun_9225@reddit
The level of detail put into describing 1684 is pure class.
SomeOddCodeGuy@reddit
Yea, the folks over in the llama.cpp GitHub are amazingly thorough. I'm constantly impressed by them.
compilade@reddit
It's more complicated than that. Although this is true for the Q?_0 and Q?_1 quant types (e.g. Q8_0 is using only absmax and round-to-nearest), the k-quants have a more elaborate way to find the scale and min.

If you want to explore this fully, have a look at the make_qx_quants function in ggml-quants.c (knowing that rmse_type is always 1), which is used to find the scale of Q3_K and Q6_K (i.e. the k-quants which don't use a min, a bit like Q8_0). You'll see that absmax is used to find the initial guess of the scale, but then it's tweaked through 18 possible values and only the "best" one is kept.

For the k-quants which do have a min (Q2_K, Q4_K, and Q5_K), there's the make_qkx2_quants function, which seems to do something similar but with a min too.

These make quantization much slower than for non-k-quants (and this is partly why there's no Python re-implementation of quantization for k-quants, unlike for Q8_0; I tried reimplementing Q6_K with NumPy once, but got very low single-digit MB/s quantization speeds), but dequantization is still fast because there's no need to find ideal values, it's only masks and multiplications.

I don't really understand exactly how these functions work (because I didn't yet dive that deep into them), but at least now you know where to look.
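To make that search concrete, here is a simplified NumPy sketch of the idea described above: start from an absmax-based scale, try a handful of small perturbations of it, and for each candidate keep the round-to-nearest quantization whose (optionally weighted) squared reconstruction error is lowest. The function name, nmax, the perturbation range, and the weights argument are placeholders for the example; the real make_qx_quants differs in its details.

```python
# Simplified sketch of an absmax-plus-grid-search block scale (assumption:
# this mirrors the idea, not the exact code, of make_qx_quants).
import numpy as np

def find_block_scale(x, nmax=32, weights=None, ntry=9):
    """Pick a per-block scale by perturbing the absmax-based initial guess."""
    if weights is None:
        weights = np.ones_like(x)                # unweighted; an imatrix would go here
    amax = np.abs(x).max()
    if amax == 0:
        return 0.0, np.zeros_like(x, dtype=np.int32)

    best_err, best_scale, best_q = np.inf, 0.0, None
    for i in range(-ntry, ntry + 1):             # i = 0 is the plain absmax guess
        iscale = (nmax + 0.1 * i) / amax         # candidate inverse scale
        q = np.clip(np.round(x * iscale), -nmax, nmax - 1)
        denom = np.sum(weights * q * q)
        scale = np.sum(weights * x * q) / denom if denom > 0 else 0.0
        err = np.sum(weights * (x - scale * q) ** 2)
        if err < best_err:
            best_err, best_scale, best_q = err, scale, q.astype(np.int32)
    return best_scale, best_q

x = np.random.randn(16).astype(np.float32)       # one sub-block of weights
d, q = find_block_scale(x)
print(d, np.abs(d * q - x).mean())               # dequantization is just d * q
```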
thomas999999@reddit (OP)
Thanks a lot for the detailed answer. You seem to know a lot about this stuff, so I have more questions if you don't mind :D
One more thing I always wondered is why stuff like AdaRound or similar methods aren't used in llama.cpp?
As far as I understand, the only downside is that I need example data for the activations, but for LLMs I always have an embedding table that holds every possible input activation I can receive, so wouldn't it be possible to just pick random rows of the embedding table as calibration data?
Also, I guess the I-quants use something similar to AdaRound (or better), but if I understood correctly they still require some example input text to generate the "importance matrix" (no idea what this is).
compilade@reddit
I don't know why. Personally, I've never heard of it, so thank you for mentioning it. Seems like the paper is https://arxiv.org/abs/2004.10568 Interesting. It's always nice to read new (to me) papers. I see it's also mentioned in the paper you linked in the top-level post.
AdaRound seems to use layer-wise scales, while llama.cpp uses block-wise scales. From section 5, it seems they already also minimize the mean of the squared differences to find the layer-wise scale before applying AdaRound, which is similar (but different) to the minimization of the sum of the squared differences in k-quants.

There was a discussion regarding using random tokens in imatrix calibration datasets, but textual and/or structured data ended up being better for the attention part of the models. See this insightful comment by ikawrakow.
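As a toy illustration of the layer-wise versus block-wise distinction mentioned above, here is a short NumPy sketch where both scales are chosen by minimizing squared error over a small grid of candidates. The helper names, candidate grid, and block size of 32 are assumptions for the example, not the actual code of either AdaRound or llama.cpp.

```python
# Toy comparison: one squared-error-minimizing scale for a whole layer
# versus one per block of 32 (both picked from a small candidate grid).
import numpy as np

def rtn_error(x, scale, nmax=7):
    """Squared error of round-to-nearest quantization at a given scale."""
    q = np.clip(np.round(x / scale), -nmax, nmax)
    return np.sum((x - scale * q) ** 2)

def best_scale(x, nmax=7, ngrid=50):
    """Grid-search a scale around the absmax guess, minimizing squared error."""
    cands = np.abs(x).max() / nmax * np.linspace(0.7, 1.3, ngrid)
    errs = [rtn_error(x, s, nmax) for s in cands]
    return cands[int(np.argmin(errs))]

w = np.random.randn(4096).astype(np.float32)     # a toy weight row
layer_err = rtn_error(w, best_scale(w))          # single layer-wise scale
blocks = w.reshape(-1, 32)                       # llama.cpp-style blocks of 32
block_err = sum(rtn_error(b, best_scale(b)) for b in blocks)
print(layer_err, block_err)                      # block-wise error is typically lower
```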
The importance matrix is used to turn the sum of squared differences I mentioned earlier into a weighted sum of squared differences, although it's applied with some math involving sigma and square roots (the quant_weights variable is the imatrix vector). I'm not sure where this formula comes from. Hopefully someone with a more formal statistics background would know.

It does look somewhat similar (but not quite identical) to AdaRound.
The importance matrix seems to be calculated by taking the column-wise means of the squares of all the elements ever matrix-multiplied (before the matmul) with a given weight tensor, over a given calibration dataset (but it's also multiplied by the number of matmul calls involving that weight tensor, and I'm not sure what effect that has). It seems like the imatrix is actually a vector with the same number of elements as there are in a row of the associated tensor, for each tensor (with an extra dimension for the tensors used in the indirect matrix multiplications of MoE models, to separate the experts).
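Putting those two pieces together, here is a rough NumPy sketch of how such a vector could be accumulated and then used as per-column weights in a weighted squared error. The helper names are hypothetical, and this ignores the sigma/square-root math mentioned above; it only illustrates the column-wise mean-of-squares idea.

```python
# Hypothetical sketch: accumulate an imatrix-like vector from calibration
# activations, then use it to weight the quantization error of one weight row.
import numpy as np

def accumulate_imatrix(activation_batches):
    """Column-wise mean of squared activations fed into one weight tensor,
    scaled by the number of matmul calls (as described above)."""
    sums, n_tokens, n_calls = 0.0, 0, 0
    for a in activation_batches:                 # a: (tokens, n_embd) matmul inputs
        sums = sums + (a ** 2).sum(axis=0)
        n_tokens += a.shape[0]
        n_calls += 1
    return n_calls * sums / n_tokens

def weighted_sq_error(x, scale, q, quant_weights):
    """Weighted sum of squared differences between x and its dequantization."""
    return np.sum(quant_weights * (x - scale * q) ** 2)

batches = [np.random.randn(8, 64) for _ in range(3)]   # toy calibration batches
imatrix = accumulate_imatrix(batches)                  # one weight per column
row = np.random.randn(64)
q = np.clip(np.round(row / 0.1), -7, 7)
print(weighted_sq_error(row, 0.1, q, imatrix))
```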
I'm thinking I should probably write some explanations regarding k-quants and i-quants in the llama.cpp wiki page on tensor encoding schemes (which is missing a lot of info on exactly this), at least for the parts where I'm confident about my understanding. But it might take some time; I have other stuff to finish (notably, Jamba support, and 1.625 bpw ternary packing for BitNet b1.58).
duruixuan@reddit
Hi, I tried to write out the math behind the k-quants by staring a bit at the code and trying to convince myself of the reasoning behind some of the steps: https://github.com/AghaDurrani/llamacpp/blob/main/llamacpp.pdf Comments/feedback would be very much appreciated.
troposfer@reddit
Is there a beginner-friendly document about k-quants?
Necessary-Donkey5574@reddit
ChatGPT told me it decreases storage overhead by needing fewer auxiliary parameters because you’re storing only the single absolute maximum with minimums for each block instead of the scaling factor for each block and zero point for each block. This then means fewer memory accesses during dequantization.
I hardly know anything about quantization so idk if this is just totally wrong.
bgighjigftuik@reddit
There is basically no good info online on how the quants related to llama.cpp are created, or on their rationale over other methods.
People just decide what to use through trial and error, but to me that is insufficient. I actually want to know what I am running (this is the only way I can think of to even attempt to reproduce results)
noneabove1182@reddit
Upvoted because it's a great question and I wish I had an answer
My best guess would be something along the lines of what you're speculating: efficiency and reliability/repeatability.
All I know for sure is that back when GPTQ and GGUF (or GGML at the time) were competing for market share, we pretty collectively thought of GGUF as a poor man's quantization: how could round-to-nearest with absmax and block minimums do anything useful? And yet test after test shows that the level of quality maintained by this "naive" approach is extremely high, and it's basically the de facto quantization option.