llama.cpp k-quants
Posted by thomas999999@reddit | LocalLLaMA | 13 comments
Hello friends,
I'm currently reading about the k-quants in llama.cpp.
I always thought they used zero-point quantization, as discussed for example here:
https://arxiv.org/pdf/2103.13630
But it seems like they only do absmax and store the block minimum instead.
Can anyone elaborate on why this is done? I assume it's because it makes inference more efficient, but why is that the case?
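For illustration, here is a toy NumPy sketch of the two block-quantization styles being compared: an asymmetric zero-point quantizer versus a scale-plus-stored-minimum quantizer (roughly the Q4_1 idea). The function names, block size, and bit width are made up for the example; this is not the actual llama.cpp code.

```python
# Toy sketch (not the actual llama.cpp code): zero-point quantization
# versus storing a float scale and a float block minimum.
import numpy as np

def quant_zero_point(x, bits=4):
    """Asymmetric quantization with an integer zero-point per block."""
    qmax = 2**bits - 1
    scale = (x.max() - x.min()) / qmax
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return scale * (q - zero_point)              # dequantized values

def quant_scale_min(x, bits=4):
    """Scale plus stored minimum (roughly the Q4_1-style block layout)."""
    qmax = 2**bits - 1
    mn = x.min()
    scale = (x.max() - mn) / qmax
    q = np.clip(np.round((x - mn) / scale), 0, qmax)
    return scale * q + mn                        # dequant is one fused multiply-add

x = np.random.randn(32).astype(np.float32)       # one block of 32 weights
print(np.abs(quant_zero_point(x) - x).mean(),
      np.abs(quant_scale_min(x) - x).mean())
```

Both are mathematically very close; the practical difference is that with a stored float minimum, dequantization is a single multiply-add per value with no zero-point handling on the integer side.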
mojojojo_24@reddit
I know I'm a year late, but I got really frustrated by the lack of proper documentation around the various quants and the importance matrix. So I spent a week reading the code and made an up-to-date YT explainer: https://youtu.be/vW30o4U9BFE?si=OIN0zVPyz5raKxUi. Also, here's a write-up (contributions are welcome!): https://github.com/iuliaturc/gguf-docs
SomeOddCodeGuy@reddit
There are two GitHub issues that may give an explanation for this:
https://github.com/ggerganov/llama.cpp/pull/1684 (scroll down to the "how" section)
and
https://github.com/ggerganov/llama.cpp/discussions/5063
A lot of this is over my head, but if I'm understanding them correctly, they chose to go with blocks for efficiency: it allows for faster inference and reduces the amount of computation needed at inference time.
But chances are someone more versed in the tech could likely get more out of it than me lol
DeltaSqueezer@reddit
Thanks. Do you know if there's a similar write-up of the IQ quants?
Vegetable_Sun_9225@reddit
The level of detail put into describing 1684 is pure class.
SomeOddCodeGuy@reddit
Yea, the folks over in the llama.cpp GitHub are amazingly thorough. I'm constantly impressed by them.
compilade@reddit
It's more complicated than that. Although this is true for the Q?_0 and Q?_1 quant types (e.g. Q8_0 is using only absmax and round-to-nearest), the k-quants have a more elaborate way to find the scale and min.

If you want to explore this fully, have a look at the make_qx_quants function in ggml-quants.c (knowing that rmse_type is always 1), which is used to find the scale of Q3_K and Q6_K (i.e. the k-quants which don't use a min, a bit like Q8_0). You'll see that absmax is used to find the initial guess of the scale, but then it's tweaked through 18 possible values and only the "best" one is kept.

For the k-quants which do have a min (Q2_K, Q4_K, and Q5_K), there's the make_qkx2_quants function, which seems to do something similar but with a min too.

These make quantization much slower than for non-k-quants (and this is partly why there's no Python re-implementation of quantization for k-quants, unlike for Q8_0; I tried reimplementing Q6_K with NumPy once, but got very low single-digit MB/s quantization speeds), but dequantization is still fast because there's no need to find ideal values, it's only masks and multiplications.

I don't really understand exactly how these functions work (because I didn't yet dive that deep into them), but at least now you know where to look.
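To make that search concrete, here is a simplified NumPy sketch of the idea described above: start from an absmax-based scale, try a handful of small perturbations of it, and for each candidate keep the round-to-nearest quantization whose (optionally weighted) squared reconstruction error is lowest. The function name, nmax, the perturbation range, and the weights argument are placeholders for the example; the real make_qx_quants differs in its details.

```python
# Simplified sketch of an absmax-plus-grid-search block scale (assumption:
# this mirrors the idea, not the exact code, of make_qx_quants).
import numpy as np

def find_block_scale(x, nmax=32, weights=None, ntry=9):
    """Pick a per-block scale by perturbing the absmax-based initial guess."""
    if weights is None:
        weights = np.ones_like(x)                # unweighted; an imatrix would go here
    amax = np.abs(x).max()
    if amax == 0:
        return 0.0, np.zeros_like(x, dtype=np.int32)

    best_err, best_scale, best_q = np.inf, 0.0, None
    for i in range(-ntry, ntry + 1):             # i = 0 is the plain absmax guess
        iscale = (nmax + 0.1 * i) / amax         # candidate inverse scale
        q = np.clip(np.round(x * iscale), -nmax, nmax - 1)
        denom = np.sum(weights * q * q)
        scale = np.sum(weights * x * q) / denom if denom > 0 else 0.0
        err = np.sum(weights * (x - scale * q) ** 2)
        if err < best_err:
            best_err, best_scale, best_q = err, scale, q.astype(np.int32)
    return best_scale, best_q

x = np.random.randn(16).astype(np.float32)       # one sub-block of weights
d, q = find_block_scale(x)
print(d, np.abs(d * q - x).mean())               # dequantization is just d * q
```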
thomas999999@reddit (OP)
Thanks a lot for the detailed answer. You seem to know a lot about this stuff, so I have more questions if you don't mind :D
One more thing I always wondered is why stuff like AdaRound or similar methods aren't used in llama.cpp?
As far as I understand, the only downside is that I need example data for the activations, but for LLMs I always have an embedding table that holds every possible input activation I can receive, so wouldn't it be possible to just pick random rows of the embedding table as calibration data?
Also, I guess the I-quants use something similar to AdaRound (or better), but if I understood correctly they still require some example input text to generate the "importance matrix" (no idea what this is).
compilade@reddit
I don't know why. Personally, I've never heard of it, so thank you for mentioning it. Seems like the paper is https://arxiv.org/abs/2004.10568 Interesting. It's always nice to read new (to me) papers. I see it's also mentioned in the paper you linked in the top-level post.
AdaRound seems to use layer-wise scales, while llama.cpp uses block-wise scales. From section 5, it seems they already also minimize the mean of the squared differences to find the layer-wise scale before applying AdaRound, which is similar (but different) to the minimization of the sum of the squared differences in k-quants.

There was a discussion regarding using random tokens in imatrix calibration datasets, but textual and/or structured data ended up being better for the attention part of the models. See this insightful comment by ikawrakow.
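As a toy illustration of the layer-wise versus block-wise distinction mentioned above, here is a short NumPy sketch where both scales are chosen by minimizing squared error over a small grid of candidates. The helper names, candidate grid, and block size of 32 are assumptions for the example, not the actual code of either AdaRound or llama.cpp.

```python
# Toy comparison: one squared-error-minimizing scale for a whole layer
# versus one per block of 32 (both picked from a small candidate grid).
import numpy as np

def rtn_error(x, scale, nmax=7):
    """Squared error of round-to-nearest quantization at a given scale."""
    q = np.clip(np.round(x / scale), -nmax, nmax)
    return np.sum((x - scale * q) ** 2)

def best_scale(x, nmax=7, ngrid=50):
    """Grid-search a scale around the absmax guess, minimizing squared error."""
    cands = np.abs(x).max() / nmax * np.linspace(0.7, 1.3, ngrid)
    errs = [rtn_error(x, s, nmax) for s in cands]
    return cands[int(np.argmin(errs))]

w = np.random.randn(4096).astype(np.float32)     # a toy weight row
layer_err = rtn_error(w, best_scale(w))          # single layer-wise scale
blocks = w.reshape(-1, 32)                       # llama.cpp-style blocks of 32
block_err = sum(rtn_error(b, best_scale(b)) for b in blocks)
print(layer_err, block_err)                      # block-wise error is typically lower
```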
The importance matrix is used to turn the sum of squared differences I mentioned earlier into a weighted sum of squared differences, although it's applied with some math involving sigma and square roots (the quant_weights variable is the imatrix vector). I'm not sure where this formula comes from. Hopefully someone with a more formal statistics background would know.

It does look somewhat similar (but not quite identical) to AdaRound.
The importance matrix seems to be calculated by taking the column-wise means of the squares of all the elements ever matrix-multiplied (before the matmul) with a given weight tensor, over a given calibration dataset (but it's also multiplied by the number of matmul calls involving that weight tensor, and I'm not sure what effect that has). It seems like the imatrix is actually a vector with the same number of elements as there are in a row of the associated tensor, for each tensor (with an extra dimension for the tensors used in the indirect matrix multiplications of MoE models, to separate the experts).
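Putting those two pieces together, here is a rough NumPy sketch of how such a vector could be accumulated and then used as per-column weights in a weighted squared error. The helper names are hypothetical, and this ignores the sigma/square-root math mentioned above; it only illustrates the column-wise mean-of-squares idea.

```python
# Hypothetical sketch: accumulate an imatrix-like vector from calibration
# activations, then use it to weight the quantization error of one weight row.
import numpy as np

def accumulate_imatrix(activation_batches):
    """Column-wise mean of squared activations fed into one weight tensor,
    scaled by the number of matmul calls (as described above)."""
    sums, n_tokens, n_calls = 0.0, 0, 0
    for a in activation_batches:                 # a: (tokens, n_embd) matmul inputs
        sums = sums + (a ** 2).sum(axis=0)
        n_tokens += a.shape[0]
        n_calls += 1
    return n_calls * sums / n_tokens

def weighted_sq_error(x, scale, q, quant_weights):
    """Weighted sum of squared differences between x and its dequantization."""
    return np.sum(quant_weights * (x - scale * q) ** 2)

batches = [np.random.randn(8, 64) for _ in range(3)]   # toy calibration batches
imatrix = accumulate_imatrix(batches)                  # one weight per column
row = np.random.randn(64)
q = np.clip(np.round(row / 0.1), -7, 7)
print(weighted_sq_error(row, 0.1, q, imatrix))
```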
I'm thinking I should probably write some explanations regarding k-quants and i-quants in the llama.cpp wiki page on tensor encoding schemes (which is missing a lot of info on exactly this), at least for the parts where I'm confident about my understanding. But it might take some time; I have other stuff to finish (notably, Jamba support, and 1.625 bpw ternary packing for BitNet b1.58).
duruixuan@reddit
Hi, I tried to write out the math behind the k-quants by staring a bit at the code and trying to convince myself of the reasoning behind some of the steps: https://github.com/AghaDurrani/llamacpp/blob/main/llamacpp.pdf Comments/feedback would be very much appreciated.
troposfer@reddit
Is there a beginner-friendly document about k-quants?
Necessary-Donkey5574@reddit
ChatGPT told me it decreases storage overhead by needing fewer auxiliary parameters because you’re storing only the single absolute maximum with minimums for each block instead of the scaling factor for each block and zero point for each block. This then means fewer memory accesses during dequantization.
I hardly know anything about quantization so idk if this is just totally wrong.
bgighjigftuik@reddit
There is basically no good info online on how the quants related to llama.cpp are created, or on their rationale over other methods.
People just decide what to use through trial and error, but to me that is insufficient. I actually want to know what I am running (this is the only way I can think of to even attempt to reproduce results)
noneabove1182@reddit
Upvoted because it's a great question and I wish I had an answer
My best guess would be something along the lines of what you're speculating: efficiency and reliability/repeatability.
All I know for sure is that back when GPTQ and GGUF (or GGML at the time) were competing for market share, we pretty collectively thought of GGUF as a poor man's quantization: how could round-to-nearest with absmax and block minimums do anything useful? And yet test after test shows that the level of quality maintained by this "naive" approach is extremely high, and it's basically the de facto quantization option.