llama.cpp k-quants

Posted by thomas999999@reddit | LocalLLaMA | 13 comments

Hello friends,

I'm currently reading about the k-quants in llama.cpp.

I always thought they used zero-point quantization, as discussed here for example:
https://arxiv.org/pdf/2103.13630

But it seems like they just compute a per-block scale (absmax style) and store the block minimum as a float instead of using an integer zero-point.
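To make the difference concrete, here is a rough sketch of the two schemes as I understand them (simplified 4-bit, 32-weight blocks; the names, block size and rounding details are mine, not the actual llama.cpp structs or functions):

```c
#include <math.h>
#include <stdint.h>

// Simplified per-block 4-bit quantizers, for illustration only --
// this is NOT the real llama.cpp quantization code.
#define BLOCK 32

// (a) zero-point style: float scale + integer zero-point,
//     dequant: x ~= s * (q - z)
static void quantize_zeropoint(const float *x, uint8_t *q, float *s, int *z) {
    float lo = x[0], hi = x[0];
    for (int i = 1; i < BLOCK; ++i) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    float range = hi - lo;
    *s = range > 0.0f ? range / 15.0f : 1.0f;   // 4-bit codes 0..15
    *z = (int)roundf(-lo / *s);                 // integer offset
    for (int i = 0; i < BLOCK; ++i) {
        int v = (int)roundf(x[i] / *s) + *z;
        q[i] = (uint8_t)(v < 0 ? 0 : v > 15 ? 15 : v);
    }
}

// (b) scale + float minimum (roughly what Q4_1 and the k-quant "min"
//     variants seem to store): dequant: x ~= d * q + m,
//     with m kept as a float, no integer zero-point
static void quantize_scale_min(const float *x, uint8_t *q, float *d, float *m) {
    float lo = x[0], hi = x[0];
    for (int i = 1; i < BLOCK; ++i) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    float range = hi - lo;
    *m = lo;
    *d = range > 0.0f ? range / 15.0f : 1.0f;
    for (int i = 0; i < BLOCK; ++i) {
        int v = (int)roundf((x[i] - lo) / *d);
        q[i] = (uint8_t)(v < 0 ? 0 : v > 15 ? 15 : v);
    }
}
```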

Can anyone elaborate on why this is done? I assume it's because it makes inference more efficient, but why is this the case?
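To illustrate what I mean by "more efficient": my (possibly wrong) understanding is that with a float minimum the inner loop can stay integer-only on the quants, and the minimum folds in as a single correction term per block. A hedged sketch of that identity (names and layout are mine, not the actual ggml vec_dot kernels):

```c
#include <stdint.h>

#define BLOCK 32

// Sketch of a dot product between a weight block stored as
// "scale d + float min m + unsigned 4-bit quants q" and an int8-quantized
// activation block "scale da + int8 quants aq". Not the real ggml code.
//
//   w_i ~= d * q_i + m          a_i ~= da * aq_i
//   sum_i w_i*a_i ~= d*da * sum_i(q_i*aq_i) + m*da * sum_i(aq_i)
//
// so the inner loop is a pure integer dot product, and the minimum costs
// one extra multiply-add per block instead of a per-weight correction.
static float vec_dot_block(float d, float m, const uint8_t *q,
                           float da, const int8_t *aq) {
    int32_t sum_qa = 0;   // integer dot product of the quants
    int32_t sum_a  = 0;   // sum of activation quants (could be precomputed)
    for (int i = 0; i < BLOCK; ++i) {
        sum_qa += (int32_t)q[i] * aq[i];
        sum_a  += aq[i];
    }
    return d * da * (float)sum_qa + m * da * (float)sum_a;
}
```

But that reasoning would seem to apply to an integer zero-point too, which is why I'm asking.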