There is no proper explanation of GGUF quantization methods

Posted by Free_Significance267@reddit | LocalLLaMA

  1. This question has been asked many times before, and I have read many of the references people gave in response. Almost all of them just repeat a one-line summary, e.g.: "Q4_K 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.", which does not resolve the ambiguity, so people keep asking the same question over and over.

At least https://huggingface.co/docs/hub/en/gguf#quantization-types provides a list, but almost nowhere is there a precise formulation that would let a newcomer see exactly how the super-blocks and blocks are formed and how the super-block values are applied to the blocks.
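For reference, the 4.5 bpw figure in the quoted line can be checked with a few lines of arithmetic. This is only a sketch: it assumes that, on top of the 4-bit quants and the 6-bit scales and mins, each super-block also stores two fp16 factors (called d and dmin here), which appears to match the block_q4_K struct in llama.cpp. The constant names are made up for illustration.

```python
# Bit budget of one Q4_K super-block, following the quoted one-line description.
WEIGHTS_PER_BLOCK = 32
BLOCKS_PER_SUPERBLOCK = 8
QUANT_BITS = 4        # bits per quantized weight
SCALE_MIN_BITS = 6    # bits per block scale and per block min
FP16_BITS = 16        # assumption: two fp16 super-block factors, d and dmin

n_weights = WEIGHTS_PER_BLOCK * BLOCKS_PER_SUPERBLOCK           # 256 weights
quant_bits = n_weights * QUANT_BITS                             # 4-bit quants
scale_min_bits = BLOCKS_PER_SUPERBLOCK * 2 * SCALE_MIN_BITS     # 8 scales + 8 mins
super_bits = 2 * FP16_BITS                                      # d and dmin

total_bits = quant_bits + scale_min_bits + super_bits
bits_per_weight = total_bits / n_weights

print(total_bits // 8, "bytes per super-block")
print(bits_per_weight, "bits per weight")
```

Under these assumptions the extra 0.5 bit per weight is entirely metadata: the per-block scales and mins plus the two super-block factors.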

  2. To add to my confusion, in https://github.com/ggerganov/llama.cpp/discussions/5063 there is a discussion about block-wise vs. row-wise implementation, where the author mentions: "The row- vs block-wise approach is independent of non-linear quantization. Yes, when you go to row-wise quantization, there is no need the number of columns to be divisible by 32 or 256. But in practice I assume that there is at least divisibility by 32, else the implementation becomes too cumbersome. But to my knowledge, all LLM's currently out there are divisible by at least 32 (if not even 64)."

This makes me wonder whether I understood the meaning of "block" correctly, because elsewhere it is mentioned that, e.g., the legacy Q8_0 format uses blocks of 32 quants. So what is the difference between row-wise quantization and block-wise quantization? Is it just an implementation detail, i.e. that blocks can continue into the next row if the number of columns is not divisible by 32? Or is there a more important concept that I am missing?
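To make the "blocks of 32 quants" idea concrete, here is a minimal sketch of Q8_0-style block quantization of a single row. It assumes that blocks are cut from consecutive values within one row and never span two rows, which would explain the divisibility-by-32 requirement. The function names are hypothetical and this is not the llama.cpp code.

```python
def quantize_q8_0_row(row, block_size=32):
    """Split one row into blocks; each block gets one scale and int8 quants."""
    if len(row) % block_size != 0:
        raise ValueError("row length must be a multiple of the block size")
    blocks = []
    for start in range(0, len(row), block_size):
        chunk = row[start:start + block_size]
        amax = max(abs(x) for x in chunk)
        d = amax / 127.0                 # one scale per block (fp16 in the file)
        inv = 1.0 / d if d else 0.0
        # Python's round() is half-to-even; C's roundf() differs only on exact ties.
        q = [round(x * inv) for x in chunk]
        blocks.append((d, q))
    return blocks


def dequantize_q8_0_row(blocks):
    """Reconstruct the row: each weight is block scale times its integer quant."""
    return [d * qi for d, q in blocks for qi in q]


# Deterministic test row of 64 values, i.e. two blocks of 32.
row = [((i * 37) % 101 - 50) / 25.0 for i in range(64)]
blocks = quantize_q8_0_row(row)
restored = dequantize_q8_0_row(blocks)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(blocks), "blocks, max abs error", max_err)
```

In this reading, "block-wise" means one scale per 32 consecutive values of a row, while "row-wise" would mean sharing a scale across the whole row.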

  3. Could someone please provide a step-by-step formulation of how, for example, Q4_K quantization forms the super-blocks and then the blocks inside them, with detailed formulas for how the values are calculated?
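One possible reading of the one-line Q4_K description is a two-level scheme: each block of 32 weights gets a scale and a min, and those 8 scales and 8 mins are themselves quantized to 6 bits against two super-block factors. The sketch below follows that reading under loud assumptions: it uses a plain min/max fit per block instead of the error-minimizing search that llama.cpp appears to use, it skips the bit packing, and all function names are hypothetical.

```python
import math


def quantize_q4_k_superblock(x):
    """Simplified Q4_K-style quantization of one super-block of 256 floats."""
    assert len(x) == 256
    blocks = [x[i:i + 32] for i in range(0, 256, 32)]

    # Step 1: per-block float scale and min, for the model w ~ scale*q - min, q in 0..15.
    scales, mins = [], []
    for b in blocks:
        lo = min(min(b), 0.0)             # clamp so the stored min is never negative
        hi = max(b)
        scales.append((hi - lo) / 15.0)
        mins.append(-lo)

    # Step 2: quantize the 8 scales and 8 mins to 6 bits (0..63) with two factors.
    d = max(scales) / 63.0
    dmin = max(mins) / 63.0
    sc = [round(s / d) if d else 0 for s in scales]
    m = [round(v / dmin) if dmin else 0 for v in mins]

    # Step 3: quantize the weights against the reconstructed block scale and min.
    q = []
    for j, b in enumerate(blocks):
        step, offset = d * sc[j], dmin * m[j]
        for w in b:
            v = round((w + offset) / step) if step else 0
            q.append(max(0, min(15, v)))  # clamp to the 4-bit range
    return d, dmin, sc, m, q


def dequantize_q4_k_superblock(d, dmin, sc, m, q):
    """Reconstruct each weight as d*sc[j]*q - dmin*m[j], j being its block index."""
    return [d * sc[i // 32] * qi - dmin * m[i // 32] for i, qi in enumerate(q)]


# Deterministic example: a sine wave with slowly growing amplitude.
x = [math.sin(0.1 * i) * (1 + i / 256) for i in range(256)]
d, dmin, sc, m, q = quantize_q4_k_superblock(x)
restored = dequantize_q4_k_superblock(d, dmin, sc, m, q)
max_err = max(abs(a - b) for a, b in zip(x, restored))
print("max abs error:", max_err)
```

In this sketch the super-block does not weight the blocks directly. It only supplies the two factors d and dmin that turn each block's 6-bit integers back into a float scale and a float min.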

Thanks again for your contributions.