There is no proper explanation of GGUF quantization methods
Posted by Free_Significance267@reddit | LocalLLaMA | View on Reddit | 14 comments
- This question has been asked multiple times before, and I have looked into a lot of the references that people have provided in answer to it, but almost all of them just repeat a single line, e.g.: "Q4_K: 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.", which does not reduce the ambiguity, so people keep asking the same question over and over.
At least https://huggingface.co/docs/hub/en/gguf#quantization-types provides a list, but almost nowhere is a precise formulation given that would let a newcomer know exactly how the super-blocks and blocks are formed and how the super-block values are applied to the blocks.
- To add to my confusion, in https://github.com/ggerganov/llama.cpp/discussions/5063 there is a discussion about block-wise vs. row-wise implementation, where the author mentions: "The row- vs block-wise approach is independent of non-linear quantization. Yes, when you go to row-wise quantization, there is no need the number of columns to be divisible by 32 or 256. But in practice I assume that there is at least divisibility by 32, else the implementation becomes too cumbersome. But to my knowledge, all LLM's currently out there are divisible by at least 32 (if not even 64)."
This makes me wonder whether I understood the meaning of "block" correctly, because elsewhere it is mentioned that, e.g., in the legacy Q8_0 format there are blocks of 32 quants. So what is the difference between this row-wise quantization and the block-wise quantization? Is it just a matter of implementation, in that blocks can continue to pick from the next row if the number of columns is not divisible by 32? Or is there a more important concept that I don't understand?
- So would it be possible for someone to please provide a step-by-step formulation of how, for example, Q4_K quantization forms the super-blocks and then the blocks inside them, and then give detailed formulas for how the values are calculated?
Again thanks for your contributions.
mojojojo_24@reddit
I know I'm a year late, but I also got really frustrated by the lack of proper documentation. So I spent a week reading the code and made an up-to-date YT explainer: https://youtu.be/vW30o4U9BFE?si=OIN0zVPyz5raKxUi. Also, here's a write-up (contributions are welcome!): https://github.com/iuliaturc/gguf-docs
121507090301@reddit
I even tried looking into the llama.cpp code to see if I could understand this stuff myself, but since I don't know enough C I couldn't follow it, so I'm interested in any explanation anyone has about it as well...
ArtyfacialIntelagent@reddit
If only there were some kind of tool that could auto-translate C code into Python and explain what is going on...
RhubarbSimilar1683@reddit
There are also some C++ features that don't translate deterministically to Python and vice versa.
121507090301@reddit
The problem is that llama.cpp is a very big project, so I couldn't even really find what I wanted to see, which was the code related to conversion, or just the inference part. I did have AI help me with what I found, but it couldn't help me find what I wanted, so I decided to take another look at it later...
Successful_Shake8348@reddit
Copy it to an AI and ask it.. I regularly use AI to help me with commands and error problems.. and it works. I'm just a script kiddy, not a real programmer.
MedicalScore3474@reddit
Before Quantization: Floating-Point Weights (BF16)
Let's say we have a layer of 2048 floating-point weights in BF16 (BFloat16) format before quantization.
After Quantization: Q4_K Format
In Q4_K, we split the weights into blocks, quantize each to 4-bit values, and store shared metadata per sub-block and block.
Step 1: Divide into Blocks and Sub-Blocks
This gives us:
8 blocks in total, with each block containing 8 sub-blocks of 32 weights.
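(In llama.cpp's own terminology this 256-weight unit is the super-block, block_q4_K, and the 32-weight groups inside it are the blocks, so the naming here is shifted by one level.) As a rough sketch of the division, not the actual llama.cpp code, the hypothetical 2048-weight layer splits like this:

```
import numpy as np

QK_K = 256        # weights per 256-weight unit (block_q4_K in llama.cpp, "super-block" in the HF docs)
SUB = 32          # weights per 32-weight group that shares one 6-bit scale and min

weights = np.random.randn(2048).astype(np.float32)   # stand-in for the 2048 BF16 weights above

# 2048 weights -> 8 units of 256, each split into 8 groups of 32
blocks     = weights.reshape(-1, QK_K)                # shape (8, 256)
sub_blocks = blocks.reshape(-1, QK_K // SUB, SUB)     # shape (8, 8, 32)
print(sub_blocks.shape)                               # (8, 8, 32)
```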
Step 2: Compute Scale and Offset for Each Block
Compute Block-Wide Scale (d) and Offset (dmin):
Compute 6-Bit Scales and Minimums for Sub-Blocks:
For example:
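Here is a hedged sketch (a simple min/max scheme, not llama.cpp's actual error-minimizing search) of how the block-wide d and dmin and the 6-bit sub-block scales and mins might be derived:

```
import numpy as np

def sub_block_scale_min(sub):
    # Per-sub-block float scale and offset so that w ≈ q * scale - m, with q in 0..15.
    lo, hi = float(sub.min()), float(sub.max())
    scale = (hi - lo) / 15.0
    m = -lo            # llama.cpp keeps this non-negative and refines both values further
    return scale, m

def block_metadata(block):
    # block: 256 weights. Quantize the 8 sub-block scales/mins to 6 bits against d and dmin.
    scales, mins = zip(*(sub_block_scale_min(sb) for sb in block.reshape(8, 32)))
    d    = max(scales) / 63.0 if max(scales) > 0 else 0.0   # stored as fp16 in the real format
    dmin = max(mins)   / 63.0 if max(mins)   > 0 else 0.0   # stored as fp16 in the real format
    q_scales = [round(s / d)    if d    > 0 else 0 for s in scales]   # 6-bit ints in 0..63
    q_mins   = [round(m / dmin) if dmin > 0 else 0 for m in mins]     # 6-bit ints in 0..63
    return d, dmin, q_scales, q_mins

block = np.random.randn(256).astype(np.float32)
d, dmin, q_scales, q_mins = block_metadata(block)
print(d, dmin, q_scales, q_mins)
```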
Step 3: Quantize Each Weight to 4 Bits
Each weight in a block is quantized to a 4-bit integer, fitting values from 0 to 15, using the formula q = round((w - min) / scale), with the result clamped to [0, 15].
Example for Block 1:
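Reusing the simple scheme above (hedged; the real code quantizes against the reconstructed 6-bit scale and min), one 32-weight sub-block of block 1 could be quantized like this:

```
import numpy as np

def quantize_sub_block(sub, scale, m):
    # q = round((w - min) / scale) = round((w + m) / scale), clamped to the 4-bit range 0..15
    if scale == 0:
        return np.zeros(len(sub), dtype=np.uint8)
    q = np.round((sub + m) / scale)
    return np.clip(q, 0, 15).astype(np.uint8)

sub   = np.random.randn(32).astype(np.float32)      # first sub-block of "Block 1"
scale = (sub.max() - sub.min()) / 15.0
m     = -float(sub.min())
q = quantize_sub_block(sub, scale, m)
print(q)                                            # 32 integers in 0..15
```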
Step 4: Pack Quantized Weights into Bytes
Since each quantized weight is 4 bits, two weights fit into one byte.
Example:
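A sketch of the packing idea (note that llama.cpp's actual nibble order interleaves each group of 64 values, putting the first 32 in the low nibbles and the next 32 in the high nibbles):

```
import numpy as np

def pack_nibbles(q):
    # Two 4-bit values per byte: first in the low nibble, second in the high nibble.
    q = np.asarray(q, dtype=np.uint8).reshape(-1, 2)
    return (q[:, 0] | (q[:, 1] << 4)).astype(np.uint8)

packed = pack_nibbles([3, 12, 7, 0])
print([hex(b) for b in packed])      # ['0xc3', '0x7']
```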
Step 5: Store the Block’s Scale and Offset Metadata
Each block’s metadata includes: the fp16 block-wide scale d and offset dmin, plus 12 bytes of packed 6-bit sub-block scales and mins.
Example:
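Concretely, the on-disk layout of one 256-weight block_q4_K in llama.cpp is two fp16 values (d and dmin), 12 bytes of packed 6-bit sub-block scales and mins, and 128 bytes of packed 4-bit weights. A rough ctypes mirror (with the fp16 fields shown as raw uint16):

```
import ctypes

QK_K = 256          # weights per block_q4_K
K_SCALE_SIZE = 12   # 8 scales + 8 mins at 6 bits each = 96 bits = 12 bytes

class BlockQ4K(ctypes.Structure):
    _pack_ = 1
    _fields_ = [
        ("d",      ctypes.c_uint16),                # fp16 scale for the 6-bit sub-block scales
        ("dmin",   ctypes.c_uint16),                # fp16 scale for the 6-bit sub-block mins
        ("scales", ctypes.c_uint8 * K_SCALE_SIZE),  # packed 6-bit scales and mins
        ("qs",     ctypes.c_uint8 * (QK_K // 2)),   # 256 weights packed two per byte
    ]

print(ctypes.sizeof(BlockQ4K))   # 144 bytes for 256 weights -> 144*8/256 = 4.5 bits per weight
```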
Final Memory Layout
Since we have 8 blocks of 144 bytes each (2 + 2 + 12 + 128), the total quantized representation for the layer is 1152 bytes.
Comparison: Before and After Quantization
For Q4_K: the 2048-weight layer goes from 4096 bytes in BF16 (16 bits per weight) to 1152 bytes (4.5 bits per weight).
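A quick check of those numbers (assuming the 2048-weight example layer and the 144-byte layout above):

```
n_weights  = 2048
bf16_bytes = n_weights * 2                  # 16 bits per weight
q4k_bytes  = (n_weights // 256) * 144       # 8 super-blocks of 144 bytes each
print(bf16_bytes, q4k_bytes)                # 4096 1152
print(q4k_bytes * 8 / n_weights)            # 4.5 bits per weight
```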
Sure. I made the assumption that a super-block contains all of the weights in one layer, though this may be incorrect. In quants.py, we see that a block contains 256 (from QK_K). As for how the values (quantized weights, scales, offsets) are calculated, see the above.
compilade@reddit
Small correction: the scales and mins are packed in 12 bytes, not 8. There are 8 sub-block scales and mins in Q4_K, taking 6 bits each, which makes (2 * 6 * 8) / 8 = 12 bytes.
MedicalScore3474@reddit
Corrected, thank you!
Some_Endian_FP17@reddit
Wait up, how does it work for more esoteric quantization formats like Q4_0_4_4 or Q4_0_4_8 where there's a mixture of weight types?
AI_is_the_rake@reddit
1. Block Formation
In Q4_K, the model weights are first divided into blocks. Each block contains 32 weights.
Why Blocks?
Blocks are structured so that all weights within each block share the same scale and min (offset) values. This shared scaling approach reduces the number of unique parameters stored, saving memory while maintaining enough information to reconstruct approximate values.
2. Quantizing Each Weight in a Block
Each weight in a block is stored using 4 bits.
Quantization Process:
The block's scale is computed as \(\text{scale} = \frac{\max - \min}{15}\); 15 is used because a 4-bit quantized number can represent 16 levels (0–15), so dividing by 15 lets the quantized values cover the full range between the min and max. Each weight \(w\) is then quantized as
\[ q = \text{round}\left(\frac{w - \text{Min}}{\text{scale}}\right) \]
Here, \(q\) is an integer in the range \([0, 15]\).
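As a made-up numerical example, for a block whose weights span min = −0.8 to max = 0.7:
\[ \text{scale} = \frac{0.7 - (-0.8)}{15} = 0.1, \qquad q = \text{round}\!\left(\frac{0.23 - (-0.8)}{0.1}\right) = \text{round}(10.3) = 10, \]
and dequantizing gives \(10 \times 0.1 + (-0.8) = 0.2 \approx 0.23\).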
3. Storing Scale and Min for Each Block
Each block's scale and min are themselves quantized (to 6 bits in Q4_K) and stored alongside the 4-bit weights, so that approximate values can be reconstructed later.
4. Organizing Blocks into Super-blocks
In Q4_K, 8 blocks are grouped together to form a single super-block. This gives each super-block a total of:
\[ 8 \text{ blocks} \times 32 \text{ weights per block} = 256 \text{ weights} \]
Why Super-blocks?
Super-blocks let the full-precision scale factors (the fp16 d and dmin) be shared across 8 blocks, while each block only stores a compact 6-bit scale and min relative to them. They also improve memory access efficiency, since the model can load and process weights in larger chunks, which speeds up computation in environments with memory access limitations.
5. Reconstructing Quantized Values
To reconstruct an approximate version of the original weight values from the quantized data:
\[ w \approx q \times \text{scale} + \text{Min} \]
where:
- scale and Min are the 6-bit quantized values retrieved for the block.
- \(q\) is the quantized 4-bit value of the weight.
This calculation approximates the original floating-point weight values by scaling and shifting according to the block’s shared scale and min.
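A minimal round-trip sketch of this reconstruction (using a single float scale and min per block, not the exact two-level Q4_K arithmetic):

```
import numpy as np

def dequantize(q, scale, m):
    # Approximate reconstruction: w ≈ q * scale + Min, where Min = -m
    return q.astype(np.float32) * scale - m

w     = np.random.randn(32).astype(np.float32)
scale = (w.max() - w.min()) / 15.0
m     = -float(w.min())
q     = np.clip(np.round((w + m) / scale), 0, 15)
w_hat = dequantize(q, scale, m)
print(np.abs(w - w_hat).max() <= scale / 2 + 1e-6)   # True: error is at most half a quantization step
```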
mantafloppy@reddit
ChatGPT's answer seems to make sense.
GGUF quantization can definitely seem cryptic, especially with all the technical jargon around "super-blocks," "blocks," and differences between row-wise and block-wise quantization. I can help break this down to clarify the terminology and the specifics of GGUF quantization, particularly the Q4_K type.
Basics of GGUF Quantization
The purpose of GGUF quantization is to reduce the size of large language models (LLMs) by compressing weights. Different types of GGUF quantization (e.g., Q4_K, Q8_0) have their own unique structures for organizing and quantizing these weights.
Quantization Levels: GGUF typically refers to the number of bits used per weight. For instance, "4-bit" quantization means weights are stored using 4 bits instead of the typical 32 bits in floating-point format.
Block Size: Blocks are smaller, fixed-size groups of weights. For instance, in many quantization types, weights are split into blocks of 32. Each block contains weight values that are grouped together for compression.
Super-Block Size: Super-blocks are larger structures that contain several blocks. For Q4_K, each super-block contains 8 blocks (each with 32 weights).
Q4_K Quantization Process
In Q4_K, the model weights are quantized into 4-bit values with the following structure:
- Block Structure: A super-block in Q4_K contains 8 blocks, each holding 32 weights.
- Scales and Minimums: Each block has associated "scale" and "minimum" values to represent the range of weights more accurately. In Q4_K, scales and minimums are quantized using 6 bits.
Here's a step-by-step breakdown of the process:
1. Quantize the weights: the weights in each block are quantized with 4 bits per weight.
2. Calculate scales and minimums: quantize this range information using 6 bits for scales and mins. This helps to map each 4-bit quantized weight back to its approximate floating-point value.
3. Weight encoding: pack the 4-bit values, two per byte.
Row-Wise vs. Block-Wise Quantization
Here's where things get trickier. Block-wise and row-wise quantization differ mainly in the granularity and flexibility of weight grouping:
- Block-Wise Quantization: each block represents a fixed group (e.g., 32 consecutive weights), regardless of the matrix's row/column layout. This rigidity can sometimes waste memory if the model's matrix isn't perfectly divisible by the block size.
- Row-Wise Quantization: the quantization metadata is computed per row of the weight matrix rather than per fixed-size block, so the number of columns does not have to be divisible by the block size.
For most large LLMs, matrices are often divisible by 32 or 64 columns, making the block-wise approach more common. Row-wise quantization, however, is useful in cases where weights are organized in a way that doesn't align with strict block sizes.
Summing It Up
To clarify:
- In Q4_K, "super-blocks" are units that group 8 blocks of 32 weights each, with 4-bit quantized values and 6-bit scales and mins.
- Block-wise quantization is fixed in structure, whereas row-wise quantization adjusts more dynamically based on rows.
If you're looking for the exact calculations or a deeper dive into the encoding math, I'd recommend looking at specific library code like llama.cpp. There, you can see how weights are packed, which can help illuminate the underlying implementation.
Free_Significance267@reddit (OP)
This still does not explain what happens regarding the super-blocks and blocks. Similar to the human responses, it mentions that a super-block contains 8 blocks. So if there is no computation involved in the super-blocks, then what is the point of defining the super-block? Does the super-block have an extra scaling weight that works in combination with its sub-block scales?
Besides, I don't think we can rely on ChatGPT results for this. Thanks though.
mantafloppy@reddit
There's a good discussion here (second comment), it might be what you are looking for.
https://www.reddit.com/r/LocalLLaMA/comments/1dved4c/llamacpp_kquants/
I don't know how much math you know; from the little I understand myself, it's advanced matrix math, and a simple answer might not be found short of exploring the code and having a good grasp of the math used.
Good luck :)