There is no proper explanation of GGUF quantization methods
Posted by Free_Significance267@reddit | LocalLLaMA | View on Reddit | 14 comments
- This question has been asked multiple times before, and I have looked into a lot of the references that people have provided in answer to it, but almost all of them just repeat a single line, e.g.: "Q4_K: 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.", which does not reduce the ambiguity, so people keep asking the same question over and over.
At least https://huggingface.co/docs/hub/en/gguf#quantization-types provides a list, but almost nowhere is a precise formulation given that would let a newcomer know exactly how the super-blocks and blocks are formed and how the super-block values are applied to the blocks.
- To add to my confusion, in https://github.com/ggerganov/llama.cpp/discussions/5063 there is a discussion about block-wise vs. row-wise implementation, where the author mentions: "The row- vs block-wise approach is independent of non-linear quantization. Yes, when you go to row-wise quantization, there is no need the number of columns to be divisible by 32 or 256. But in practice I assume that there is at least divisibility by 32, else the implementation becomes too cumbersome. But to my knowledge, all LLM's currently out there are divisible by at least 32 (if not even 64)."
This makes me wonder whether I understood the meaning of "block" correctly, because elsewhere it is mentioned that, e.g., in the legacy Q8_0 format there are blocks of 32 quants. So what is the difference between this row-wise quantization and the block-wise quantization? Is it just a matter of implementation, in that blocks can continue to pick from the next row if the number of columns is not divisible by 32? Or is there a more important concept that I don't understand?
- So would it be possible for someone to please provide a step-by-step formulation of how, for example, Q4_K quantization forms the super-blocks and then the blocks inside them, and then give detailed formulas for how the values are calculated?
Again thanks for your contributions.
mojojojo_24@reddit
I know I'm a year late, but I also got really frustrated by the lack of proper documentation. So I spent a week reading the code and made an up-to-date YT explainer: https://youtu.be/vW30o4U9BFE?si=OIN0zVPyz5raKxUi. Also, here's a write-up (contributions are welcome!): https://github.com/iuliaturc/gguf-docs
121507090301@reddit
I even tried looking into the llama.cpp code to see if I could understand this stuff myself, but since I don't know enough C I couldn't follow it, so I'm interested in any explanation anyone has about it as well...
ArtyfacialIntelagent@reddit
If only there were some kind of tool that could auto-translate C code into Python and explain what is going on...
RhubarbSimilar1683@reddit
There are also some C++ features that don't translate deterministically to Python and vice versa.
121507090301@reddit
The problem is that llama.cpp is a very big project, so I couldn't even really find what I wanted to see, which was the code related to conversion, or just the inference part. I did have AI help me with what I found, but it couldn't help me find what I wanted, so I decided to take another look at it later...
Successful_Shake8348@reddit
Copy it to an AI and ask it.. I regularly use AI to help me with commands and error problems.. and it works. I'm just a script kiddy, not a real programmer.
MedicalScore3474@reddit
Before Quantization: Floating-Point Weights (BF16)
Let's say we have a layer of 2048 floating-point weights in BF16 (BFloat16) format before quantization.
After Quantization: Q4_K Format
In Q4_K, we split the weights into blocks, quantize each to 4-bit values, and store shared metadata per sub-block and block.
Step 1: Divide into Blocks and Sub-Blocks
This gives us:
8 blocks in total, with each block containing 8 sub-blocks of 32 weights.
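(In llama.cpp's own terminology this 256-weight unit is the super-block, block_q4_K, and the 32-weight groups inside it are the blocks, so the naming here is shifted by one level.) As a rough sketch of the division, not the actual llama.cpp code, the hypothetical 2048-weight layer splits like this:

```
import numpy as np

QK_K = 256        # weights per 256-weight unit (block_q4_K in llama.cpp, "super-block" in the HF docs)
SUB = 32          # weights per 32-weight group that shares one 6-bit scale and min

weights = np.random.randn(2048).astype(np.float32)   # stand-in for the 2048 BF16 weights above

# 2048 weights -> 8 units of 256, each split into 8 groups of 32
blocks     = weights.reshape(-1, QK_K)                # shape (8, 256)
sub_blocks = blocks.reshape(-1, QK_K // SUB, SUB)     # shape (8, 8, 32)
print(sub_blocks.shape)                               # (8, 8, 32)
```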
Step 2: Compute Scale and Offset for Each Block
Compute Block-Wide Scale (d) and Offset (dmin):
Compute 6-Bit Scales and Minimums for Sub-Blocks:
For example:
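Here is a hedged sketch (a simple min/max scheme, not llama.cpp's actual error-minimizing search) of how the block-wide d and dmin and the 6-bit sub-block scales and mins might be derived:

```
import numpy as np

def sub_block_scale_min(sub):
    # Per-sub-block float scale and offset so that w ≈ q * scale - m, with q in 0..15.
    lo, hi = float(sub.min()), float(sub.max())
    scale = (hi - lo) / 15.0
    m = -lo            # llama.cpp keeps this non-negative and refines both values further
    return scale, m

def block_metadata(block):
    # block: 256 weights. Quantize the 8 sub-block scales/mins to 6 bits against d and dmin.
    scales, mins = zip(*(sub_block_scale_min(sb) for sb in block.reshape(8, 32)))
    d    = max(scales) / 63.0 if max(scales) > 0 else 0.0   # stored as fp16 in the real format
    dmin = max(mins)   / 63.0 if max(mins)   > 0 else 0.0   # stored as fp16 in the real format
    q_scales = [round(s / d)    if d    > 0 else 0 for s in scales]   # 6-bit ints in 0..63
    q_mins   = [round(m / dmin) if dmin > 0 else 0 for m in mins]     # 6-bit ints in 0..63
    return d, dmin, q_scales, q_mins

block = np.random.randn(256).astype(np.float32)
d, dmin, q_scales, q_mins = block_metadata(block)
print(d, dmin, q_scales, q_mins)
```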
Step 3: Quantize Each Weight to 4 Bits
Each weight in a block is quantized to a 4-bit integer, fitting values from 0 to 15, using the formula q = round((w - min) / scale), with the result clamped to [0, 15].
Example for Block 1:
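Reusing the simple scheme above (hedged; the real code quantizes against the reconstructed 6-bit scale and min), one 32-weight sub-block of block 1 could be quantized like this:

```
import numpy as np

def quantize_sub_block(sub, scale, m):
    # q = round((w - min) / scale) = round((w + m) / scale), clamped to the 4-bit range 0..15
    if scale == 0:
        return np.zeros(len(sub), dtype=np.uint8)
    q = np.round((sub + m) / scale)
    return np.clip(q, 0, 15).astype(np.uint8)

sub   = np.random.randn(32).astype(np.float32)      # first sub-block of "Block 1"
scale = (sub.max() - sub.min()) / 15.0
m     = -float(sub.min())
q = quantize_sub_block(sub, scale, m)
print(q)                                            # 32 integers in 0..15
```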
Step 4: Pack Quantized Weights into Bytes
Since each quantized weight is 4 bits, two weights fit into one byte.
Example:
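A sketch of the packing idea (note that llama.cpp's actual nibble order interleaves each group of 64 values, putting the first 32 in the low nibbles and the next 32 in the high nibbles):

```
import numpy as np

def pack_nibbles(q):
    # Two 4-bit values per byte: first in the low nibble, second in the high nibble.
    q = np.asarray(q, dtype=np.uint8).reshape(-1, 2)
    return (q[:, 0] | (q[:, 1] << 4)).astype(np.uint8)

packed = pack_nibbles([3, 12, 7, 0])
print([hex(b) for b in packed])      # ['0xc3', '0x7']
```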
Step 5: Store the Block’s Scale and Offset Metadata
Each block’s metadata includes: the fp16 block-wide scale d and offset dmin, plus 12 bytes of packed 6-bit sub-block scales and mins.
Example:
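Concretely, the on-disk layout of one 256-weight block_q4_K in llama.cpp is two fp16 values (d and dmin), 12 bytes of packed 6-bit sub-block scales and mins, and 128 bytes of packed 4-bit weights. A rough ctypes mirror (with the fp16 fields shown as raw uint16):

```
import ctypes

QK_K = 256          # weights per block_q4_K
K_SCALE_SIZE = 12   # 8 scales + 8 mins at 6 bits each = 96 bits = 12 bytes

class BlockQ4K(ctypes.Structure):
    _pack_ = 1
    _fields_ = [
        ("d",      ctypes.c_uint16),                # fp16 scale for the 6-bit sub-block scales
        ("dmin",   ctypes.c_uint16),                # fp16 scale for the 6-bit sub-block mins
        ("scales", ctypes.c_uint8 * K_SCALE_SIZE),  # packed 6-bit scales and mins
        ("qs",     ctypes.c_uint8 * (QK_K // 2)),   # 256 weights packed two per byte
    ]

print(ctypes.sizeof(BlockQ4K))   # 144 bytes for 256 weights -> 144*8/256 = 4.5 bits per weight
```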
Final Memory Layout
Since we have 8 blocks of 144 bytes each (2 + 2 + 12 + 128), the total quantized representation for the layer is 1152 bytes.
Comparison: Before and After Quantization
For Q4_K: the 2048-weight layer goes from 4096 bytes in BF16 (16 bits per weight) to 1152 bytes (4.5 bits per weight).
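A quick check of those numbers (assuming the 2048-weight example layer and the 144-byte layout above):

```
n_weights  = 2048
bf16_bytes = n_weights * 2                  # 16 bits per weight
q4k_bytes  = (n_weights // 256) * 144       # 8 super-blocks of 144 bytes each
print(bf16_bytes, q4k_bytes)                # 4096 1152
print(q4k_bytes * 8 / n_weights)            # 4.5 bits per weight
```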
Sure. I made the assumption that a super-block contains all of the weights in one layer, though this may be incorrect. In quants.py, we see that a block contains 256 (from QK_K). As for how the values (quantized weights, scales, offsets) are calculated, see the above.
compilade@reddit
Small correction: the scales and mins are packed in 12 bytes, not 8. There are 8 sub-block scales and mins in Q4_K, taking 6 bits each, which makes (2 * 6 * 8) / 8 = 12 bytes.
MedicalScore3474@reddit
Corrected, thank you!
Some_Endian_FP17@reddit
Wait up, how does it work for more esoteric quantization formats like Q4_0_4_4 or Q4_0_4_8 where there's a mixture of weight types?
AI_is_the_rake@reddit
1. Block Formation
In Q4_K, the model weights are first divided into blocks. Each block contains 32 weights.
Why Blocks?
Blocks are structured so that all weights within each block share the same scale and min (offset) values. This shared scaling approach reduces the number of unique parameters stored, saving memory while maintaining enough information to reconstruct approximate values.
2. Quantizing Each Weight in a Block
Each weight in a block is stored using 4 bits.
Quantization Process:
The block's scale is computed as \(\text{scale} = \frac{\max - \min}{15}\); 15 is used because a 4-bit quantized number can represent 16 levels (0–15), so dividing by 15 lets the quantized values cover the full range between the min and max. Each weight \(w\) is then quantized as
\[ q = \text{round}\left(\frac{w - \text{Min}}{\text{scale}}\right) \]
Here, \(q\) is an integer in the range \([0, 15]\).
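As a made-up numerical example, for a block whose weights span min = −0.8 to max = 0.7:
\[ \text{scale} = \frac{0.7 - (-0.8)}{15} = 0.1, \qquad q = \text{round}\!\left(\frac{0.23 - (-0.8)}{0.1}\right) = \text{round}(10.3) = 10, \]
and dequantizing gives \(10 \times 0.1 + (-0.8) = 0.2 \approx 0.23\).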
3. Storing Scale and Min for Each Block
Each block's scale and min are themselves quantized (to 6 bits in Q4_K) and stored alongside the 4-bit weights, so that approximate values can be reconstructed later.
4. Organizing Blocks into Super-blocks
In Q4_K, 8 blocks are grouped together to form a single super-block. This gives each super-block a total of:
\[ 8 \text{ blocks} \times 32 \text{ weights per block} = 256 \text{ weights} \]
Why Super-blocks?
Super-blocks let the full-precision scale factors (the fp16 d and dmin) be shared across 8 blocks, while each block only stores a compact 6-bit scale and min relative to them. They also improve memory access efficiency, since the model can load and process weights in larger chunks, which speeds up computation in environments with memory access limitations.
5. Reconstructing Quantized Values
To reconstruct an approximate version of the original weight values from the quantized data:
\[ w \approx q \times \text{scale} + \text{Min} \]
where:
- scale and Min are the 6-bit quantized values retrieved for the block.
- \(q\) is the quantized 4-bit value of the weight.
This calculation approximates the original floating-point weight values by scaling and shifting according to the block’s shared scale and min.
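A minimal round-trip sketch of this reconstruction (using a single float scale and min per block, not the exact two-level Q4_K arithmetic):

```
import numpy as np

def dequantize(q, scale, m):
    # Approximate reconstruction: w ≈ q * scale + Min, where Min = -m
    return q.astype(np.float32) * scale - m

w     = np.random.randn(32).astype(np.float32)
scale = (w.max() - w.min()) / 15.0
m     = -float(w.min())
q     = np.clip(np.round((w + m) / scale), 0, 15)
w_hat = dequantize(q, scale, m)
print(np.abs(w - w_hat).max() <= scale / 2 + 1e-6)   # True: error is at most half a quantization step
```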
mantafloppy@reddit
ChatGPT's answer seems to make sense.
GGUF quantization can definitely seem cryptic, especially with all the technical jargon around "super-blocks," "blocks," and differences between row-wise and block-wise quantization. I can help break this down to clarify the terminology and the specifics of GGUF quantization, particularly the Q4_K type.
Basics of GGUF Quantization
The purpose of GGUF quantization is to reduce the size of large language models (LLMs) by compressing weights. Different types of GGUF quantization (e.g., Q4_K, Q8_0) have their own unique structures for organizing and quantizing these weights.
Quantization Levels: GGUF typically refers to the number of bits used per weight. For instance, "4-bit" quantization means weights are stored using 4 bits instead of the typical 32 bits in floating-point format.
Block Size: Blocks are smaller, fixed-size groups of weights. For instance, in many quantization types, weights are split into blocks of 32. Each block contains weight values that are grouped together for compression.
Super-Block Size: Super-blocks are larger structures that contain several blocks. For Q4_K, each super-block contains 8 blocks (each with 32 weights).
Q4_K Quantization Process
In Q4_K, the model weights are quantized into 4-bit values with the following structure:
- Block Structure: A super-block in Q4_K contains 8 blocks, each holding 32 weights.
- Scales and Minimums: Each block has associated "scale" and "minimum" values to represent the range of weights more accurately. In Q4_K, scales and minimums are quantized using 6 bits.
Here's a step-by-step breakdown of the process:
1. Quantize the weights: the weights in each block are quantized with 4 bits per weight.
2. Calculate scales and minimums: quantize this range information using 6 bits for scales and mins. This helps to map each 4-bit quantized weight back to its approximate floating-point value.
3. Weight encoding: pack the 4-bit values, two per byte.
Row-Wise vs. Block-Wise Quantization
Here's where things get trickier. Block-wise and row-wise quantization differ mainly in the granularity and flexibility of weight grouping:
- Block-Wise Quantization: each block represents a fixed group (e.g., 32 consecutive weights), regardless of the matrix's row/column layout. This rigidity can sometimes waste memory if the model's matrix isn't perfectly divisible by the block size.
- Row-Wise Quantization: the quantization metadata is computed per row of the weight matrix rather than per fixed-size block, so the number of columns does not have to be divisible by the block size.
For most large LLMs, matrices are often divisible by 32 or 64 columns, making the block-wise approach more common. Row-wise quantization, however, is useful in cases where weights are organized in a way that doesn't align with strict block sizes.
Summing It Up
To clarify:
- In Q4_K, "super-blocks" are units that group 8 blocks of 32 weights each, with 4-bit quantized values and 6-bit scales and mins.
- Block-wise quantization is fixed in structure, whereas row-wise quantization adjusts more dynamically based on rows.
If you're looking for the exact calculations or a deeper dive into the encoding math, I'd recommend looking at specific library code like llama.cpp. There, you can see how weights are packed, which can help illuminate the underlying implementation.
Free_Significance267@reddit (OP)
This still does not explain what happens regarding the super-blocks and blocks. Similar to the human responses, it mentions that a super-block contains 8 blocks. So if there is no computation involved in the super-blocks, then what is the point of defining the super-block? Does the super-block have an extra scaling weight that works in combination with its sub-block scales?
Besides, I don't think we can rely on ChatGPT results for this. Thanks though.
mantafloppy@reddit
There's a good discussion here (second comment), it might be what you are looking for.
https://www.reddit.com/r/LocalLLaMA/comments/1dved4c/llamacpp_kquants/
I don't know how much math you know; from the little I understand myself, it's advanced matrix math, and a simple answer might not be found short of exploring the code and having a good grasp of the math used.
Good luck :)