Overview of GGUF quantization methods

Posted by he29@reddit | LocalLLaMA

I was getting confused by all the new quantization methods available for llama.cpp, so I did some testing and GitHub discussion reading. In case anyone finds it helpful, here is what I found and how I understand the current state.

TL;DR:

- The "i" in i-quants does not stand for "imatrix": the importance matrix is a separate feature that can (and should) be used with legacy and k-quants as well.
- Depending on your hardware, the smaller i-quants may actually run slower than k-quants, so it is worth trying both.

Details

I decided to finally try Qwen 1.5 72B after realizing how high it ranks in the LLM arena. Since I'm limited to 16 GB of VRAM, my previous experience with 4-bit 70B models was s.l.o.w. and I almost never used them. So instead I tried the new IQ3_M, which is a fair bit smaller and not much worse quality-wise. But, to my surprise, despite fitting more of it into VRAM, it ran even slower.

So I wanted to find out why, and what the difference actually is between all the quantization types that now keep appearing every few weeks. By no means am I an expert on this, so take everything with a shaker of salt. :)

Legacy quants (Q4_0, Q4_1, Q8_0, ...)
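
The original formats from the early ggml days. Each block of 32 weights stores its values rounded to a fixed number of bits plus a single scale (the _1 variants also store a minimum, the _0 variants do not). Dequantization is just a multiply (and an add), which keeps them fast and simple, but one outlier can blow up the precision of its entire block. A rough Python sketch of the Q8_0 arithmetic (the block size and rounding match my understanding; the storage packing is simplified away):

```python
import numpy as np

def quantize_q8_0_block(weights: np.ndarray):
    """Quantize one block of 32 weights, Q8_0-style: one scale, int8 values.
    Toy sketch: real llama.cpp stores the scale as fp16 and packs blocks
    tightly; only the arithmetic is shown here."""
    assert weights.size == 32
    # One scale per block, chosen so the largest weight maps to +/-127.
    scale = float(np.abs(weights).max()) / 127.0
    if scale == 0.0:
        return 0.0, np.zeros(32, dtype=np.int8)
    quants = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return scale, quants

def dequantize_q8_0_block(scale, quants):
    # Dequantization is a single multiply, which keeps legacy quants fast.
    return scale * quants.astype(np.float32)

block = np.random.randn(32).astype(np.float32)
scale, quants = quantize_q8_0_block(block)
error = np.abs(block - dequantize_q8_0_block(scale, quants)).max()
print(f"scale={scale:.4f}, max abs error={error:.5f}")
```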

K-quants (Q3_K_S, Q5_K_M, ...)
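
The second generation, and still the default recommendation at most sizes. As far as I can tell, weights are grouped into "superblocks" of 256 split into sub-blocks, and the per-sub-block scales (and minimums, where used) are themselves quantized against a single floating-point superblock scale, so less of the bit budget is wasted on metadata. The S/M/L suffixes select mixes that spend more bits on the most sensitive tensors. A toy Python sketch of the two-level structure (the sub-block layout is loosely modeled on Q4_K; real packing is far more careful):

```python
import numpy as np

def quantize_k_style(weights: np.ndarray):
    """Toy sketch of the k-quant idea: a 256-weight superblock whose
    per-sub-block scales are themselves quantized (here to 6 bits)
    against one floating-point superblock scale."""
    assert weights.size == 256
    subs = weights.reshape(8, 32)                 # 8 sub-blocks of 32 weights
    sub_scales = np.abs(subs).max(axis=1) / 7.0   # ideal fp scale per sub-block
    super_scale = max(sub_scales.max() / 63.0, 1e-12)
    # The sub-block scales are stored as small integers (6 bits here),
    # so only one full-precision number is kept per 256 weights.
    q_scales = np.clip(np.round(sub_scales / super_scale), 1, 63).astype(np.uint8)
    eff = q_scales.astype(np.float32) * super_scale
    quants = np.clip(np.round(subs / eff[:, None]), -7, 7).astype(np.int8)
    return super_scale, q_scales, quants          # all that needs storing

super_scale, q_scales, quants = quantize_k_style(np.random.randn(256))
dequant = (q_scales.astype(np.float32) * super_scale)[:, None] * quants
```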

I-quants (IQ2_XXS, IQ3_S, ...)
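
The newest family, targeting very low bpw with ideas borrowed from QuIP#. Instead of rounding every weight on its own uniform grid, small groups of weights are matched to the nearest entry of a fixed codebook (a lattice of allowed patterns), and only that entry's index gets stored. This buys a lot of quality at 2-3 bpw, but dequantization now involves table lookups instead of a plain multiply, and that extra work is presumably why my smaller IQ3_M ran slower: it hurts most on CPU, and with a 72B model on 16 GB of VRAM a lot of layers inevitably run there. A toy sketch of the codebook idea (random codebook here; the real ones are carefully constructed, with per-block scales on top):

```python
import numpy as np

def quantize_with_codebook(weights: np.ndarray, codebook: np.ndarray):
    """Toy sketch of codebook quantization, the rough idea behind i-quants.
    Each group of weights is replaced by the index of its nearest codebook
    entry; only the index is stored."""
    groups = weights.reshape(-1, codebook.shape[1])
    # Squared distance from every group to every codebook entry.
    dists = ((groups[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # what actually gets stored
    dequant = codebook[indices]      # dequantization = table lookup
    return indices, dequant.reshape(weights.shape)

# Example: 256 weights in groups of 8, 256 codebook entries
# (8 bits per 8 weights = 1 bpw, ignoring scales).
rng = np.random.default_rng(0)
codebook = rng.standard_normal((256, 8)).astype(np.float32)
weights = rng.standard_normal(256).astype(np.float32)
indices, dequant = quantize_with_codebook(weights, codebook)
```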

Future ??-quants

Importance matrix

Somewhat confusingly, it was introduced around the same time as the i-quants, which made me think they are related and that the "i" refers to the "imatrix". But this is apparently not the case: you can make both legacy and k-quants that use an imatrix, and i-quants that do not. All the imatrix does is tell the quantization method which weights are more important, so that it can pick the per-block constants in a way that prioritizes minimizing the error on the important weights. The only reason i-quants and the imatrix appeared at the same time was likely that the first i-quant presented was a 2-bit one – without the importance matrix, such a low-bpw quant would simply be unusable.
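
For intuition, here is a toy illustration (my own, not llama.cpp's actual search) of what that means for a single block: instead of minimizing the plain squared error, the scale is chosen to minimize the error weighted by importance, so the important weights come out more accurate:

```python
import numpy as np

def best_scale(weights, importance, candidates):
    """Pick the per-block scale that minimizes importance-weighted
    squared error on a 4-bit-ish grid. Toy sketch only; the actual
    per-type search in llama.cpp is more involved."""
    best_d, best_err = None, np.inf
    for d in candidates:
        quants = np.clip(np.round(weights / d), -7, 7)
        err = float((importance * (weights - d * quants) ** 2).sum())
        if err < best_err:
            best_d, best_err = d, err
    return best_d

weights = np.random.randn(32).astype(np.float32)
importance = np.ones(32, dtype=np.float32)
importance[:4] = 100.0   # pretend the first four weights matter a lot
candidates = np.abs(weights).max() / 7.0 * np.linspace(0.5, 1.2, 64)
print(best_scale(weights, importance, candidates))
```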

Note that this means you can't easily tell from the name whether a model was quantized with the help of an importance matrix. I first found this annoying, because it was not clear if and how the calibration dataset affects the model's performance in ways other than just positive ones. But recent tests in llama.cpp discussion #5263 show that while the data used to prepare the imatrix slightly affects how the model performs in (un)related languages or specializations, any dataset performs better than a "vanilla" quantization with no imatrix. So now I instead find it annoying because sometimes the only way to be sure I'm using the better imatrix version is to re-quantize the model myself.
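
For anyone wanting to do the same, the rough flow with llama.cpp's example programs looks like this (tool names and flags as of the versions I used – double-check against the current README):

```sh
# 1) Build the importance matrix from a calibration text file.
./imatrix -m model-f16.gguf -f calibration.txt -o imatrix.dat

# 2) Quantize with the imatrix; works for legacy, k- and i-quants alike.
./quantize --imatrix imatrix.dat model-f16.gguf model-IQ3_M.gguf IQ3_M
```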

So, that's about it. Please feel free to add more information or point out any mistakes; it is getting late in my timezone, so I'm running on a rather low IQ at the moment. :)