Mixed Precision Quants

Posted by nikgeo25@reddit | LocalLLaMA | 5 comments

Is anybody using mixed precision quantizations on the regular? Like having one part of the model at 8-bit and another at 4-bit FP.

What methods are you using for deciding which layers / experts should be higher precision?

nastywoodelfxo@reddit

most people don't manually pick layers, they use an importance matrix (imatrix) or similar tools that measure how sensitive each layer is to quantization during a calibration pass. the layers that hurt quality the most at lower precision get bumped to higher bits

if you're doing it manually, attention layers usually benefit most from higher precision, especially on instruct models, where instruction following degrades fast when those layers are quantized badly
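
A minimal sketch of the calibration idea above, not how any particular tool implements it: quantize each layer to the low bit width, measure how much its output changes on calibration inputs, and promote the most sensitive layers. The toy "layers" are flat weight vectors, and the names `quantize`, `sensitivity`, and `assign_bits` are made up for illustration.

```python
def quantize(weights, bits):
    """Symmetric round-to-nearest quantization of a flat list of floats."""
    levels = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit
    scale = (max(abs(w) for w in weights) / levels) or 1.0
    return [round(w / scale) * scale for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sensitivity(weights, calib_inputs, bits):
    """Mean squared change in the layer output caused by quantization."""
    wq = quantize(weights, bits)
    errors = [(dot(weights, x) - dot(wq, x)) ** 2 for x in calib_inputs]
    return sum(errors) / len(errors)


def assign_bits(layers, calib, low_bits=4, high_bits=8, n_high=1):
    """Give the n_high most sensitive layers high_bits, the rest low_bits."""
    scores = {name: sensitivity(w, calib[name], low_bits)
              for name, w in layers.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    promoted = set(ranked[:n_high])
    return {name: high_bits if name in promoted else low_bits
            for name in layers}


# Hypothetical toy data: two "layers" and one calibration input each.
layers = {
    "layer_a": [1.0, 0.05, 0.33, -0.61],   # rounds poorly at 4 bits
    "layer_b": [0.7, -0.7, 0.7, 0.7],      # lands exactly on the 4-bit grid
}
calib = {name: [[1.0, 1.0, 1.0, 1.0]] for name in layers}
print(assign_bits(layers, calib))
```

Real tools work on full tensors and usually score against end-to-end metrics or activation statistics, so treat this only as an illustration of the ranking step.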

nikgeo25@reddit (OP)

That's a great reference, thanks

nastywoodelfxo@reddit

used them a bit - ran some internal benchmarks and the improvement on perplexity was tiny compared to the memory savings

ended up keeping attention heads higher precision, rest at q4. haven't done anything systematic with layer-wise importance sampling though
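
For a rough feel of the memory side of that trade-off, the average bits per weight of a mixed setup can be estimated from the parameter split. The parameter counts below are hypothetical, and the sketch ignores the per-block scales that real quantization formats store, so actual files come out somewhat larger.

```python
def average_bits(param_counts, bits_per_group):
    """Parameter-weighted average bit width across groups of tensors."""
    total_params = sum(param_counts.values())
    total_bits = sum(param_counts[g] * bits_per_group[g] for g in param_counts)
    return total_bits / total_params


def size_gib(param_counts, bits_per_group):
    """Approximate weight storage in GiB, ignoring scales and metadata."""
    total_bits = sum(param_counts[g] * bits_per_group[g] for g in param_counts)
    return total_bits / 8 / 2 ** 30


# Hypothetical 7B-parameter model: 2B attention weights, 5B everything else.
params = {"attention": 2_000_000_000, "rest": 5_000_000_000}

mixed = {"attention": 8, "rest": 4}      # attention at 8-bit, rest at 4-bit
uniform = {"attention": 4, "rest": 4}    # everything at 4-bit

print(f"mixed:   {average_bits(params, mixed):.2f} bits/weight, "
      f"{size_gib(params, mixed):.2f} GiB")
print(f"uniform: {average_bits(params, uniform):.2f} bits/weight, "
      f"{size_gib(params, uniform):.2f} GiB")
```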

BlueDolphinCute@reddit

I've played around with it a bit. Usually I keep embeddings and output layers at higher precision and quantize most of the rest. Honestly, I mostly decide based on benchmark results rather than any fancy layer analysis.
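
A manual rule like this can be written down as a small name-to-type table. The sketch below assumes GGUF-style tensor names (`token_embd.weight`, `output.weight`, `blk.N.…`) and uses `Q8_0` / `Q4_K` as example type labels; `pick_type` and the rule list are hypothetical helpers, not part of any quantization tool. The attention rule is the other suggestion from this thread and is optional.

```python
from fnmatch import fnmatchcase

# (pattern, quant type) pairs, checked in order; first match wins.
RULES = [
    ("token_embd.*", "Q8_0"),    # input embeddings at higher precision
    ("output.*", "Q8_0"),        # output projection at higher precision
    ("blk.*.attn_*", "Q8_0"),    # optional: attention tensors at higher precision
]
DEFAULT_TYPE = "Q4_K"            # everything else


def pick_type(tensor_name, rules=RULES, default=DEFAULT_TYPE):
    """Return the quant type label for a tensor based on its name."""
    for pattern, qtype in rules:
        if fnmatchcase(tensor_name, pattern):
            return qtype
    return default


# Example tensor names, assumed for illustration.
for name in ["token_embd.weight", "blk.0.attn_q.weight",
             "blk.0.ffn_up.weight", "output.weight"]:
    print(name, "->", pick_type(name))
```

Whichever table you end up with, the benchmark-driven approach still applies: change one rule at a time and re-run the evals.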

ABLPHA@reddit

Who doesn't use them? Like 99% of GGUFs are mixed precision nowadays, you just choose the provider that works best for you, which most of the time is unsloth.