Mixed Precision Quants
Posted by nikgeo25@reddit | LocalLLaMA | 5 comments
Is anybody using mixed-precision quantization on a regular basis? For example, keeping one part of the model at 8-bit and another at 4-bit FP.
What methods are you using for deciding which layers / experts should be higher precision?
nastywoodelfxo@reddit
most people don't manually pick layers, they use an importance matrix (imatrix) or similar tools that estimate how sensitive each layer is from a calibration pass. the layers that hurt quality the most at lower precision get bumped to higher bits
if you're doing it manually, attention layers usually benefit most from higher precision, especially on instruct models where instruction following degrades fast with bad quantization of those layers
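A rough sketch of the "bump the most sensitive layers" idea, assuming you already have a sensitivity score per layer (for example, how much perplexity rises when only that layer is dropped to 4-bit). The function name `assign_bits`, the layer names, and the scores are made up for illustration and are not taken from any particular tool:

```python
def assign_bits(sensitivity, params, budget_bits, low=4, high=8):
    """Greedy mixed-precision assignment (illustrative only).

    sensitivity: dict layer name -> quality loss when quantized at `low` bits
    params:      dict layer name -> number of weights in that layer
    budget_bits: total number of bits allowed for all weights
    """
    # Start with every layer at the low precision.
    bits = {name: low for name in sensitivity}
    used = sum(params[name] * low for name in bits)

    # Promote layers in order of sensitivity per weight, so cheap but
    # fragile layers are upgraded first.
    order = sorted(sensitivity,
                   key=lambda name: sensitivity[name] / params[name],
                   reverse=True)
    for name in order:
        extra = params[name] * (high - low)
        if used + extra <= budget_bits:
            bits[name] = high
            used += extra
    return bits


# Hypothetical scores: attention layers are small but sensitive.
sensitivity = {"attn.0": 0.30, "ffn.0": 0.10, "attn.1": 0.20, "ffn.1": 0.05}
params = {"attn.0": 100, "ffn.0": 300, "attn.1": 100, "ffn.1": 300}
plan = assign_bits(sensitivity, params, budget_bits=4000)
```

With this toy budget (an average of 5 bits per weight), both attention layers end up at 8-bit and both FFN layers stay at 4-bit.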
nikgeo25@reddit (OP)
That's a great reference, thanks
nastywoodelfxo@reddit
used them a bit - ran some internal benchmarks and the improvement on perplexity was tiny compared to the memory savings
ended up keeping attention heads higher precision, rest at q4. haven't done anything systematic with layer-wise importance sampling though
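To get a feel for the memory side of that trade-off, here is a back-of-the-envelope estimate of average bits per weight for a split like "attention at 8-bit, rest at 4-bit". The parameter counts are placeholders, and real block formats also store scales, so actual bits per weight come out somewhat higher than the nominal figure:

```python
def average_bits(groups):
    """groups: list of (param_count, bits) pairs.
    Returns the parameter-weighted average bits per weight."""
    total = sum(count for count, _ in groups)
    return sum(count * bits for count, bits in groups) / total


def size_gib(param_count, bits_per_weight):
    """Approximate weight storage in GiB, ignoring scales and metadata."""
    return param_count * bits_per_weight / 8 / 2**30


# Placeholder split: one quarter of the weights in attention, the rest elsewhere.
groups = [(1_000_000_000, 8),   # attention tensors at 8-bit
          (3_000_000_000, 4)]   # everything else at 4-bit
avg = average_bits(groups)
mixed = size_gib(4_000_000_000, avg)
all_q4 = size_gib(4_000_000_000, 4)
```

In this made-up split the average works out to 5 bits per weight, so the mixed model is 25% larger than a uniform 4-bit one.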
BlueDolphinCute@reddit
I've played around with it a bit. Usually I keep embeddings and output layers at higher precision and quantize most of the rest. Honestly, I mostly decide based on benchmark results rather than any fancy layer analysis.
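That rule of thumb can be written down as a small pattern-to-type table. This is only a sketch: the tensor names follow GGUF-style naming, the type labels are just strings here, and `pick_quant` is a hypothetical helper, not part of any quantization tool:

```python
import fnmatch

# First matching pattern wins; the last rule is the catch-all.
RULES = [
    ("token_embd.*", "Q8_0"),  # embeddings kept at higher precision
    ("output.*", "Q8_0"),      # output layer kept at higher precision
    ("*", "Q4_K"),             # everything else quantized harder
]


def pick_quant(tensor_name, rules=RULES):
    """Return the quant type label of the first rule matching tensor_name."""
    for pattern, qtype in rules:
        if fnmatch.fnmatchcase(tensor_name, pattern):
            return qtype
    raise ValueError(f"no rule matches {tensor_name!r}")


names = ["token_embd.weight", "blk.0.attn_q.weight",
         "blk.0.ffn_down.weight", "output.weight"]
plan = {name: pick_quant(name) for name in names}
```

Keeping the rules in a list makes it easy to try a variant (say, attention tensors at a higher type), rerun the benchmarks, and compare.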
ABLPHA@reddit
Who doesn't use them? Like 99% of GGUFs are mixed precision nowadays, you just choose the provider that works best for you, which most of the time is unsloth.