Qwen3 Next imatrix GGUFs up!
Posted by noneabove1182@reddit | LocalLLaMA | View on Reddit | 18 comments
Just figured I'd post in case anyone's looking for imatrix and IQ quants
https://huggingface.co/bartowski/Qwen_Qwen3-Next-80B-A3B-Instruct-GGUF
https://huggingface.co/bartowski/Qwen_Qwen3-Next-80B-A3B-Thinking-GGUF
As usual, this also uses my PR/fork for slightly more optimized MoE quantization
https://github.com/ggml-org/llama.cpp/pull/12727
YearZero@reddit
I'm still trying to figure out if I can squeeze the thing into 8GB VRAM + 32GB RAM with the least amount of brain damage, might have to give these a shot! Thank you!
nickless07@reddit
use one of the smaller quants?
IQ2_XXS 19.7 GB
IQ2_XS 22.1 GB
IQ2_S 22.2 GB
Offload the expert weights to CPU (--override-tensor '([3-8]+).ffn_.*_exps.=CPU' should do the trick) so the rest fits in VRAM, and it works like a charm.
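The override pattern is just a regex matched against tensor names, so you can sanity-check which tensors it would send to CPU by running it over some llama.cpp-style names with grep (the specific tensor names below are illustrative, not taken from this model):

```shell
# Sketch: check which tensor names the --override-tensor regex catches.
# llama.cpp names expert tensors like blk.<layer>.ffn_*_exps.weight;
# these example names are made up for illustration.
printf '%s\n' \
  'blk.3.ffn_gate_exps.weight' \
  'blk.12.ffn_up_exps.weight' \
  'blk.47.ffn_down_exps.weight' \
  'blk.5.attn_q.weight' \
  | grep -E '([3-8]+).ffn_.*_exps.'
# prints blk.3.ffn_gate_exps.weight and blk.47.ffn_down_exps.weight:
# only expert tensors in layers whose number is written with digits 3-8
# match, so blk.12's experts would stay on GPU with this pattern
```

Note that `[3-8]+` only catches layer numbers built from the digits 3 through 8; if you want every expert tensor on CPU, a broader pattern like `ffn_.*_exps.=CPU` should do it.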
ForsookComparison@reddit
How well does 80B hold up to Q2 quantization?
I know Llama 3.3 70B held onto some of its wits, but with so few active params I feel like Qwen3-Next-80B could be unusable at that level
YearZero@reddit
Can confirm IQ2_XS is much dumber than Q4.
pmttyji@reddit
It's better to try Q4 first before going lower (for this model). Check my other comment.
nickless07@reddit
MoEs hold up way better than dense models at smaller quants. Q2 or Q3 is almost like Q4 on a dense LLM. At least that's what I noticed during my little tests. Just give it a try.
pmttyji@reddit
Don't worry. We have the same config, and someone shared stats with me from a similar setup: 17 t/s with Q4. Quantizing the KV cache to Q8 could give additional t/s.
Enjoy!
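Putting the suggestions in this thread together, a rough sketch of a llama.cpp invocation for the 8GB VRAM + 32GB RAM setup (the model filename and context size are placeholders; `--override-tensor` and `--cache-type-k`/`--cache-type-v` are real llama.cpp flags):

```shell
# Sketch, not a tested command: keep expert weights on CPU, everything
# else on GPU, with a Q8-quantized KV cache to save more VRAM.
./llama-cli \
  -m Qwen_Qwen3-Next-80B-A3B-Instruct-IQ2_XS.gguf \  # placeholder path
  -ngl 99 \
  --override-tensor '([3-8]+).ffn_.*_exps.=CPU' \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 8192  # adjust context to taste/RAM
```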
ANR2ME@reddit
Increase your swap file size 😁
ANR2ME@reddit
Btw, which one better between imatrix GGUF vs Unsloth Dynamic 2.0 GGUF? 🤔
AppearanceHeavy6724@reddit
My vibe checks suggest that OG q4_k_m and unsloth q4_k_xl are both better than imatrix stuff.
ANR2ME@reddit
That is a surprise 😯
AppearanceHeavy6724@reddit
YMMV though. Imatrix stuff could be better at coding, but worse at creative writing.
ForsookComparison@reddit
Quantization is still a very poorly documented/benched field. Try them all, report back when done.
noneabove1182@reddit (OP)
Hard to quantify. Old tests showed they were basically identical, with some of my quants being better for the size and some of theirs being better for the size; I don't think you can go wrong with either.
Though it should be noted, unsloth's quants also typically use imatrix to achieve that similar performance, and this time around they didn't, so for low bitrates I would expect these to be better.
ANR2ME@reddit
Hmm... when I asked Google, it said UD 2.0 outperformed standard imatrix 🤔
noneabove1182@reddit (OP)
I mean, that's just restating what they post on their card, it doesn't mean it's necessarily true 😅
Also, it's worth pointing out that, bit for bit, their quants are probably better than mainline llama.cpp for MoE models; hence the fork I maintain, which brings some much-needed intelligence to the quantization layers of MoE models, same as unsloth.
Just imatrix vs their "dynamic 2.0", imatrix likely wins
Imatrix with their quant wins
Imatrix with my quant, it's a toss up
knvn8@reddit
Anyone else tried it yet? I tried the EXL3 version and thought it went off the rails really quickly, curious whether quantization method has much effect.
Odd-Ordinary-5922@reddit
thanks bartowski