Qwen3 Next imatrix GGUFs up!
Posted by noneabove1182@reddit | LocalLLaMA | View on Reddit | 18 comments
Just figured I'd post in case anyone's looking for imatrix and IQ quants
https://huggingface.co/bartowski/Qwen_Qwen3-Next-80B-A3B-Instruct-GGUF
https://huggingface.co/bartowski/Qwen_Qwen3-Next-80B-A3B-Thinking-GGUF
As usual, this also uses my PR/fork for slightly more optimized MoE quantization
https://github.com/ggml-org/llama.cpp/pull/12727
YearZero@reddit
I'm still trying to figure out if I can squeeze the thing into 8GB VRAM + 32GB RAM with the least amount of brain damage, might have to give these a shot! Thank you!
nickless07@reddit
use one of the smaller quants?
IQ2_XXS 19.7 GB
IQ2_XS 22.1 GB
IQ2_S 22.2 GB
Offload the expert weights to CPU (--override-tensor '([3-8]+).ffn_.*_exps.=CPU' should do the trick) so the rest fits in VRAM, and it works like a charm.
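The override pattern is just a regex matched against tensor names, so you can sanity-check which tensors it would send to CPU by running it over some llama.cpp-style names with grep (the specific tensor names below are illustrative, not taken from this model):

```shell
# Sketch: check which tensor names the --override-tensor regex catches.
# llama.cpp names expert tensors like blk.<layer>.ffn_*_exps.weight;
# these example names are made up for illustration.
printf '%s\n' \
  'blk.3.ffn_gate_exps.weight' \
  'blk.12.ffn_up_exps.weight' \
  'blk.47.ffn_down_exps.weight' \
  'blk.5.attn_q.weight' \
  | grep -E '([3-8]+).ffn_.*_exps.'
# prints blk.3.ffn_gate_exps.weight and blk.47.ffn_down_exps.weight:
# only expert tensors in layers whose number is written with digits 3-8
# match, so blk.12's experts would stay on GPU with this pattern
```

Note that `[3-8]+` only catches layer numbers built from the digits 3 through 8; if you want every expert tensor on CPU, a broader pattern like `ffn_.*_exps.=CPU` should do it.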
ForsookComparison@reddit
How well does 80B hold up to Q2 quantization?
I know Llama 3.3 70B held onto some of its wits, but with so few active params I feel like Qwen3-Next-80B could be unusable at that level
YearZero@reddit
Can confirm IQ2_XS is much dumber than Q4.
pmttyji@reddit
It's better to try Q4 first before going lower (for this model). Check my other comment.
nickless07@reddit
MoEs hold up way better than dense models at smaller quants. Q2 or Q3 is almost like Q4 on a dense LLM. At least that's what I noticed during my little tests. Just give it a try.
pmttyji@reddit
Don't worry. We have the same config, and someone shared stats with me from a similar setup: 17 t/s with Q4. Quantizing the KV cache to Q8 could give additional t/s.
Enjoy!
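Putting the suggestions in this thread together, a rough sketch of a llama.cpp invocation for the 8GB VRAM + 32GB RAM setup (the model filename and context size are placeholders; `--override-tensor` and `--cache-type-k`/`--cache-type-v` are real llama.cpp flags):

```shell
# Sketch, not a tested command: keep expert weights on CPU, everything
# else on GPU, with a Q8-quantized KV cache to save more VRAM.
./llama-cli \
  -m Qwen_Qwen3-Next-80B-A3B-Instruct-IQ2_XS.gguf \  # placeholder path
  -ngl 99 \
  --override-tensor '([3-8]+).ffn_.*_exps.=CPU' \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 8192  # adjust context to taste/RAM
```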
ANR2ME@reddit
Increase your swap file size 😁
ANR2ME@reddit
Btw, which one better between imatrix GGUF vs Unsloth Dynamic 2.0 GGUF? 🤔
AppearanceHeavy6724@reddit
My vibe checks suggest that OG q4_k_m and unsloth q4_k_xl are both better than imatrix stuff.
ANR2ME@reddit
That is a surprise 😯
AppearanceHeavy6724@reddit
YMMV though. Imatrix stuff could be better at coding, but worse at creative writing.
ForsookComparison@reddit
Quantization is still a very poorly documented/benched field. Try them all, report back when done.
noneabove1182@reddit (OP)
Hard to quantify. Old tests showed they were basically identical, with some of my quants being better for the size and some of theirs being better for the size; I don't think you can go wrong with either.
Though it should be noted, unsloth's quants also typically use imatrix to achieve that similar performance, and this time around they didn't, so for low bitrates I would expect these to be better.
ANR2ME@reddit
Hmm... when I asked Google, it said UD 2.0 outperformed standard imatrix 🤔
noneabove1182@reddit (OP)
I mean, that's just restating what they post on their card, it doesn't mean it's necessarily true 😅
Also, it's worth pointing out that, bit for bit, their quants are probably better than mainline llama.cpp for MoE models; hence the fork I maintain, which brings some much-needed intelligence to the quantization layers of MoE models, same as unsloth.
Just imatrix vs their "dynamic 2.0", imatrix likely wins
Imatrix with their quant wins
Imatrix with my quant, it's a toss up
knvn8@reddit
Anyone else tried it yet? I tried the EXL3 version and thought it went off the rails really quickly, curious whether quantization method has much effect.
Odd-Ordinary-5922@reddit
thanks bartowski