APEX MoE quants update: 25+ new models since the Qwen 3.5 post + new I-Nano tier
Posted by mudler_it@reddit | LocalLLaMA | View on Reddit | 35 comments
Quick follow-up on APEX, the MoE-aware mixed-precision quant strategy. The original post was just about Qwen 3.5 35B-A3B ( https://www.reddit.com/r/LocalLLaMA/comments/1s9vzry/apex_moe_quantized_models_boost_with_33_faster/ ); since then the collection has grown to 30+ MoEs across most major families. Plus a new ultra-compressed tier landed.
Feedback so far
The reports coming back have been honestly better than I expected!
- Long context holds up. People report APEX I-Balanced and I-Compact retaining coherence well past 32k tokens on the 30-50B-class MoEs, even at sizes where uniform Q4_K starts visibly degrading. The hypothesis: keeping shared experts and edge layers at high precision (where rare/long-range tokens get routed and embedded) preserves the long-context behavior that aggressive uniform quants tend to break. The numbers back this up: APEX shows by far the best KL99% values compared to the other quants (see the sketch below for what that metric measures).
- Coding quants punch above their size. Qwen 3.6 35B-A3B users in particular have been flagging that I-Compact and I-Mini stay surprisingly close to F16 on real code tasks, closer than their size class would suggest.
Thanks to everyone reporting back, that's what justifies pushing further on the low-bit tiers below.
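For anyone wondering what KL99% refers to: it's the 99th-percentile of the per-token KL divergence between the full-precision model and the quant, i.e. a tail-quality signal rather than an average. Below is a minimal, illustrative sketch of how such a number can be computed from saved logits; it is not the exact evaluation harness behind the APEX figures.

```python
import numpy as np

def kl_per_token(logits_fp: np.ndarray, logits_q: np.ndarray) -> np.ndarray:
    """Per-token KL(P_fp || P_q) from two (n_tokens, vocab_size) logit arrays."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    logp = log_softmax(logits_fp)   # full-precision reference
    logq = log_softmax(logits_q)    # quantized model
    p = np.exp(logp)
    return (p * (logp - logq)).sum(axis=-1)   # shape: (n_tokens,)

# Mean KLD describes typical tokens; the 99th percentile (KL99%) and max
# describe the worst tokens -- the ones that tend to break long-context runs.
# kld = kl_per_token(fp_logits, q_logits)
# print(kld.mean(), np.percentile(kld, 99), kld.max())
```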
Models added since the first post
Grouped by family. Most are 30-70B-class MoEs that fit on a single consumer GPU at I-Mini/I-Compact:
Qwen lineage
- Qwen 3.5 122B-A10B, Qwen 3.5 397B-A17B, Qwen3.5 Claude-Distilled, Qwen3.5 Fernflower (uncensored), Qwen3.5 TQ
- Qwen 3.6 35B-A3B, +heretic, +Claude 4.6 distill, +Claude 4.7 distill
- Qwen3-Coder 30B, Qwen3-Coder Next
Frontier-size MoEs (rented Blackwell to quantize)
- MiniMax-M2.5, MiniMax-M2.7 — 228B / 24B active, the biggest yet
- Mistral-Small 4 119B-2603
- NVIDIA Nemotron-3-Super 120B-A12B
- GLM-4.7 Flash, Step-3.5 Flash
- Huihui3.5 67B-A3B
Hybrid Mamba / SSM MoEs
- Nemotron-3-Nano 30B-A3B, Nemotron-3-Nano-Omni Reasoning — multimodal (vision + audio + text)
- Holo3 35B-A3B
- LFM2 24B-A2B
Gemma 4 family
- gemma-4 26B-A4B-it (just re-quantized today with Google's updated chat template), +Claude Opus distill, +heretic, Gemopus-4 Preview
Community MoE merges
- Carnice MoE 35B-A3B, Carnice-Qwen3.6, Qwopus MoE 35B-A3B
New tier: I-Nano (IQ2_XXS)
Pushes mid-layer routed experts down to IQ2_XXS (2.06 bpw), near-edge layers to IQ2_S, edge layers to Q3_K, and keeps shared experts at Q5_K. About 20% smaller than I-Mini, and viable only on MoE thanks to sparse per-token expert activation. Requires an imatrix. A rough sketch of the per-tensor assignment follows the size examples below.
Examples:
- Qwen 3.5 35B-A3B: I-Mini 13 GB → I-Nano 11 GB
- Nemotron Omni 30B: I-Mini 18 GB → I-Nano 17 GB (less savings — denser shared expert)
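To make the tiering concrete, here's a minimal sketch of the kind of per-tensor assignment described above. The layer cut-offs, tensor-name patterns, and the fallback type are illustrative assumptions, not the actual APEX recipe; the real mapping (plus the imatrix step) lives in the repo linked below.

```python
def nano_quant_type(name: str, layer: int, n_layers: int) -> str:
    """Pick a quant type for one tensor from its role and depth (illustrative only)."""
    edge, near_edge = 2, 4                       # hypothetical cut-offs, not APEX's values
    if "shexp" in name or "shared_expert" in name:
        return "Q5_K"                            # shared experts: every token passes through
    if "exps" in name:                           # routed expert weights
        if layer < edge or layer >= n_layers - edge:
            return "Q3_K"                        # edge layers next to embeddings / output head
        if layer < near_edge or layer >= n_layers - near_edge:
            return "IQ2_S"                       # near-edge layers
        return "IQ2_XXS"                         # mid layers, ~2.06 bpw
    return "Q4_K"                                # attention / norms / dense FFN: assumed default

# e.g. nano_quant_type("blk.24.ffn_down_exps.weight", 24, 48) -> "IQ2_XXS"
```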
Links
- Collection: https://huggingface.co/collections/mudler/apex-quants-gguf
- Project + paper: https://github.com/mudler/apex-quant
If you've used APEX quants and have feedback, comments welcome!
inddiepack@reddit
Thank you for your work, mudler! I've been using your quants since gemma 4 came out, and they're now my go-to quants for both gemma 4 and qwen 3.6 MoE models.
RelicDerelict@reddit
Is APEX better than Autoround?
mudler_it@reddit (OP)
Can't answer that, as I don't like to talk about what I haven't personally tried. I benchmarked only against Unsloth and bartowski quants; against those I can tell it holds long context much better and does better on coding agent tasks.
NicholasCureton@reddit
I've been using Qwen3.6-35B-A3B-APEX-I-Compact.gguf on my 8GB VRAM / 16GB RAM machine at 29-37 t/s. It's coded an entire multi-model agentic chat inference CLI client app for me. I've also tried Qwen3.6-35B-A3B-UD-Q5_K_M.gguf on my office PC, which has 12GB VRAM. Somehow it feels like your quant makes fewer errors in coding. I don't know what you did, but for coding it's really good.
letsgoiowa@reddit
Any plans to post direct to Ollama for those who are still working on getting llama.cpp configured?
mudler_it@reddit (OP)
not really fond of ollama. I'd suggest running these on LocalAI :P
scarydeaddan@reddit
MiniMax M2.7 APEX Mini on my strix halo box has been great for coding tasks. obv slow with the low memory bandwidth, but context holds up and the output is very usable.
mudler_it@reddit (OP)
Glad to hear!
Bulky-Priority6824@reddit
I was a big fan of the 3.5 35B-A3B, but unfortunately I've had to stick with Unsloth for 3.6 because I couldn't get the chat template to stop sending think tags to Frigate, which I also use the LLM with for GenAI on security cameras.
mudler_it@reddit (OP)
this is quite weird - what/how are you running it?
Hot_Turnip_3309@reddit
these APEX models are great!
mudler_it@reddit (OP)
Thanks! Glad to hear!
SirDomz@reddit
Any plans to do mlx? Would love to compare those with the Oq quants by OMLX /Jundot
mudler_it@reddit (OP)
I'd give it a shot, but MLX has far less fine-grained support for quantization schemes. I'm monitoring the MLX ecosystem closely and will push quants when there's feature parity.
fatboy93@reddit
Yeah, how does this fare against Optiq as well?
No_Algae1753@reddit
Just a question: How is it possible that even with a slightly higher kld your models beat unsloth models in benchmarks?
Top-Rub-4670@reddit
Unsloth's models are optimized specifically to get that low kld.
Presumably/hopefully APEX is optimized for real-world performance and not to look good on charts for marketing purposes.
mudler_it@reddit (OP)
Thanks for calling this out - I carefully engineered APEX quants around real use cases rather than pumping benchmarks. I'm glad it shows.
mudler_it@reddit (OP)
it is very comparable in terms of KLD - but if you look closer at the numbers, KLD Max favors APEX by a bigger margin. That's a better signal than worrying about a 0.0001 difference :-)
sanjxz54@reddit
u/mudler_it Please fix the Nemotron 3 120B APEX-I-Mini quant when you can; its weights are only 9GB. Amazing work otherwise, love the 3.6 35B for coding.
mudler_it@reddit (OP)
Good catch and thanks for flagging this. For >=120B I need to rent a GPU, and that's a bit out of reach right now. We had a donor who gave me access to GPUs for a while, but I'm back to donation-only capacity now and that still doesn't cut it.
sterby92@reddit
I just tried a few prompts with Qwen3.6-35B-A3B-APEX-GGUF:APEX-I-Balanced, around coding, web research, and agentic tool use. And as far as I can tell it feels noticeably better than the unsloth Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL quant. It is 2GB larger, but still runs around 5 tok/sec faster, 55 tok/sec -> 60 tok/sec on strix halo.
So looks and feels great! I'll continue using it and see how it goes :)
Evening_Ad6637@reddit
I can confirm this. I've been using APEX-I-Balanced since the day mudler released it and I am so impressed; it's much better than anything I've tried (regarding quants) and, as you say, it's somehow faster than other quants as well.
I only have one logical explanation: this must be black magic
mudler_it@reddit (OP)
apex quants are my daily driver now too :) It's amusing to see how many others call this out as black magic, it kinda feels like it!
Thanks for the feedback!
metmelo@reddit
I tested Qwen Coder Next some time ago and felt the same way.
Substantial_Step_351@reddit
Makes sense, especially the shared expert precision theory. Rare tokens route there more often, and they carry the long-range signal that uniform quant normally flattens. The thing I'd like to see tested is whether the KL99% advantage holds on actual document-length tasks vs. synthetic needle-in-a-haystack, since those usually diverge. More than most quant strategies even have a testable hypothesis tbh
BitGreen1270@reddit
Man thank you for doing this! Going to give gemma4 a spin today. Question - do we stick to the recommended sampling params published by Google? Also how do these hold up with kv quantization to q8?
horeaper@reddit
I still don't understand the difference between Quality and Balanced. If I'm for coding, which one should I use?
pmttyji@reddit
Both the GitHub repo & the model cards have those details.
horeaper@reddit
It basically says nothing. How do you compare "Best overall" and "Highest quality"?
pmttyji@reddit
The GitHub repo version is better; see section 3, "Five tiers".
horeaper@reddit
Ohh, ok, this is a lot better. So I assume Quality and Balanced are on the same level, except Quality is smaller/slower and Balanced is bigger/faster. Need to do some testing after my new GPU arrives >_<
pmttyji@reddit
Since you created GGUFs for early models like Qwen3-Coder-30B, I have a request for a few early & recent models. Please create GGUFs if possible. Thanks on behalf of all.
AffectionateOcelot7@reddit
I don't see any Qwen 3.6 numbers; your KLD and perplexity numbers are only posted for 3.5. So are you assuming the numbers are the same for 3.6, or is there another benchmark that I'm missing?
streppelchen@reddit
found the models by accident, will still need to give them a try, but i like the idea, keep it up :)