APEX MoE quants update: 25+ new models since the Qwen 3.5 post + new I-Nano tier
Posted by mudler_it@reddit | LocalLLaMA | View on Reddit | 35 comments
Quick follow-up on APEX, the MoE-aware mixed-precision quant strategy. The original post was just about Qwen 3.5 35B-A3B ( https://www.reddit.com/r/LocalLLaMA/comments/1s9vzry/apex_moe_quantized_models_boost_with_33_faster/ ); since then the collection has grown to 30+ MoEs across most major families. Plus a new ultra-compressed tier landed.
Feedback so far
The reports coming back have been honestly better than I expected!
- Long context holds up. People report APEX I-Balanced and I-Compact retaining coherence well past 32k tokens on the 30-50B-class MoEs, even at sizes where uniform Q4_K starts visibly degrading. The hypothesis: keeping shared experts and edge layers at high precision (where rare/long-range tokens get routed and embedded) preserves the long-context behavior that aggressive uniform quants tend to break. The numbers back this up: APEX shows by far the best KL99% values compared to the other quants (see the sketch below for what that metric measures).
- Coding quants punch above their size. Qwen 3.6 35B-A3B users in particular have been flagging that I-Compact and I-Mini stay surprisingly close to F16 on real code tasks, closer than their size class would suggest.
Thanks to everyone reporting back, that's what justifies pushing further on the low-bit tiers below.
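For anyone wondering what KL99% refers to: it's the 99th-percentile of the per-token KL divergence between the full-precision model and the quant, i.e. a tail-quality signal rather than an average. Below is a minimal, illustrative sketch of how such a number can be computed from saved logits; it is not the exact evaluation harness behind the APEX figures.

```python
import numpy as np

def kl_per_token(logits_fp: np.ndarray, logits_q: np.ndarray) -> np.ndarray:
    """Per-token KL(P_fp || P_q) from two (n_tokens, vocab_size) logit arrays."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    logp = log_softmax(logits_fp)   # full-precision reference
    logq = log_softmax(logits_q)    # quantized model
    p = np.exp(logp)
    return (p * (logp - logq)).sum(axis=-1)   # shape: (n_tokens,)

# Mean KLD describes typical tokens; the 99th percentile (KL99%) and max
# describe the worst tokens -- the ones that tend to break long-context runs.
# kld = kl_per_token(fp_logits, q_logits)
# print(kld.mean(), np.percentile(kld, 99), kld.max())
```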
Models added since the first post
Grouped by family. Most are 30-70B-class MoEs that fit on a single consumer GPU at I-Mini/I-Compact:
Qwen lineage
- Qwen 3.5 122B-A10B, Qwen 3.5 397B-A17B, Qwen3.5 Claude-Distilled, Qwen3.5 Fernflower (uncensored), Qwen3.5 TQ
- Qwen 3.6 35B-A3B, +heretic, +Claude 4.6 distill, +Claude 4.7 distill
- Qwen3-Coder 30B, Qwen3-Coder Next
Frontier-size MoEs (rented Blackwell to quantize)
- MiniMax-M2.5, MiniMax-M2.7 — 228B / 24B active, the biggest yet
- Mistral-Small 4 119B-2603
- NVIDIA Nemotron-3-Super 120B-A12B
- GLM-4.7 Flash, Step-3.5 Flash
- Huihui3.5 67B-A3B
Hybrid Mamba / SSM MoEs
- Nemotron-3-Nano 30B-A3B, Nemotron-3-Nano-Omni Reasoning — multimodal (vision + audio + text)
- Holo3 35B-A3B
- LFM2 24B-A2B
Gemma 4 family
- gemma-4 26B-A4B-it (just re-quantized today with Google's updated chat template), +Claude Opus distill, +heretic, Gemopus-4 Preview
Community MoE merges
- Carnice MoE 35B-A3B, Carnice-Qwen3.6, Qwopus MoE 35B-A3B
New tier: I-Nano (IQ2_XXS)
Pushes mid-layer routed experts down to IQ2_XXS (2.06 bpw), near-edge layers to IQ2_S, edge layers to Q3_K, and keeps shared experts at Q5_K. About 20% smaller than I-Mini, and viable only on MoE thanks to sparse per-token expert activation. Requires an imatrix. A rough sketch of the per-tensor assignment follows the size examples below.
Examples:
- Qwen 3.5 35B-A3B: I-Mini 13 GB → I-Nano 11 GB
- Nemotron Omni 30B: I-Mini 18 GB → I-Nano 17 GB (less savings — denser shared expert)
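To make the tiering concrete, here's a minimal sketch of the kind of per-tensor assignment described above. The layer cut-offs, tensor-name patterns, and the fallback type are illustrative assumptions, not the actual APEX recipe; the real mapping (plus the imatrix step) lives in the repo linked below.

```python
def nano_quant_type(name: str, layer: int, n_layers: int) -> str:
    """Pick a quant type for one tensor from its role and depth (illustrative only)."""
    edge, near_edge = 2, 4                       # hypothetical cut-offs, not APEX's values
    if "shexp" in name or "shared_expert" in name:
        return "Q5_K"                            # shared experts: every token passes through
    if "exps" in name:                           # routed expert weights
        if layer < edge or layer >= n_layers - edge:
            return "Q3_K"                        # edge layers next to embeddings / output head
        if layer < near_edge or layer >= n_layers - near_edge:
            return "IQ2_S"                       # near-edge layers
        return "IQ2_XXS"                         # mid layers, ~2.06 bpw
    return "Q4_K"                                # attention / norms / dense FFN: assumed default

# e.g. nano_quant_type("blk.24.ffn_down_exps.weight", 24, 48) -> "IQ2_XXS"
```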
Links
- Collection: https://huggingface.co/collections/mudler/apex-quants-gguf
- Project + paper: https://github.com/mudler/apex-quant
If you've used APEX quants and have feedback, comments welcome!
inddiepack@reddit
Thank you for your work, mudler! I've been using your quants since gemma 4 came out, and they're now my go-to quants for both gemma 4 and qwen 3.6 MoE models.
RelicDerelict@reddit
Is APEX better than Autoround?
mudler_it@reddit (OP)
Can't answer that, as I don't like to talk about what I haven't personally tried. I benchmarked only against Unsloth and bartowski quants; against those I can tell it holds long context much better and does better on coding agent tasks.
NicholasCureton@reddit
I've been using Qwen3.6-35B-A3B-APEX-I-Compact.gguf on my 8GB VRAM / 16GB RAM machine at 29-37 t/s. It's coded an entire multi-model agentic chat inference CLI client app for me. I've also tried Qwen3.6-35B-A3B-UD-Q5_K_M.gguf on my office PC, which has 12GB VRAM. Somehow it feels like your quant makes fewer errors in coding. I don't know what you did, but for coding it's really good.
letsgoiowa@reddit
Any plans to post direct to Ollama for those who are still working on getting llama.cpp configured?
mudler_it@reddit (OP)
not really fond of ollama. I'd suggest running these on LocalAI :P
scarydeaddan@reddit
MiniMax M2.7 APEX Mini on my strix halo box has been great for coding tasks. obv slow with the low memory bandwidth, but context holds up and the output is very usable.
mudler_it@reddit (OP)
Glad to hear!
Bulky-Priority6824@reddit
I was a big fan of the 3.5 35B-A3B, but unfortunately I've had to stick with Unsloth for 3.6 because I couldn't get the chat template to stop sending think tags to Frigate, which I also use the LLM with for GenAI on security cameras.
mudler_it@reddit (OP)
this is quite weird - what/how are you running it?
Hot_Turnip_3309@reddit
these APEX models are great!
mudler_it@reddit (OP)
Thanks! Glad to hear!
SirDomz@reddit
Any plans to do mlx? Would love to compare those with the Oq quants by OMLX /Jundot
mudler_it@reddit (OP)
I'd give it a shot, but MLX has far less fine-grained support for quantization schemes. I'm monitoring the MLX ecosystem closely and will push quants when there's feature parity.
fatboy93@reddit
Yeah, how does this fare against Optiq as well?
No_Algae1753@reddit
Just a question: How is it possible that even with a slightly higher kld your models beat unsloth models in benchmarks?
Top-Rub-4670@reddit
Unsloth's models are optimized specifically to get that low kld.
Presumably/hopefully APEX is optimized for real-world performance and not to look good on charts for marketing purposes.
mudler_it@reddit (OP)
Thanks for calling this out - I carefully engineered APEX quants around real use cases rather than pumping benchmarks. I'm glad it shows.
mudler_it@reddit (OP)
it is very comparable in terms of KLD - but if you look closer at the numbers, KLD Max favors APEX by a bigger margin. That's a better signal than worrying about a 0.0001 difference :-)
sanjxz54@reddit
u/mudler_it Please fix the Nemotron 3 120B APEX-I-Mini quant when you can; its weights are only 9GB. Amazing work otherwise, love the 3.6 35B for coding.
mudler_it@reddit (OP)
Good catch and thanks for flagging this. For >=120B I need to rent a GPU, and that's a bit out of reach right now. We had a donor who gave me access to GPUs for a while, but I'm back to donation-only capacity now and that still doesn't cut it.
sterby92@reddit
I just tried a few prompts with Qwen3.6-35B-A3B-APEX-GGUF:APEX-I-Balanced, around coding, web research, and agentic tool use. And as far as I can tell it feels noticeably better than the unsloth Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL quant. It is 2GB larger, but still runs around 5 tok/sec faster, 55 tok/sec -> 60 tok/sec on strix halo.
So looks and feels great! I'll continue using it and see how it goes :)
Evening_Ad6637@reddit
I can confirm this. I've been using APEX-I-Balanced since the day mudler released it and I am so impressed; it's much better than anything I've tried (regarding quants) and, as you say, it's somehow faster than other quants as well.
I only have one logical explanation: this must be black magic
mudler_it@reddit (OP)
apex quants are my daily driver now too :) It's amusing to see how many others call this out as black magic, it kinda feels like it!
Thanks for the feedback!
metmelo@reddit
I tested Qwen Coder Next some time ago and felt the same way.
Substantial_Step_351@reddit
Makes sense, especially the shared expert precision theory. Rare tokens route there more often, and they carry the long-range signal that uniform quant normally flattens. The thing I'd like to see tested is whether the KL99% advantage holds on actual document-length tasks vs. synthetic needle-in-a-haystack, since those usually diverge. More than most quant strategies even have a testable hypothesis tbh
BitGreen1270@reddit
Man thank you for doing this! Going to give gemma4 a spin today. Question - do we stick to the recommended sampling params published by Google? Also how do these hold up with kv quantization to q8?
horeaper@reddit
I still don't understand the difference between Quality and Balanced. If I'm for coding, which one should I use?
pmttyji@reddit
Both the GitHub repo & the model cards have those details.
horeaper@reddit
It basically says nothing. How do you compare "Best overall" and "Highest quality"?
pmttyji@reddit
The GitHub repo version is better; see section 3, "Five tiers".
horeaper@reddit
Ohh, ok, this is a lot better. So I assume Quality and Balanced are on the same level, except Quality is smaller/slower and Balanced is bigger/faster. Need to do some testing after my new GPU arrives >_<
pmttyji@reddit
Since you created GGUFs for early models like Qwen3-Coder-30B, I have a request for a few early & recent models. Please create GGUFs if possible. Thanks on behalf of all.
AffectionateOcelot7@reddit
I don't see any Qwen 3.6 numbers; your KLD and perplexity numbers are only posted for 3.5. So are you assuming the numbers are the same for 3.6, or is there another benchmark that I'm missing?
streppelchen@reddit
found the models by accident, will still need to give them a try, but i like the idea, keep it up :)