mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-GGUF just released!
Posted by PhotographerUSA@reddit | LocalLLaMA | 39 comments
Description of the model:
I host 30+ free APEX MoE quantizations as independent research. My only local hardware is an NVIDIA DGX Spark (122 GB unified memory), enough for ~30-50B-class MoEs, but bigger ones (200B+) require rented compute on H100/H200/Blackwell, typically $20-100 per quant.
If APEX quants are useful to you, your support directly funds those bigger runs.
Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled — APEX-MTP GGUF
APEX (Adaptive Precision for EXpert Models) quantizations of lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled, with the MTP (multi-token prediction) head bundled for in-the-box self-speculative decoding.
What's different from the plain APEX repo?
These GGUFs bundle the model's MTP (multi-token prediction) head alongside the trunk in a single file, courtesy of llama.cpp PR #22673. With a recent llama.cpp (>= commit 255582687) you can enable self-speculative decoding using just this one file — no separate draft model needed:
llama-server -m Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-I-Balanced.gguf --draft-mtp
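For readers new to the idea, here is a toy sketch of what one draft-then-verify step looks like with greedy acceptance. It is not llama.cpp's implementation: `draft_next` and `target_next` are hypothetical stand-ins for the MTP head and the trunk, and a real engine verifies all drafted tokens in one batched forward pass rather than one call per position.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(context: List[int], draft_next: NextToken,
                     target_next: NextToken, n_draft: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens to append."""
    # 1) The cheap drafter proposes n_draft tokens autoregressively.
    drafted: List[int] = []
    ctx = list(context)
    for _ in range(n_draft):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2) The target checks each drafted token against its own greedy choice.
    accepted: List[int] = []
    ctx = list(context)
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token, discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) Every draft was accepted, so the target adds one bonus token.
    accepted.append(target_next(ctx))
    return accepted


# Toy "models": the target counts upward mod 10; the drafter agrees
# except right after a 3, where it guesses wrong.
def target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def flawed_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step([0], flawed_draft, target, n_draft=4))
```

With greedy acceptance the emitted tokens are exactly what the target alone would have produced; the drafter only changes how many target passes are needed.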
The non-MTP version is still available at mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-GGUF — slightly smaller, but no self-spec.
File sizes
Each quant is ~2.5% larger than its non-MTP counterpart (one extra transformer block's worth of weights; no embedding duplication, since MTP shares the trunk's embed_tokens).
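The ~2.5% figure lines up with a back-of-envelope check: one extra block on top of 40 trunk layers. The sketch below assumes every block has the same size at the same precision and ignores embeddings and the output head, so it is only a rough sanity check, not the actual per-file accounting.

```python
def mtp_overhead(trunk_layers: int, mtp_layers: int = 1) -> float:
    """Fractional size increase from bundling extra MTP blocks.

    Assumption: all blocks are equally large and equally quantized.
    """
    if trunk_layers <= 0:
        raise ValueError("trunk_layers must be positive")
    return mtp_layers / trunk_layers


# 40 trunk layers + 1 MTP layer, as listed in the model card.
print(f"{mtp_overhead(40):.1%}")
```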
MTP draft head precision
The bundled MTP head (blk.40.* including the nextn.* projection + norms) is quantized to Q8_0 (near-lossless) on every tier except I-Nano. I-Nano keeps the trunk-tier precision on the MTP block (Q3_K routed experts, Q4_K attention) but pins blk.40.nextn.eh_proj to Q4_K — see the explainer below.
This keeps draft accuracy high (important for spec-decode acceptance rate) at a modest ~1 GB cost per file vs. trunk-tier precision.
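To see why acceptance rate matters so much, a common simplification is to treat each drafted token as accepted independently with a fixed probability. Under that assumption (which real workloads only approximate), the expected number of tokens emitted per target verification pass can be sketched as follows; `expected_tokens_per_pass` is an illustrative helper, not part of llama.cpp.

```python
def expected_tokens_per_pass(accept_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumes each drafted token is accepted independently with
    probability accept_rate, and that the target always contributes
    one token itself (the correction or the bonus token).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        return float(n_draft + 1)
    # Geometric series: 1 + a + a^2 + ... + a^n_draft
    return (1.0 - accept_rate ** (n_draft + 1)) / (1.0 - accept_rate)


for rate in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_pass(rate, n_draft=4)
    print(f"accept={rate:.1f} -> {tokens:.2f} tokens/pass")
```

This is not a wall-clock speedup: the drafting passes themselves cost time, so the real gain is lower than the tokens-per-pass figure suggests.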
Why the MTP head doesn't use imatrix
llama-imatrix runs normal forward passes that only activate the trunk (blk.0..blk.39). The MTP head only fires during --draft-mtp spec decoding, so its tensors get no imatrix activation data. We work around this by quantizing the MTP head with static K-quant / Q8_0 which doesn't require imatrix.
(A patch to llama-imatrix that records MTP activations during collection is in progress at mudler/llama.cpp#mtp-imatrix — once upstream this will let us push the drafter to lower bit-widths cleanly.)
What is APEX?
APEX is a MoE-aware mixed-precision quantization strategy. Per-tensor-role gradient: routed experts compress hardest, shared experts kept high (always active), attention/Mamba uniform; 5+5 symmetric edge gradient across the 40 trunk layers + MTP layer 40 at edge precision. I-variants use diverse imatrix calibration (chat, code, reasoning, tool-calling, agentic traces, Wikipedia).
- Base: Qwen 3.6 35B-A3B family (Qwen3_5MoeForCausalLM)
- Layers: 40 trunk + 1 MTP (bundled)
- Experts: 256 routed + 1 shared (8 active per token)
- Hidden size: 2048
- Calibration: v1.3 diverse dataset
leonbollerup@reddit
Thanks, we are now at 50+ distills... and it's hard to see where these are better
Sisaroth@reddit
I've tried so many different quants now and I keep going back to byteshape/Qwen3.6-35B-A3B-GGUF:Qwen3.6-35B-A3B-IQ4_XS-4.15bpw.
Everything else I tried is both slower and less quality (much more looping, slower to find good coding solutions to my prompts).
Just today I have been trying mradermacher/Carnice-Qwen3.6-MoE-35B-A3B-i1-GGUF:IQ4_XS. In theory it looks like it should be a very similar model, it's even bigger but it was complete shit compared to the byteshape quant, constantly started looping over trivial issues.
Jipok_@reddit
That’s weird. For me, the Byteshape quant (35B-A3B-IQ4_XS-4.15bpw) actually performed worse than Unsloth’s. It would spit out tens of thousands of tokens of reasoning and eventually spiral into an endless loop. I even tried disabling KV cache compression altogether, but it didn't help.
To be fair, Unsloth also suffers from this looping issue and generally gives low-quality outputs on my specific tasks. That's why I ended up going back to Gemma. With Gemma, I don't deal with those endless miles of reasoning, and its KV cache footprint is so small that I don't even have to quantize it.
Maybe once I finally set up my second 3090 and can run higher-precision quants of Qwen, things will change. But for now, on a single 3090, Gemma is just a much better fit.
enrique-byteshape@reddit
Hey! What hyperparams did you use to run our model? This release seems to be a bit more finicky in that sense, as qwen 3.6 in general is more finicky. Sorry to hear it hasn't worked out
Jipok_@reddit
-m byteshape_Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf -ngl all -fa on -ub 1024 -c 100000 --parallel 2 --host 0.0.0.0 --webui-mcp-proxy --cache-ram 0 --spec-type draft-mtp --spec-draft-n-max 4
enrique-byteshape@reddit
Try these as well, it tends to overthink:
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.05 --presence-penalty 0.5 --repeat-penalty 1.05
Jipok_@reddit
What would you recommend for KV cache compression? I assume Qwen’s KV cache is significantly "fatter" than Gemma's.
My goal is to run at least 3 parallel requests with ~50k context each on a single 3090.
enrique-byteshape@reddit
We don't work with KV cache compression for now, so we don't know the effect it has on our models, but in theory Q8 for KV should be fine in terms of quality. Going to Q6 should also be okay-ish, but each layer will have their own tolerances, so who knows
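For sizing a target like 3 parallel requests at ~50k context each, a rough KV cache estimate can be sketched as below. The architecture numbers are placeholders, not this model's real config (read the actual values from the GGUF metadata), and only layers that keep a per-token KV cache should be counted; layers with a fixed-size recurrent state (the model card mentions attention/Mamba tensors) are not covered by this formula.

```python
F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # ggml Q8_0: 32 int8 values plus one fp16 scale per block


def kv_cache_bytes(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: float) -> float:
    """Approximate KV cache size for n_tokens cached tokens in total."""
    # Factor 2: one K and one V vector per token, per layer, per KV head.
    return 2 * n_attn_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


# PLACEHOLDER architecture values, for illustration only.
N_ATTN_LAYERS = 10
N_KV_HEADS = 2
HEAD_DIM = 256

total_tokens = 3 * 50_000  # 3 parallel requests, ~50k context each

for name, width in (("f16", F16_BYTES), ("q8_0", Q8_0_BYTES)):
    size = kv_cache_bytes(N_ATTN_LAYERS, N_KV_HEADS, HEAD_DIM, total_tokens, width)
    print(f"{name}: {size / 2**30:.2f} GiB")
```

Swapping in the real layer count, KV head count, and head dimension gives a quick way to check whether a given cache type fits next to the weights before launching the server.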
vastaaja@reddit
Would you mind sharing the model and command you use for Gemma?
Jipok_@reddit
llama-server -m /hdd/gguf/unsloth_gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -ngl all -fa on -ub 1024 -c 225000 --parallel 3 --host 0.0.0.0 --no-context-shift --webui-mcp-proxy
vastaaja@reddit
Thank you, I'll try it out. I've been mostly using qwen 3.6 27b and 35b-a3b, and am curious to see the difference.
ghostynewt@reddit
Byteshape’s largest is also the one I’ve settled on. It’s a good model
rm-rf-rm@reddit
and do any have a distilled dataset that is meaningfully large to even make a difference?
KURD_1_STAN@reddit
I hope all release what they trained on so we can combine all and get actual improvement cause u ain't getting any improvement from 1000 chats
sagiroth@reddit
Most of the time they're only benchmaxxing. We need a proper check
crossoverXYZ@reddit
been running this locally and ngl the quality jump is real. quantized to Q5_K_M and it barely loses anything
adam_suncrest@reddit
how did you get your hands on a Q5_K_M quant? there is no such quant available in the hf repo, I think APEX does things differently than good ol' Q[N] quants
themixtergames@reddit
"tbh" is a new tell for May 2026
CheatCodesOfLife@reddit
and "ngl"
Thin_Pollution8843@reddit
These are very weak models. I prefer Qwenum3.6-29B-PROFESSIONAL-OpusKiller-BENTLEY-ReasonablyUnreasoble-UNLEASHED-DenseAF-LGBTQ-TrumpNo1-RAPPER-DieSamAltman-ISPENDTIMEONUSELLESDISSTILLMODELS-AbsolutelyNotCringe-MTP-GGUF
ghostynewt@reddit
Not a heretic abliteration quant? Lame
PhotographerUSA@reddit (OP)
Can this work on my 8GB video card with 64GB of ram? 4bit or 3bit?
1998marcom@reddit
I guess the jump from 27B to 29B was for storing the model's name.
Ok_Needleworker_6431@reddit
What do you do with these setups? Coding tasks? I'm super curious to learn how folks have actually been using these local models in their workflows.
the-username-is-here@reddit
Duh, post on Reddit of course.
These models are benchmaxxed and overhyped by "mum bought me a GPU" AI engineers. When you try to use these models for anything serious, they fall apart.
Pauldb@reddit
Not true. I've been using this very specific model for nearly two weeks, and it's very good: better than the base 35B-A3B, in that the quantization is done smarter (think dynamic bitrate in MP3: APEX gives more precision to the parts/layers that need it more, and less to others), which makes it small but better quality than non-APEX models of the same size. The Opus distill improves it only ever so slightly, but that is very useful in agentic scenarios, where it will now (but not before) create a list of how it plans to solve the problem, decompose it, and go through it, à la Claude. Finally, the MTP is excellent, as it can bring a 30 to 90% increase in token generation speed, for only one gig more!
This guy gets it, and is doing fantastic work!
Internal_Werewolf_48@reddit
Ability to plan and work through tasks in a plan is more about the harness and quant than a model fine tune. I’ve been using Qwen3.6 35B @ Q6 with Hermes and Qwen Code CLI for weeks now and it’s handled plans just fine.
old-mike@reddit
Thank you very much. These work really fast in my setup.
CYTR_@reddit
It's still a bit embarrassing these fine-tunes whose usefulness is more than questionable and whose names suggest the author had a stroke.
Velocita84@reddit
And apex quants suck too
christianweyer@reddit
https://huggingface.co/mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-GGUF
rawdikrik@reddit
I might be dumb, but I don't see the link to the MTP version
Potential-Leg-639@reddit
It's a very, very confusing post indeed
IGZ0@reddit
Hopefully it didn't learn Claude 4.7's tendency to just go "Let's do it in another session" after 5 prompts
Agent007_MI9@reddit
The Qwen3 MoE base is already punching well above its active parameter weight, so distilling Claude 4.7 Opus reasoning patterns into it is a pretty clever combination. The A3B active params at inference should keep it snappy on consumer hardware. Curious how the reasoning traces compare to straight Qwen3-235B-A22B on multi-step coding tasks -- anyone benchmarked it yet?
10F1@reddit
First, thanks for your work!
How do the quality and speed of the model compare to something like Unsloth?
PhotographerUSA@reddit (OP)
This isn't my work; I just think the model looks promising.
FastHotEmu@reddit
I see APEX, I upvote and save to try later 😄
Snoo_27681@reddit
Thanks for this, I'll try it. Why do you do the distilled models with only Opus 4.7 and K2?