mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-GGUF just released!

Posted by PhotographerUSA@reddit | LocalLLaMA | 39 comments

Description of the model:

I host 30+ free APEX MoE quantizations as independent research. My only local hardware is an NVIDIA DGX Spark (122 GB unified memory), enough for ~30-50B-class MoEs, but bigger ones (200B+) require rented compute on H100/H200/Blackwell, typically $20-100 per quant.
If APEX quants are useful to you, your support directly funds those bigger runs.

Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled — APEX-MTP GGUF

APEX (Adaptive Precision for EXpert Models) quantizations of lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled, with the MTP (multi-token prediction) head bundled for in-the-box self-speculative decoding.

What's different from the plain APEX repo?

These GGUFs bundle the model's MTP (multi-token prediction) head alongside the trunk in a single file, courtesy of llama.cpp PR #22673. With a recent llama.cpp (>= commit 255582687) you can enable self-speculative decoding using just this one file — no separate draft model needed:

llama-server -m Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-I-Balanced.gguf --draft-mtp

The non-MTP version is still available at mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-GGUF — slightly smaller, but no self-spec.

File sizes

Each quant is ~2.5% larger than its non-MTP counterpart (one extra transformer-block worth of weights, no embedding duplication since MTP shares the trunk's embed_tokens).
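
The roughly 2.5% figure matches a back-of-the-envelope count of one extra block on top of 40 trunk blocks. A minimal sketch of that arithmetic, assuming every block is about the same size on disk and ignoring embeddings and per-tensor precision differences (the function name is made up for illustration):

```python
def mtp_size_overhead(trunk_blocks: int, extra_blocks: int = 1) -> float:
    """Fractional size increase from appending extra blocks to the trunk.

    Assumes every block has roughly the same size on disk and that
    non-block tensors (embeddings, output head) are negligible.
    """
    if trunk_blocks <= 0:
        raise ValueError("trunk_blocks must be positive")
    return extra_blocks / trunk_blocks


if __name__ == "__main__":
    # 40 trunk layers + 1 bundled MTP layer, as listed in the model card.
    print(f"{mtp_size_overhead(40):.1%}")
```

The real overhead will drift from this estimate because the MTP block is stored at a different precision than most trunk blocks.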

MTP draft head precision

The bundled MTP head (blk.40.* including the nextn.* projection + norms) is quantized to Q8_0 (near-lossless) on every tier except I-Nano. I-Nano keeps the trunk-tier precision on the MTP block (Q3_K routed experts, Q4_K attention) but pins blk.40.nextn.eh_proj to Q4_K — see the explainer below.

This keeps draft accuracy high (important for spec-decode acceptance rate) at a modest ~1 GB cost per file vs. trunk-tier precision.
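
To see why acceptance rate matters so much, here is a rough sketch of the expected number of tokens produced per verification pass. It assumes each drafted token is accepted independently with the same probability, which is a textbook simplification and not something measured on this model. The function name and the example rates are illustrative only.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Assumes each drafted token is accepted independently with probability
    `acceptance_rate`, and that a rejected or exhausted draft still yields
    one token from the main model.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    # Sum of a^0 + a^1 + ... + a^draft_len.
    return sum(acceptance_rate ** i for i in range(draft_len + 1))


if __name__ == "__main__":
    for rate in (0.5, 0.7, 0.9):
        print(rate, round(expected_tokens_per_step(rate, draft_len=4), 2))
```

This counts tokens per pass, not wall-clock speedup: the cost of running the draft head itself is ignored, so real gains will be smaller.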

Why the MTP head doesn't use imatrix

llama-imatrix runs normal forward passes that only activate the trunk (blk.0..blk.39). The MTP head only fires during --draft-mtp spec decoding, so its tensors get no imatrix activation data. We work around this by quantizing the MTP head with static K-quant / Q8_0 which doesn't require imatrix.

(A patch to llama-imatrix that records MTP activations during collection is in progress at mudler/llama.cpp#mtp-imatrix — once upstream this will let us push the drafter to lower bit-widths cleanly.)

What is APEX?

APEX is a MoE-aware mixed-precision quantization strategy. Per-tensor-role gradient: routed experts compress hardest, shared experts kept high (always active), attention/Mamba uniform; 5+5 symmetric edge gradient across the 40 trunk layers + MTP layer 40 at edge precision. I-variants use diverse imatrix calibration (chat, code, reasoning, tool-calling, agentic traces, Wikipedia).
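
One possible reading of the "5+5 symmetric edge" rule, sketched as code. This is an interpretation for illustration, not the actual APEX implementation: the post does not say which quant types each zone receives, and "gradient" may mean a graded step-down across the edge layers rather than the binary split shown here. The helper name is hypothetical.

```python
def layer_zone(layer: int, trunk_layers: int = 40, edge: int = 5) -> str:
    """Label a layer index as 'edge' or 'middle' under a symmetric edge rule.

    Hypothetical helper: the first `edge` and last `edge` trunk layers count
    as edge layers, and the MTP layer (index == trunk_layers) is treated as
    an edge layer as well.
    """
    if not 0 <= layer <= trunk_layers:
        raise ValueError("layer index out of range")
    if layer == trunk_layers:            # bundled MTP layer, blk.40 here
        return "edge"
    if layer < edge or layer >= trunk_layers - edge:
        return "edge"
    return "middle"


if __name__ == "__main__":
    edge_layers = [i for i in range(41) if layer_zone(i) == "edge"]
    print(edge_layers)
```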

Architecture

Base: Qwen 3.6 35B-A3B family (Qwen3_5MoeForCausalLM)
Layers: 40 trunk + 1 MTP (bundled)
Experts: 256 routed + 1 shared (8 active per token)
Hidden size: 2048
Calibration: v1.3 diverse dataset
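
The expert counts above explain why so little of the model runs per token. A small sketch of the arithmetic, assuming "8 active" means 8 routed experts plus the always-on shared expert (the post does not spell this out; the function name is illustrative):

```python
def active_expert_fraction(routed: int, shared: int, routed_active: int) -> float:
    """Fraction of expert MLPs that run for a single token in one MoE layer."""
    total = routed + shared
    active = routed_active + shared      # shared experts always run
    return active / total


if __name__ == "__main__":
    fraction = active_expert_fraction(routed=256, shared=1, routed_active=8)
    print(f"{fraction:.1%} of experts active per token")
```

The share of active parameters implied by the "35B-A3B" name is higher than this expert fraction, since attention, embeddings, and other non-expert weights are used for every token.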

Jipok_@reddit

That’s weird. For me, the Byteshape quant (35B-A3B-IQ4_XS-4.15bpw) actually performed worse than Unsloth’s. It would spit out tens of thousands of tokens of reasoning and eventually spiral into an endless loop. I even tried disabling KV cache compression altogether, but it didn't help.
To be fair, Unsloth also suffers from this looping issue and generally gives low-quality outputs on my specific tasks. That's why I ended up going back to Gemma. With Gemma, I don't deal with those endless miles of reasoning, and its KV cache footprint is so small that I don't even have to quantize it.
Maybe once I finally set up my second 3090 and can run higher-precision quants of Qwen, things will change. But for now, on a single 3090, Gemma is just a much better fit.

enrique-byteshape@reddit

Hey! What hyperparams did you use to run our model? This release seems to be a bit more finicky in that sense, as Qwen 3.6 in general is more finicky. Sorry to hear it hasn't worked out.

Jipok_@reddit

-m byteshape_Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf -ngl all -fa on -ub 1024 -c 100000 --parallel 2 --host 0.0.0.0 --webui-mcp-proxy --cache-ram 0 --spec-type draft-mtp --spec-draft-n-max 4

enrique-byteshape@reddit

Try these as well, it tends to overthink:
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.05 --presence-penalty 0.5 --repeat-penalty 1.05
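
For readers unfamiliar with these knobs, here is a simplified sketch of what temperature, top-k, top-p, and min-p do to the candidate list. It is a teaching toy under stated assumptions, not llama.cpp's sampler chain: the stage order is assumed, probabilities are not renormalized between stages, the penalties are left out, and the input is assumed to be non-empty. Function names are made up for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the peak."""
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def surviving_tokens(probs, top_k=20, top_p=0.95, min_p=0.05):
    """Indices left after top-k, then top-p, then min-p filtering."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]                  # top-k: keep the k most likely
    kept, mass = [], 0.0
    for i in ranked:                         # top-p: shortest prefix reaching top_p
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    floor = min_p * probs[kept[0]]           # min-p: relative to the best token
    return [i for i in kept if probs[i] >= floor]


if __name__ == "__main__":
    probs = softmax([4.0, 3.0, 1.5, 0.5, 0.0], temperature=0.6)
    print(surviving_tokens(probs))
```

Tighter settings shrink the candidate list, which is one way to rein in a model that rambles or loops.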

Jipok_@reddit

What would you recommend for KV cache compression? I assume Qwen’s KV cache is significantly "fatter" than Gemma's.

My goal is to run at least 3 parallel requests with ~50k context each on a single 3090.

enrique-byteshape@reddit

We don't work with KV cache compression for now, so we don't know the effect it has on our models, but in theory Q8 for KV should be fine in terms of quality. Going to Q6 should also be okay-ish, but each layer will have its own tolerances, so who knows
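
A rough way to size the KV cache for a target like 3 requests at about 50k context each. The per-element byte counts follow from the ggml block layouts (Q8_0 stores 32 values in 34 bytes, Q4_0 stores 32 values in 18 bytes). The layer, head, and dimension numbers below are placeholders and are not taken from the Qwen3.6 config, so plug in the real values before trusting the output.

```python
# Bytes per stored element for a few ggml cache types.
# Q8_0 packs 32 values into 34 bytes (32 int8 + one fp16 scale);
# Q4_0 packs 32 values into 18 bytes (16 bytes of nibbles + one fp16 scale).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(tokens, attn_layers, kv_heads, head_dim, cache_type="f16"):
    """Approximate KV cache size: keys + values for every attention layer."""
    per_token_elements = 2 * attn_layers * kv_heads * head_dim  # 2 = K and V
    return tokens * per_token_elements * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    # PLACEHOLDER architecture numbers, not taken from the Qwen3.6 config.
    layers, heads, dim = 10, 2, 256
    total_tokens = 3 * 50_000            # 3 parallel requests x ~50k context
    for cache_type in BYTES_PER_ELEMENT:
        gib = kv_cache_bytes(total_tokens, layers, heads, dim, cache_type) / 2**30
        print(f"{cache_type}: {gib:.2f} GiB")
```

Only attention layers count toward `attn_layers`. In a hybrid model that mixes attention with other layer types, the non-attention layers keep their own state and do not grow with context in the same way.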

vastaaja@reddit

Would you mind sharing the model and command you use for Gemma?

Jipok_@reddit

llama-server -m /hdd/gguf/unsloth_gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -ngl all -fa on -ub 1024 -c 225000 --parallel 3 --host 0.0.0.0 --no-context-shift --webui-mcp-proxy

vastaaja@reddit

Thank you, I'll try it out. I've been mostly using qwen 3.6 27b and 35b-a3b, and am curious to see the difference.

leonbollerup@reddit

Thanks, we are now at 50+ distills... and it’s hard to see where these are better

Sisaroth@reddit

I've tried so many different quants now and I keep going back to byteshape/Qwen3.6-35B-A3B-GGUF:Qwen3.6-35B-A3B-IQ4_XS-4.15bpw.

Everything else I tried is both slower and lower quality (much more looping, slower to find good coding solutions to my prompts).

Just today I have been trying mradermacher/Carnice-Qwen3.6-MoE-35B-A3B-i1-GGUF:IQ4_XS. In theory it should be a very similar model, and it's even bigger, but it was complete shit compared to the byteshape quant: it constantly started looping over trivial issues.

ghostynewt@reddit

Byteshape’s largest is also the one I’ve settled on. It’s a good model

rm-rf-rm@reddit

and do any have a distilled dataset that is meaningfully large to even make a difference?

KURD_1_STAN@reddit

I hope everyone releases what they trained on so we can combine it all and get actual improvement, because you ain't getting any improvement from 1000 chats

sagiroth@reddit

Most of the time they are only benchmaxxing. We need a proper check

crossoverXYZ@reddit

been running this locally and ngl the quality jump is real. quantized to Q5_K_M and it barely loses anything

adam_suncrest@reddit

how did you get your hands on a Q5_K_M quant? there is no such quant available in the hf repo, I think APEX does things differently than good ol' Q[N] quants

themixtergames@reddit

"tbh" is a new tell for May 2026

CheatCodesOfLife@reddit

and "ngl"

Thin_Pollution8843@reddit

These are very weak models. I prefer Qwenum3.6-29B-PROFESSIONAL-OpusKiller-BENTLEY-ReasonablyUnreasoble-UNLEASHED-DenseAF-LGBTQ-TrumpNo1-RAPPER-DieSamAltman-ISPENDTIMEONUSELLESDISSTILLMODELS-AbsolutelyNotCringe-MTP-GGUF

Not a heretic abliteration quant? Lame

PhotographerUSA@reddit (OP)

Can this work on my 8GB video card with 64GB of ram? 4bit or 3bit?

1998marcom@reddit

I guess the jump from 27B to 29B was for storing the model's name.

Ok_Needleworker_6431@reddit

What do you do with these setups? Coding tasks? I'm super curious to learn how folks have actually been using these local models in their workflows.

the-username-is-here@reddit

Duh, post on Reddit of course.

These models are benchmaxxed and overhyped by "mum bought me a GPU" AI engineers. When you try to use these models for anything serious, they fall apart.

Pauldb@reddit

Not true. I've been using this very specific model for nearly two weeks, and it's very good. It's better than the base 35B-A3B in that the quantization is done smarter (think dynamic bitrate in MP3: APEX gives more precision to the parts/layers that need it more, and less to others), which makes it small but better quality than non-APEX models of the same size. The Opus distill improves it ever so slightly, but it's very useful in agentic scenarios, where it will now, but not before, create a list of how it plans to solve the problem, decompose it, and go through it, à la Claude. Finally, the MTP is excellent, as it can bring a 30 to 90 percent increase in token generation speed, for only one gig more!

This guy gets it, and is doing fantastic work!

Internal_Werewolf_48@reddit

Ability to plan and work through tasks in a plan is more about the harness and quant than a model fine tune. I’ve been using Qwen3.6 35B @ Q6 with Hermes and Qwen Code CLI for weeks now and it’s handled plans just fine.

old-mike@reddit

Thank you very much. These work really fast in my setup.

CYTR_@reddit

It's still a bit embarrassing, these fine-tunes whose usefulness is more than questionable and whose names suggest the author had a stroke.

Velocita84@reddit

And apex quants suck too

christianweyer@reddit

https://huggingface.co/mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-GGUF

rawdikrik@reddit

I might be dumb, but I don't see the link to the MTP version

Potential-Leg-639@reddit

It's a very, very confusing post indeed

IGZ0@reddit

Hopefully it didn't learn Claude 4.7's tendency to just go "Let's do it in another session" after 5 prompts

Agent007_MI9@reddit

The Qwen3 MoE base is already punching well above its active parameter weight, so distilling Claude 4.7 Opus reasoning patterns into it is a pretty clever combination. The A3B active params at inference should keep it snappy on consumer hardware. Curious how the reasoning traces compare to straight Qwen3-235B-A22B on multi-step coding tasks -- anyone benchmarked it yet?

10F1@reddit

First, thanks for your work!

How does the quality and speed of the model compare to something like unsloth?

This isn't my work; I just think the model looks promising.

FastHotEmu@reddit

I see APEX, I upvote and save to try later 😄

Snoo_27681@reddit

Thanks for this, I'll try it. Why do you distill models with only Opus 4.7 and K2?