Need advice: Qwen3.6 27B MTP or 35B-A3B MoE MTP on 16GB VRAM (RTX 5080)?
Posted by craftogrammer@reddit | LocalLLaMA | View on Reddit | 9 comments
Hey folks, looking for advice before I delete or keep a huge model file.
I’m testing local coding/agentic workflows on an RTX 5080 16GB + 96GB RAM. I already have Qwen3.6-35B-A3B-MTP running with the llama.cpp MTP branch natively on Windows, using CPU expert offload.
Current A3B setup:
Qwen3.6-35B-A3B-MTP Q8_0 GGUF --fit on --fit-target 1536 --n-cpu-moe 34 -c 232144 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --batch-size 2048 --ubatch-size 1024 --cache-ram -1 --checkpoint-every-n-tokens 8192 --spec-type mtp --spec-draft-n-max 2
At my previous ~196K context setting, around 118K active prompt, I was seeing roughly ~1178 tok/s prefill and ~32 tok/s decode. Follow-ups around 118K–143K active prompt were usually ~32–37 tok/s when MTP acceptance was good. DraftN=3 worked, but over-drafted too often at deep context, so DraftN=2 became my stable setting.
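To sanity-check the DraftN choice, here's a toy Python model of speculative-decoding throughput; the acceptance rates and verification overhead below are made-up illustrative numbers, not measurements from my runs:

```python
# Toy model of MTP/speculative decoding throughput (my rough mental model of
# the standard accept/reject scheme, not llama.cpp's actual implementation).
# p = chance each drafted token is accepted, n = draft length (DraftN).

def expected_tokens_per_step(p: float, n: int) -> float:
    """Accepted prefix of the n drafted tokens, plus the one token the main
    model always produces on the verification pass."""
    return sum(p**k for k in range(1, n + 1)) + 1

def relative_decode_speed(p: float, n: int, verify_overhead: float = 0.15) -> float:
    """Speedup vs plain decoding, assuming one verification pass costs about one
    decode step plus a small per-draft overhead (hypothetical 15% per draft)."""
    step_cost = 1.0 + verify_overhead * n
    return expected_tokens_per_step(p, n) / step_cost

for p in (0.8, 0.6, 0.4):          # acceptance tends to drop at deep context
    for n in (2, 3):
        print(f"p={p:.1f} n={n}: ~{relative_decode_speed(p, n):.2f}x")
```

With these made-up numbers, DraftN=3 only wins while acceptance stays high, and DraftN=2 pulls ahead once it drops, which lines up with what I see at deep context.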
Now I’m testing 232K context with the same A3B setup.
I downloaded the new Qwen3.6-27B dense MTP grafted GGUF / UD XL model too, but it’s around 30GB and I only have ~4GB left on my C drive. Before I delete something or keep both, I’m trying to understand if people with similar hardware have actually compared these.
Question: on 16GB VRAM + lots of system RAM, would you keep testing Qwen3.6-27B dense MTP, or stick with Qwen3.6-35B-A3B MoE + CPU expert offload + MTP?
I’m especially interested in real experience at 100K+ active prompt, not just short-prompt tok/s.
Things I’m trying to understand:
- Does 27B dense MTP actually beat 35B-A3B MTP + CPU expert offload on 16GB VRAM?
- At deep context, does dense 27B feel smoother, or does A3B still win because active params are much lower?
- For sustained coding-agent use, is dense consistency better than MoE active-param efficiency?
- If you tested both, which one would you keep if disk space was tight?
I’m not trying to win a benchmark. I care about speed, context, and coding quality for long-running local agent work, tool usage, etc.
PositiveBit01@reddit
I use 35b-a3b. Even at q4, 27b probably won't completely fit in your gpu.
Obviously 35b is bigger, but it's also a MoE model, which is less impacted by splitting across gpu/cpu. It's ok if some spills. It only has 3b active parameters, so it's ~9x faster, and some of the experts are shared or more common and used more frequently; if you use llama.cpp with --fit on, it will try to put the more important ones on your gpu first. All that to say, 35b will feel a lot better for you. It'll be much, much faster - faster than it feels like it should be.
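Rough back-of-envelope in Python for why the active-parameter count dominates once things spill to system RAM; the bandwidth figures, split fractions, and bytes-per-weight here are illustrative guesses, not measurements:

```python
# Decode is mostly memory-bandwidth-bound: tok/s is roughly
# bandwidth / bytes-of-weights-read-per-token. All numbers below
# (bandwidths, split fractions, bytes/weight) are guesses for illustration.

GB = 1e9

def decode_toks_per_s(active_params_b: float, bytes_per_weight: float,
                      gpu_fraction: float, gpu_bw_gbs: float, cpu_bw_gbs: float) -> float:
    """Crude estimate: time per token = time to stream the active weights
    from wherever they live (GPU VRAM vs system RAM)."""
    active_bytes = active_params_b * 1e9 * bytes_per_weight
    t_gpu = active_bytes * gpu_fraction / (gpu_bw_gbs * GB)
    t_cpu = active_bytes * (1 - gpu_fraction) / (cpu_bw_gbs * GB)
    return 1.0 / (t_gpu + t_cpu)

# 35b-a3b: ~3B active params/token at Q8 (~1 byte/weight), hot experts on GPU
print("A3B split GPU/CPU:", round(decode_toks_per_s(3, 1.0, 0.5, 900, 80)), "tok/s")
# 27b dense at ~Q4 (~0.55 byte/weight): most of it forced into system RAM on 16GB
print("27B mostly on CPU:", round(decode_toks_per_s(27, 0.55, 0.3, 900, 80)), "tok/s")
```

Tens of tok/s for the MoE vs single digits for the dense model, even though the dense file is smaller.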
27b does look like the smarter model, but IMO it won't be worth the performance drop. It'll be a lot slower.
PositiveBit01@reddit
I left out part of your questions. Dense is generally better in terms of quality for its size. With your system, you have a relatively small amount of gpu memory and a lot of system ram. MoE will generally make better use of this.
Even if it did fit, 35b would be faster so it would be a question of responsiveness vs quality.
craftogrammer@reddit (OP)
Got it, thanks, the responsiveness vs quality tradeoff makes sense. I think I should stick with MoE for now. If a GGUF shows up that fits the dense model into my VRAM, that could be another option. I am working with 230K context with the MoE at ~30 t/s, so it feels good, but it is not that smart for tool use and making decisions.
see_spot_ruminate@reddit
What is the fear or lock-in for Windows in this hobby?
You are VRAM limited; in that situation, MoE all the way.
You will not get all three of speed (fast), context (cheap), and coding quality (good). You get to pick 2.
Uncle___Marty@reddit
I'd honestly go for the 35b because the A3B part will just FLY and it's not far behind the 27B. You could also afford a higher quant with the 35B.
KubeCommander@reddit
And the higher quant will yield better outcomes vs a lower quant on 27B. There’s misinformation out there about how quants are ‘only a few percentage points off’, but those figures are pretty much tied to math benchmarks.
For agentic processes that aren’t doing coding tasks (like prompt generation, design / spec docs, devops processes, etc.), the level of typos that Q4 makes, on both 27B and 35B, is quite high. High enough to be disruptively annoying.
With code, a compiler can tell an agent that it f-d up, and usually exactly where. Not so with document generation. Making up function names or version numbers in a spec sheet is a thing that happens, and that will break shit in hard-to-find ways. You’d need a peer review for everything generated, which adds a lot of extra overhead and time. Apples to apples, 27B is a little better than 35B around this. But 35B at, say, Q5 or Q6 is significantly better and faster than the alternative.
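If you want to see why things degrade fast below ~5 bits, here’s a toy round-trip quantizer in Python; it’s plain symmetric integer quantization on random weights, not the actual GGUF k-quant schemes, so treat it as illustration only:

```python
# Toy illustration of per-weight error vs bit width. Random weights stand in
# for a real tensor; this is NOT how Q4_K_M / Q6_K are actually laid out.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 1, size=4096).astype(np.float32)   # stand-in weight block

def roundtrip_rmse(weights: np.ndarray, bits: int) -> float:
    qmax = 2 ** (bits - 1) - 1                        # symmetric signed range
    scale = np.abs(weights).max() / qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return float(np.sqrt(np.mean((weights - q * scale) ** 2)))

for bits in (8, 6, 5, 4, 3):
    print(f"{bits}-bit: rms error ~{roundtrip_rmse(w, bits):.4f}")
```

The error roughly doubles with every bit you drop, and that’s before any of the per-block tricks real quants use to claw some of it back.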
WigglyScrotum@reddit
I'd say the 27B model handles being pushed down to Q3 quite well, retaining good coherence. You can try imatrix quants to get an edge in retaining quality while keeping VRAM usage low. I've tested it, and it seems slightly better at Q3_K_XL on the 27B compared to an IQ4_NL_XL 35B-MoE—though that's subjective, since I don't lean too hard on it for coding and mostly use it as an assistant.
It makes sense, as the dense architecture still fares better in intelligence. Still, they are really close IMO, and the speed tradeoff isn't worth it for running the 27B.
On the other hand, with MTP you'll be able to fit the 27B at Q3, but I think you'll need to trade in some context size since the MTP heads add VRAM usage on top of the base model. You'll probably see a good speed improvement, but it needs more testing as it's still in draft. Sadly, I'm on AMD (also on 16GB) and it's still broken for me, at least on my RX 6900 XT.
So I'd say since you're on CUDA, test it out at Q3 and run some benchmarks. That's the best way to see if it pays off for you.
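On the context-size trade-off above, here's a quick KV-cache estimator you can plug your own numbers into; the layer/head/dim values are placeholders since I don't know Qwen3.6-27B's real config:

```python
# Rough KV-cache size vs context length. The model-shape numbers below are
# placeholders, not the actual Qwen3.6-27B configuration.

def kv_cache_gib(n_ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                 bytes_per_elem: float) -> float:
    """K and V, per layer, per position: 2 * n_kv_heads * head_dim elements."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx / 2**30

# q8_0 cache is roughly 1 byte per element (ignoring the small scale overhead)
for ctx in (65536, 131072, 232144):
    print(f"{ctx:>7} ctx: ~{kv_cache_gib(ctx, 48, 4, 128, 1.0):.1f} GiB KV cache")
```

Whatever VRAM the Q3 weights plus the MTP head leave you is roughly what bounds how far you can push the context.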
Maharrem@reddit
27B won't fit. Q4_K_M is ~17GB before KV cache, so on 16GB you're spilling to CPU and getting single-digit t/s. 35B-A3B MoE is the play here, the full file sits in system RAM but only 3B active params per token, so even with spilling it's way snappier. I'd run it with llama.cpp
--fitto keep shared experts in VRAM and you'll get interactive speeds no problem, just make sure you've got 32GB+ system RAM to hold the GGUF. You can also look at canitrun.dev to see what models your hardware can run.Icaruszin@reddit
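Quick way to sanity-check file sizes before you download anything; the bits-per-weight values are approximate averages for the common GGUF quants, not exact:

```python
# Weights-only size estimate: params * bits-per-weight / 8. KV cache and
# runtime buffers come on top of this.

APPROX_BITS = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9}

def gguf_size_gb(params_b: float, quant: str) -> float:
    return params_b * APPROX_BITS[quant] / 8

for model, params in (("27B dense", 27), ("35B-A3B", 35)):
    for quant in ("Q8_0", "Q4_K_M", "Q3_K_M"):
        print(f"{model:9s} {quant:6s}: ~{gguf_size_gb(params, quant):.1f} GB")
```

~16GB for the 27B at Q4_K_M is already the whole card before you’ve allocated a single token of cache.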
Icaruszin@reddit
I might be wrong, but I think in this case MTP doesn’t matter much if you can’t fit the entire 27B model in VRAM: it’s gonna be hella slower anyway. I would keep the 35B.