Need advice: Qwen3.6 27B MTP or 35B-A3B MoE MTP on 16GB VRAM (RTX 5080)?

Posted by craftogrammer@reddit | LocalLLaMA | 9 comments

Hey folks, looking for advice before I delete or keep a huge model file.

I’m testing local coding/agentic workflows on an RTX 5080 16GB + 96GB RAM. I already have Qwen3.6-35B-A3B-MTP running with llama.cpp MTP branch on Windows native, using CPU expert offload.

Current A3B setup:

```
Qwen3.6-35B-A3B-MTP Q8_0 GGUF --fit on --fit-target 1536 --n-cpu-moe 34 -c 232144 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --batch-size 2048 --ubatch-size 1024 --cache-ram -1 --checkpoint-every-n-tokens 8192 --spec-type mtp --spec-draft-n-max 2
```

At my previous ~196K context setting, with around 118K tokens of active prompt, I was seeing roughly ~1178 tok/s prefill and ~32 tok/s decode. Follow-ups at 118K–143K active prompt usually ran ~32–37 tok/s when MTP acceptance was good. Draft N=3 worked, but it over-drafted too often at deep context, so N=2 became my stable setting.
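If it helps anyone tuning draft N: here's the back-of-envelope model I use for the draft-N vs acceptance-rate tradeoff. It's a simplification (assumes independent per-token acceptance and near-free drafts, and ignores verification overhead), and the acceptance numbers are just illustrative, not measured:

```python
def expected_tokens_per_step(n_draft: int, accept: float) -> float:
    # Expected tokens emitted per verification step when n_draft tokens are
    # drafted and each is accepted independently with probability `accept`.
    # Geometric sum over the accepted prefix, plus the model's own token:
    #   1 + a + a^2 + ... + a^n_draft
    return sum(accept ** k for k in range(n_draft + 1))

# Rough intuition: N=3 only pays off over N=2 when acceptance stays high,
# which matches over-drafting hurting at deep context where acceptance drops.
for n in (2, 3):
    for a in (0.5, 0.7, 0.9):
        print(f"N={n} accept={a}: ~{expected_tokens_per_step(n, a):.2f} tok/step")
```

With acceptance around 0.5, N=3 adds almost nothing over N=2 while wasting draft work, which is roughly what I saw at deep context.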

Now I’m testing 232K context with the same A3B setup.

I also downloaded the new Qwen3.6-27B dense MTP grafted GGUF (UD XL), but it's around 30GB and I only have ~4GB left on my C: drive. Before I delete one or shuffle things around to keep both, I'm trying to find out whether people with similar hardware have actually compared these.
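Side note for anyone in the same disk crunch: a quick way to sanity-check free space per drive before committing to a 30GB download (paths are examples, adjust for your machine):

```python
import shutil

def free_gib(path: str) -> float:
    # Free space on the filesystem containing `path`, in GiB.
    return shutil.disk_usage(path).free / 1024 ** 3

# e.g. free_gib("C:\\") or free_gib("D:\\") on Windows;
# a ~30GB GGUF needs roughly that much headroom plus scratch space.
print(f"{free_gib('.'):.1f} GiB free here")
```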

Question: on 16GB VRAM + lots of system RAM, would you keep testing Qwen3.6-27B dense MTP, or stick with Qwen3.6-35B-A3B MoE + CPU expert offload + MTP?

I’m especially interested in real experience at 100K+ active prompt, not just short-prompt tok/s.

Things I’m trying to understand:

  1. Does 27B dense MTP actually beat 35B-A3B MTP + CPU expert offload on 16GB VRAM?
  2. At deep context, does dense 27B feel smoother, or does A3B still win because active params are much lower?
  3. For sustained coding-agent use, is dense consistency better than MoE active-param efficiency?
  4. If you tested both, which one would you keep if disk space was tight?
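My own napkin math behind question 2, in case my framing is off. File size scales with total params times bits-per-weight, but per-token decode cost scales with *active* params, which is why A3B + expert offload can still win on decode despite being the bigger file. The bpw figures below are rough assumptions for the quant families, not measured from the actual GGUFs:

```python
def gguf_gib(params_b: float, bits_per_weight: float) -> float:
    # Approximate weight-file size in GiB; ignores metadata and the fact
    # that embeddings/attention are often quantized differently.
    return params_b * 1e9 * bits_per_weight / 8 / 1024 ** 3

# Assumed bpw: ~4.8 for a Q4_K_M-style quant, ~8.5 for Q8_0.
print(f"27B dense @ ~4.8 bpw: {gguf_gib(27, 4.8):.1f} GiB")
print(f"35B MoE   @ ~8.5 bpw: {gguf_gib(35, 8.5):.1f} GiB")
# Decode compute per token: ~27B active (dense) vs ~3B active (A3B),
# which is the gap CPU expert offload is trading against.
```

So even a Q4-ish dense 27B barely squeezes into 16GB once KV cache is counted, while the MoE file is bigger on disk but only ~3B params are hot per token.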

I’m not trying to win a benchmark. I care about speed, context, and coding quality for long-running local agent work, tool usage, etc.