Testing llama.cpp MTP support on Qwen3.6 - RTX 5090

Posted by 3VITAERC@reddit | LocalLLaMA

Setup:

- RTX 5090, 32 GB, Linux

- Built llama.cpp from 4f13cb7 (the official ghcr.io/ggml-org/llama.cpp:server-cuda image hasn't picked up the merge yet as of writing — had to docker build from source with CUDA_DOCKER_ARCH=120)

- Unsloth's Qwen3.6-27B-MTP-GGUF Q5_K_M and Qwen3.6-35B-A3B-MTP-GGUF UD-Q4_K_M

- 128k context, flash-attn, q8_0 KV cache, temp 0.8, --parallel 1 (required for MTP)

- Same GGUF for "MTP on" and "MTP off"; only the --spec-type draft-mtp --spec-draft-n-max 3 flags were toggled. This isolates MTP from quant differences.

- 2 prompts: "short story about a cat" (~400 tokens) and "Flappy Bird clone as a single HTML file" (~3000 tokens)

- 3 seeds per config, averaged
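
The "3 seeds per config, averaged" step could be reproduced along these lines. This is only a minimal sketch: the tok/s figures are made-up placeholders, not the measurements from this post, and `mean_tps` and `speedup` are hypothetical helper names.

```python
from statistics import mean

def mean_tps(runs):
    """Average generation speed (tokens/s) over several seeded runs."""
    return mean(runs)

def speedup(mtp_on, mtp_off):
    """Ratio of the averaged MTP-on speed to the averaged MTP-off speed."""
    return mean_tps(mtp_on) / mean_tps(mtp_off)

# Placeholder values for three seeds per config (NOT measured data).
mtp_off_runs = [50.0, 52.0, 48.0]
mtp_on_runs = [74.0, 76.0, 75.0]

ratio = speedup(mtp_on_runs, mtp_off_runs)
print(f"MTP off: {mean_tps(mtp_off_runs):.1f} tok/s")
print(f"MTP on:  {mean_tps(mtp_on_runs):.1f} tok/s")
print(f"speedup: {ratio:.2f}x")
```

A ratio above 1.0 means MTP helped for that model and prompt; below 1.0 means it cost more than it saved.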

[-]

OsmanthusBloom@reddit

I'd be interested in your prompt processing speeds. Is there a big difference between MTP and non-MTP chugging through a prompt of, say, 10k tokens?

[-]

No-Dot-6573@reddit

I'd second that. The prompt processing speed seems to be affected as well. So if it is slower, MTP might be better turned off for, e.g., short stories with 27B.

[-]

tomByrer@reddit

I wonder if shorter runs just see little benefit?

[-]

legit_split_@reddit

The difference is much lower now:
https://github.com/ggml-org/llama.cpp/pull/23198

[-]

Bulky-Priority6824@reddit

180 tok/s on dual 5060 Ti with --parallel 2. Have you tried that?

[-]

LLM inference is memory-throughput bound at -np 1, so compute is not fully utilized. Parallelism, batching, and MTP all rely on this fact to use the spare compute. Going beyond --parallel 1 with MTP would, in theory, have two mechanisms competing for the same compute resource.
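
A back-of-envelope way to see this: at batch size 1, each generated token has to stream the active weights from VRAM once, so decode speed is capped at roughly bandwidth divided by bytes read per token. The sketch below uses round illustrative numbers (a 1,800 GB/s GPU and an 18 GB dense model); both are assumptions, not measurements from this thread, and it ignores KV cache reads and kernel overhead.

```python
def decode_ceiling_tps(bandwidth_gb_s, weight_gb_per_token):
    """Upper bound on tokens/s when every token streams the weights once."""
    return bandwidth_gb_s / weight_gb_per_token

def ceiling_with_verification(bandwidth_gb_s, weight_gb_per_token, tokens_per_pass):
    """If one weight pass yields several tokens (batching or accepted drafts),
    the ceiling scales with that count, as long as spare compute absorbs it."""
    return decode_ceiling_tps(bandwidth_gb_s, weight_gb_per_token) * tokens_per_pass

BANDWIDTH_GB_S = 1800.0  # assumed, illustrative
WEIGHTS_GB = 18.0        # assumed, illustrative dense model size

base = decode_ceiling_tps(BANDWIDTH_GB_S, WEIGHTS_GB)
batched = ceiling_with_verification(BANDWIDTH_GB_S, WEIGHTS_GB, 2.5)
print(f"one token per pass:  {base:.0f} tok/s ceiling")
print(f"2.5 tokens per pass: {batched:.0f} tok/s ceiling")
```

Both parallel slots and MTP raise `tokens_per_pass`, which is why they draw on the same pool of idle compute.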

[-]

unjustifiably_angry@reddit

I wonder if it's possible to run the draft on a second GPU.

[-]

tomByrer@reddit

maybe
https://github.com/Luce-Org/lucebox-hub/issues/102

[-]

Bulky-Priority6824@reddit

Theory vs measured results. The more I test the more I find.

[-]

Plasmx@reddit

Most likely p2 is better for you because of the dual GPU setup. I don’t think there is a benefit for single GPU setups.

[-]

kitanokikori@reddit

parallel=1 was required for initial versions of this PR but that isn't true any more

[-]

StardockEngineer@reddit

Good to know, thanks.

[-]

dco44@reddit

Running Qwen3 Q4_K_M on the other end of the hardware spectrum — 
iPhone via llama.cpp Metal (Swift SPM, n_gpu_layers=99).

No MTP on iOS yet but greedy decoding at context 2048 is solid:
- 1.7B: ~0.5s per response
- 8B: ~1s (iPhone 15/16 Pro, 8GB RAM)
- 14B: ~3s (iPad Pro M1+, 16GB)

Unified memory makes the RAM math cleaner than CUDA — Q4_K_M 8B 
fits in 8GB with ~3.5GB headroom for OS. Main challenge is OOM 
detection: I gate downloads with a free-memory check and fall back 
to 1.7B if the 8B load fails rather than crashing.
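
The gate-then-fall-back idea can be sketched roughly like this (in Python rather than Swift, to keep it short). The model sizes, the headroom value, and the `pick_model` and `load_with_fallback` helpers are all hypothetical; a real app would query free memory from the OS and call the actual loader.

```python
GIB = 1024 ** 3

def pick_model(free_bytes, candidates, headroom_bytes):
    """Return the largest candidate whose size plus headroom fits in free memory."""
    by_size = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size:
        if size + headroom_bytes <= free_bytes:
            return name
    return None

def load_with_fallback(order, load_fn):
    """Try each model in order; move on if loading raises MemoryError."""
    for name in order:
        try:
            return name, load_fn(name)
        except MemoryError:
            continue
    raise MemoryError("no model could be loaded")

# Hypothetical on-disk sizes, for illustration only.
candidates = {"8B": int(4.5 * GIB), "1.7B": int(1.1 * GIB)}

def fake_load(name):
    """Stand-in loader that simulates the 8B load running out of memory."""
    if name == "8B":
        raise MemoryError("simulated OOM")
    return f"handle:{name}"

print(pick_model(6 * GIB, candidates, 1 * GIB))
print(load_with_fallback(["8B", "1.7B"], fake_load))
```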

Would be curious if MTP ever lands in the Metal backend.

[-]

unjustifiably_angry@reddit

Same seed used? Made a huge difference in my test.

[-]

fck__spz@reddit

When do you estimate I'll be able to use Gemma 4 26b with MTP in llama.cpp?

[-]

nickm_27@reddit

Hopefully within the next day or two; Aman said he's working on it.

[-]

wilo108@reddit

Am I right in thinking llama-bench hasn't been updated to allow testing --spec-draft yet? Will it be?

[-]

go0og@reddit

There's an open issue to add these, and it seems to be in progress.

[-]

go0og@reddit

Another parameter that affects generation t/s is --spec-draft-p-min. Start with 0.75; I ended up dropping it all the way down to 0.2.
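
As I understand it, a p-min threshold stops drafting once the draft's confidence in its next token falls below the cutoff, so lowering it lets more speculative tokens through to verification. The sketch below illustrates that cutoff on made-up probabilities. It is not llama.cpp's actual implementation, and `draft_length` is a hypothetical helper.

```python
def draft_length(token_probs, p_min, n_max):
    """Count kept draft tokens: stop at the first token whose probability
    is below p_min, or after n_max tokens, whichever comes first."""
    kept = 0
    for p in token_probs[:n_max]:
        if p < p_min:
            break
        kept += 1
    return kept

# Made-up draft confidences for three consecutive tokens.
probs = [0.9, 0.6, 0.3]

for p_min in (0.75, 0.2):
    print(f"p_min={p_min}: {draft_length(probs, p_min, n_max=3)} draft tokens kept")
```

A lower threshold only pays off if the extra low-confidence tokens still get accepted often enough, so it is worth measuring per workload.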

[-]

Shoulon@reddit

What's a good guide for learning how to build llama.cpp, choose parameters, and run it in a Docker container? I could ask Claude, for example, but all this specific information seems far too fragmented.

[-]

FiLo420blazeit@reddit

Really useful breakdown, thanks for running this. The accept% column is doing all the work here: wherever it hits ~90% (the code prompts) you get a real speedup, and wherever it drops to ~40% (prose) MTP basically just adds overhead. That's expected behavior, but it's nice to see it this cleanly on actual hardware.

The spicy result is the 35B MoE on short story going backwards (0.81×). That's the worst case for speculative-style decoding: the base model is already fast (227 tok/s), so a low-acceptance draft can't earn back its own cost and you net negative. The dense model never goes below 1.0× because its baseline is slow enough that even a bad draft is roughly free.

Practical takeaway seems to be: enable MTP for structured/code-ish workloads, leave it off for creative/open-ended generation, especially on the MoE.

Curious what your draft settings were (number of speculative tokens, any threshold on acceptance)? Wonder if tuning those pulls the prose numbers back above 1.0× or if low accept% just kills it regardless.
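
For a rough feel of why acceptance rate dominates, here is the standard speculative-decoding estimate as a sketch. It assumes each draft token is accepted independently with the same probability, that verifying the whole draft costs about one normal decode pass, and that each draft token costs a fixed fraction of a pass. The 0.3 draft cost is an invented illustrative value, not something measured in this thread.

```python
def expected_tokens_per_step(accept_rate, n_draft):
    """Expected tokens emitted per verification pass, assuming each draft
    token is accepted independently with probability accept_rate."""
    if accept_rate >= 1.0:
        return n_draft + 1.0
    return (1.0 - accept_rate ** (n_draft + 1)) / (1.0 - accept_rate)

def estimated_speedup(accept_rate, n_draft, draft_cost):
    """Tokens per step divided by the step's cost relative to plain decoding.
    draft_cost is one draft token's cost as a fraction of one target pass."""
    step_cost = 1.0 + n_draft * draft_cost
    return expected_tokens_per_step(accept_rate, n_draft) / step_cost

DRAFT_COST = 0.3  # assumed, illustrative; likely higher when the base model is fast

for accept in (0.9, 0.4):
    s = estimated_speedup(accept, n_draft=3, draft_cost=DRAFT_COST)
    print(f"accept={accept:.0%}: estimated speedup {s:.2f}x")
```

Under these assumptions a high acceptance rate clears 1.0× comfortably while a low one can dip below it, which matches the shape of the results even though the exact numbers here are not the post's.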

[-]

AmoebaDue6638@reddit

Solid benchmarking methodology, especially isolating MTP by toggling only the spec flags on the same GGUF. Curious how much the gains scale with longer context windows.

[-]

kitanokikori@reddit

This is an interesting result. I consistently get ~10-15 tok/sec on Qwen3.6-35B-A3B on Strix Halo. Params I'm running for an OpenClaw-type app (not an expert! don't take these params as gospel):

  llama-server \
  -m ./Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved-Q6_K.gguf \
  --mmproj ./mmproj-Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-f16.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  --jinja \
  --host 0.0.0.0 \
  -ngl 99 \
  -cram 65536 \
  --flash-attn on \
  --no-mmap \
  --fit on \
  --reasoning-budget 1536 \
  --reasoning on \
  --cont-batching \
  --no-context-shift \
  --ctx-checkpoints 128 \
  --checkpoint-every-n-tokens 2048 \
  --parallel 2 \
  --cache-reuse ${LLAMA_CACHE_REUSE:-1536} \
  --cache-prompt \
  --presence-penalty ${LLAMA_PRESENCE_PENALTY:-0.5} \
  --min-p ${LLAMA_MIN_P:-0.0} \
  --top-p ${LLAMA_TOP_P:-0.9} \
  --top-k ${LLAMA_TOP_K:-20.0} \
  --temp ${LLAMA_TEMP:-0.8} \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  -ctk q8_0 \
  -ctv q8_0 \
  -b 2048 \
  -ub 2048 \
  --metrics

[-]