Llama.cpp MTP with Qwen3.6 27B on Headless RTX 3090
Posted by cleversmoke@reddit | LocalLLaMA | 42 comments
I saw some posts about PP (prompt processing) being slower with MTP, which made people cautious about trying it.
Here's a real-world datapoint.
Settings:
- Headless RTX 3090 24G
- Model unsloth's Qwen3.6-27B-MTP-Q4_K_M.gguf
- 128k context
- q8_0 kv cache
- --spec-draft-n-max: 3
- --draft-p-min: 0
Use Cases:
- Research task that uses ~85,000 tokens
- Coding task that uses ~85,000 tokens
Without MTP (llama.cpp:server-cuda13-b9174):
- PP: 1,050 tok/s
- TG: 27 tok/s
- Total time to complete 85k tokens: ~39 mins
With MTP (latest master fork):
- PP: 600 tok/s (down 43%)
- TG: 50 tok/s (up 85%)
- Total time to complete 85k tokens: ~23 mins (1.7x faster or 41% reduction)
A 41% time saving is huge, so unless your workload is PP heavy, I'd recommend giving MTP a try on your own use cases! Note that I run a dual-agent setup, where a separate critic agent checks the main agent's work, so your total processing times may be better than mine.
chelijenardi@reddit
What launch settings did you use? I wanted to try MTP but I'm getting OOM on a 24G 4090
cleversmoke@reddit (OP)
- "--model"
- "/models/unsloth_Qwen3.6-27B-MTP-Q4_K_M.gguf"
- "--alias"
- "qwen3.6-27b"
- "--spec-type"
- "draft-mtp"
- "--spec-draft-n-max"
- "3"
- "--draft-p-min"
- "0.0"
- "--jinja"
- "--reasoning-format"
- "deepseek"
- "--chat-template-kwargs"
- '{"preserve_thinking":true}'
- "--no-mmproj-offload"
- "--ctx-size"
- "131072"
- "--fit"
- "on"
- "--fit-ctx"
- "131072"
- "--fit-target"
- "512"
- "--ctx-checkpoints"
- "16"
- "--cache-type-k"
- "q8_0"
- "--cache-type-v"
- "q8_0"
- "--flash-attn"
- "on"
- "--n-gpu-layers"
- "99"
- "--parallel"
- "1"
- "--threads"
- "8"
- "--threads-batch"
- "8"
- "--batch-size"
- "512"
- "--ubatch-size"
- "512"
- "--no-mmap"
- "--temperature"
- "0.6"
- "--top_p"
- "0.95"
- "--top_k"
- "20"
- "--min_p"
- "0.0"
- "--presence_penalty"
- "0.0"
- "--repeat_penalty"
- "1.0"
- "--n-predict"
- "32768"
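For anyone launching outside a container, the list above can be flattened into a single shell command. This is a minimal sketch, not the OP's actual launcher: it uses only a subset of the flags listed above, and the binary name `llama-server` is an assumption about how the server is started on your machine.

```python
import shlex

# Subset of the flags from the list above; values are copied from the post.
# Assumption: the server binary is called "llama-server" and is on PATH.
args = [
    "--model", "/models/unsloth_Qwen3.6-27B-MTP-Q4_K_M.gguf",
    "--spec-type", "draft-mtp",
    "--spec-draft-n-max", "3",
    "--draft-p-min", "0.0",
    "--ctx-size", "131072",
    "--ctx-checkpoints", "16",
    "--cache-type-k", "q8_0",
    "--cache-type-v", "q8_0",
    "--flash-attn", "on",
    "--n-gpu-layers", "99",
    "--chat-template-kwargs", '{"preserve_thinking":true}',
]


def build_command(binary, flags):
    """Join the binary and its flags into one shell-safe command string."""
    return shlex.join([binary, *flags])


command = build_command("llama-server", args)
print(command)
```

`shlex.join` takes care of quoting the JSON value for `--chat-template-kwargs`, which is the one argument that is easy to get wrong when typing the command by hand.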
cleversmoke@reddit (OP)
If you're not on headless, you'll have to lower your context or quant size to account for the 1-1.5GB display overhead!
chelijenardi@reddit
Yeah I'm not doing headless but I was able to do 128k context + mmproj fine without MTP. No idea how to calculate things with MTP. Gonna play around with it more. Thanks for the info!
cleversmoke@reddit (OP)
No problem! I'm guessing you're using either Q5_K_M or Q5_K_S if you're at 128k context with q8_0 KV cache. To help speed up your investigation, account for a 2.5-3GB overhead from MTP. This will likely mean stepping down a quant (Q4_K_M or Q4_K_S), or reducing your context by around 40-50k if you want to keep Q5.
There's also a hidden MTP VRAM tax after the first token: you'll see your VRAM creep up by about 0.1GB for every 10k of context until it reaches about 1GB, where it stabilizes.
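The rule of thumb above can be turned into a rough budget check. This is only a sketch built from the numbers in this comment: the function name and parameters are hypothetical, and it assumes the context-dependent creep comes on top of the fixed 2.5-3GB overhead, which the comment does not state explicitly.

```python
def mtp_vram_overhead_gb(context_tokens, base_gb=2.5,
                         creep_gb_per_10k=0.1, creep_cap_gb=1.0):
    """Rough extra VRAM (GB) to budget for MTP at a given context usage.

    Assumption: fixed overhead plus a creep of ~0.1GB per 10k tokens of
    context, capped at ~1GB. These are anecdotal figures, not measurements
    from llama.cpp itself.
    """
    creep = min(creep_cap_gb, creep_gb_per_10k * context_tokens / 10_000)
    return base_gb + creep


# Budget for a session that fills the full 128k (131072-token) context,
# using the low and high ends of the fixed-overhead range.
low = mtp_vram_overhead_gb(131_072, base_gb=2.5)
high = mtp_vram_overhead_gb(131_072, base_gb=3.0)
print(f"extra VRAM to budget: {low:.1f}-{high:.1f} GB")
```

If you are not running headless, add the display overhead mentioned earlier on top of this figure.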
chelijenardi@reddit
Have been using Q4_K_XL actually, on both normal and MTP version.
Am getting good results currently with 86k context and --spec-draft-n-max 4 so probably going to stick with that for a bit.
Synthetic451@reddit
Blargh, I got excited about your post until I realized you were running headless. I've been struggling to get 128k context to not OOM. My desktop session takes 1.4-2.0 GB of VRAM and that's enough to push everything over the edge. I've had to drop down to 110k context and restrict GPU layers to 65 instead of allowing it to grab the full 66. I am getting around 40 tok/s though, so I guess it isn't terrible. I hate that my 24GB of VRAM is right at the limit for a lot of useful tasks.
cleversmoke@reddit (OP)
Oh! Try also adding the flag "--ctx-checkpoints 16". Default is 32 and OOMed me a few times. Now with 16, it's much safer.
Are you going to get a super cheap GPU specifically for display so you can go headless, or won't your PC's motherboard allow it?
Living-Office4477@reddit
Cannot wait for PP speed to increase. It's really dragging the improvements down for me; when coding with larger files I really feel it.
legit_split_@reddit
Just got merged 😄
https://github.com/ggml-org/llama.cpp/pull/23198
Living-Office4477@reddit
Oh wow, lovely!! Any idea when it will hit docker release?
Synthetic451@reddit
Feels like builds take a day to come out on docker
legit_split_@reddit
Not sure, don't follow the cadence.
I use these images for my Mi50s which get updated daily:
https://hub.docker.com/r/mixa3607/llama.cpp-gfx906
tomz17@reddit
Not sure that's in the cards, as MTP fundamentally requires those extra forward passes (e.g. 1 pass per MTP token). Therefore I keep it turned off, even in vLLM.
The big wins for MTP are small-context requests with more deterministic outputs (e.g. code).
cleversmoke@reddit (OP)
Yeah, same. I was seeing PP speeds in the 100s of tok/s in some places in the logs too, hence the full test on what I actually use my agents for, to see whether the PP hit applies to me or not.
How many files are you working with? I believe the geniuses over at llama.cpp will get PP speed to increase eventually!
TechTefa@reddit
thats really nice man, MTP and also turbo quants are awesome, new op shit every month xD
sadly i cant run 27b so i am waiting for llama cpp update to run ZAYA1-8B, it seems freaking epic for 4gb vram
tomz17@reddit
What is the ratio of prompt to TG? For most agentic workflows I've used, the prompt dominates over generation.
Given that you are (presumably) outputting 85k tokens per task, I'd say your particular use case is very atypical, no?
cleversmoke@reddit (OP)
My PP:TG ratio is about 1:1.5 to 1:2. Reason being, I use docs, skills, and planning files, because it helps keep the PP low.
85k is the total tokens in a session. I'm unsure if my use cases are atypical, but it's not one-shot prompting. Rather, I build one small microservice at a time (~5 API endpoints, <10 files) or keep a research loop to 5-10 units only (e.g. 5 stock tickers, 5 restaurants, etc.). This helps keep the output quality high because it avoids compaction. 1 compaction is sometimes OK, but even at 1 compaction I see some degradation.
tomz17@reddit
Interesting. In that use-case MTP is going to be beneficial (as you've seen). My coding sessions are more like an order of magnitude different. The agent spends a LOT more time reading code than writing things (tool calls, code, etc.)
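The prompt-versus-generation tradeoff discussed here can be estimated with simple arithmetic. The sketch below uses the PP and TG speeds from the original post and a hypothetical 34k/51k split of an 85k session (a 1:1.5 ratio, the low end of what the OP reports). It assumes the logged speeds hold for the whole session and ignores prompt caching, so treat the result as a rough guide only.

```python
def total_minutes(prompt_tokens, generated_tokens, pp_tok_s, tg_tok_s):
    """Time to process the prompt plus time to generate the output, in minutes."""
    seconds = prompt_tokens / pp_tok_s + generated_tokens / tg_tok_s
    return seconds / 60


def break_even_prompt_ratio(pp_base, tg_base, pp_mtp, tg_mtp):
    """Prompt tokens per generated token at which MTP and non-MTP take equal time."""
    tg_saving = 1 / tg_base - 1 / tg_mtp  # seconds saved per generated token
    pp_cost = 1 / pp_mtp - 1 / pp_base    # extra seconds per prompt token
    return tg_saving / pp_cost


# Speeds reported in the post (tok/s).
BASE = {"pp_tok_s": 1050, "tg_tok_s": 27}
MTP = {"pp_tok_s": 600, "tg_tok_s": 50}

# Hypothetical split of an 85k-token session at a 1:1.5 prompt-to-generation ratio.
prompt, generated = 34_000, 51_000

base_minutes = total_minutes(prompt, generated, **BASE)
mtp_minutes = total_minutes(prompt, generated, **MTP)
ratio = break_even_prompt_ratio(1050, 27, 600, 50)

print(f"without MTP: {base_minutes:.1f} min, with MTP: {mtp_minutes:.1f} min")
print(f"MTP stops paying off above ~{ratio:.0f} prompt tokens per generated token")
```

With these particular speeds the break-even lands at roughly 24 prompt tokens per generated token, so a workload that reads far more than it writes can still come out behind with MTP. Plug in the speeds from your own logs before drawing conclusions.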
sagiroth@reddit
What sort of harness and tool do you use to measure that? I'd like to give it a try
cleversmoke@reddit (OP)
OpenCode!
sagiroth@reddit
I couldn't find where OpenCode shows speeds
cleversmoke@reddit (OP)
I view my llama.cpp logs for token speeds. OpenCode will only show total time and total tokens, in my experience. I have llama.cpp set up as a docker container, but the logs should be viewable even when run natively via terminal.
notlongnot@reddit
llama-swap in front of llama.cpp gives a cleaner log 🪵
jacek2023@reddit
I switched to MTP on my pi coding and it works quite well. I have similar speed as you but on Q8 (but three GPUs not one).
cleversmoke@reddit (OP)
Awesome, I wish I could go to Q8. One day, one day!
jacek2023@reddit
I run actual coding tasks in a loop for a few hours (coding, documentation, tests, commits). I may try a lower quant too to increase speed later, but for now Q8 is very usable.
cleversmoke@reddit (OP)
Same, I was at Q5_K_S without MTP at 138k context, but had to go down to Q4_K_M for 128k context. Since I have a critic subagent on a 2nd RTX 2060 12G using DeepSeek-R1-Distill-Qwen-14B, the coding output has been really solid. I'd say about a 98% success rate on Python coding tasks. I judge this by the final output and by how often I either have it fix errors (up to 3 fixes for complex tasks of fewer than 10 files) or have to go to Claude Sonnet 4.6 (more complex tasks of 10+ files).
burdzi@reddit
Sounds nice 🤩 How do you run this subagent in opencode? Automatically in a loop? Or do you trigger it manually?
cleversmoke@reddit (OP)
I posted info on subagent here! https://www.reddit.com/r/LocalLLM/s/EkrH7ETwkY
TL;DR: I use the strengths of 2 agents rather than 1. The main coding agent (Qwen3.6) is great at writing bulk code, and pairing it with a reasoning agent (DeepSeek-R1-Distill-Qwen or Gemma-4) that focuses only on certain aspects of a module (security, no bloat, efficiency, etc.) helps ensure the output from the main agent is of high quality.
I use OpenCode's config.jsonc to set up the agent and subagent, and then skills to tell the agent to invoke the subagent at certain parts of its instructions. All automatic.
burdzi@reddit
Thank you! I will take a look 😄
jacek2023@reddit
cleversmoke@reddit (OP)
Nice, this is the way. Docs, plans, and skills
jacek2023@reddit
I downloaded Q3 with MTP and will try to run it to show you soon
hurdurdur7@reddit
Interesting. On my current coding session i see that my pi "status line" looks like this:
↑ 162K ↓ 20k R 4.0M 65.8%/250K ... I am not really convinced I would start saving time if I switched to the current MTP.
For now I am running the 0.8B speculation (draft) model at Q8. Yes, I pay an extra context penalty for this, but while coding I still get a decent speedup (a bump from 20 -> 27 tokens per second during actual code generation) compared to the "naked 27B" itself.
Of course this is very workflow dependent, but my current task is very data-processing heavy, so I need fast PP speed too (sometimes the ratio of input to output is even more skewed ... and my current chatter there is with reasoning turned on).
jacek2023@reddit
check your speeds, because 10x20 is still more than 162
hurdurdur7@reddit
But I already get a speedup from the speculation model. If I drop that advantage and jump to MTP, not that much changes, but PP will take a steep hit.
SimShelby@reddit
For me, 50 tps is good for daily production, even for production with Claude Code.
PhysicalIncrease3@reddit
How much extra VRAM does MTP consume?
I run a 3090 headless too, but with the unsloth Q5 gguf. I make it fit with 110k context (Q8) by using --no-mmproj-offload. Vision is obviously slower as a result but this is fine for my use case.
Problem is that I'm right at the VRAM limit, so I will have to reduce context significantly to get MTP to fit. Wonder if the tradeoff is worth it.
cleversmoke@reddit (OP)
MTP overhead consumes about 2.5GB of VRAM, or about 50k less context, in my experience. With Qwen3.6-27B, I had to go from Q5_K_S with 138k context at q8_0 KV cache to Q4_K_M with 128k context at q8_0 KV cache. My use cases require at least 120k context, so I had to step down in quant.
I say give it a try! You'll lose maybe an hour or two tuning for your setup and testing, but it's free to try otherwise!
PhysicalIncrease3@reddit
That's not too bad! Definitely worth a try. Thanks for your advice.
allpowerfulee@reddit
Has anyone compared llama.cpp with omlx?