That's good news...
Posted by Pjotrs@reddit | LocalLLaMA | View on Reddit | 206 comments
Looks like it's finally happening... MTP is getting approved for llama.cpp.
Time to prepare for the update.
bernzyman@reddit
Has the vision fix been included? MTP is designed to be compatible with vision, but a bug had been preventing it from working. Don't know if the fix has been merged upstream as well?
sonicnerd14@reddit
I hope. Haven't heard enough people talk about this. Definitely something that should be rectified asap.
bernzyman@reddit
I won’t have a chance to compile this latest build today to check. But I have a custom build which includes the fix and so I know it works. Great if upstream incorporates MTP and the vision fix
StorageHungry8380@reddit
It was reported in the other thread that the PR says it's supported.
bernzyman@reddit
That note was there (in the notes of the MTP-enabled GGUFs on HF) when the bug was in effect. It's stating how vision should be compatible with MTP.
FatheredPuma81@reddit
So uhh... what am I doing wrong? Seems like MTP is only useful for under 30k context single slot inputs...
Qwen3.6 35B 1 slot fresh context: 185t/s (150t/s without MTP)
Qwen3.6 35B 1 slot 40k context: 100t/s (135t/s without MTP)
Qwen3.6 35B 2 slot 40k context: 50t/s (95t/s without MTP)
Qwen3.6 27B 1 slot fresh context: 90t/s (50-ish without MTP)
Qwen3.6 27B 1 slot 40k context: 50t/s (45t/s without MTP. 57t/s spec-draft-n-max = 2 instead of 3)
Qwen3.6 27B 2 slot 40k context: 30t/s (36t/s without MTP. 28t/s spec-draft-n-max = 2 instead of 3)
FullstackSensei@reddit
Link to the PR: https://github.com/ggml-org/llama.cpp/pull/22673
Limp_Classroom_2645@reddit
Merged
Glad_Claim_6287@reddit
u/yags-lms LM studio when
cafedude@reddit
LM Studio uses llama.cpp; however, I doubt they have the spec=mtp option in their GUI. That's something they'd need to add.
theUmo@reddit
Good luck figuring out which version they're running
FullstackSensei@reddit
Wrappers tend to be quite behind in adding features. There's also a non-zero chance they'll screw up adding it in a way that makes it useless for anything and anyone besides whatever happy flow they thought of.
Glad_Claim_6287@reddit
Yeah that's my question man.
Limp_Classroom_2645@reddit
Whenever
f4nt4@reddit
Build pipeline for that PR/release: https://github.com/ggml-org/llama.cpp/actions/runs/25961507493
FullstackSensei@reddit
Real men pull master and build from source! /s
robertpro01@reddit
Lol, I am not a real man anymore, I'm just another vibe coder at this point
tempedbyfate@reddit
It literally takes me 40 seconds to download and build the repo. Just put together a basic shell script that does the clone/pull from the master branch and builds with your desired args.
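Something along these lines works as a minimal sketch (the repo location and the CUDA flag are just examples; swap in whatever backend flags you normally build with):

```
#!/usr/bin/env bash
# Minimal sketch: clone llama.cpp once, then pull master and rebuild.
set -euo pipefail

REPO_DIR="$HOME/src/llama.cpp"   # example location, put it wherever you like

if [ ! -d "$REPO_DIR" ]; then
    git clone https://github.com/ggml-org/llama.cpp "$REPO_DIR"
fi

cd "$REPO_DIR"
git checkout master
git pull

# Pick the build flags for your backend; CUDA shown here as an example.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j "$(nproc)"
```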
Deep90@reddit
Plus you can literally use an llm to build it for you if you don't know how.
LocoMod@reddit
Builds fast on an M-series Mac. On a top tier PC it takes forever.
BigPoppaK78@reddit
Forever? I can build a whole new podman image (with CUDA) in a few minutes. Running Fedora on a 7950x3d.
FullstackSensei@reddit
But dude might have been comparing to that $500 laptop with an i3 from 15 years ago, because as we all know, technology only advances in fruity land
snmnky9490@reddit
And the "average" Apple laptop being compared even for like basic consumer use is often $2000-3000 vs that 2012 Windows one that was the cheapest thing Bestbuy had in stock.
robertpro01@reddit
Yeah, I have a script for compiling it as well
ionizing@reddit
I even added a menu entry to trigger the shell script from my app. "Server -> Rebuild llama.cpp" checks if there is a new release, then gets it and refreshes the local build.
Deep90@reddit
Real men pull the PR branch and build from source 😜
Borkato@reddit
Wait are those basically replacing git pull && git checkout latest-branch-thingy && cmake whatever && cmake cuda build?
Comfortable-Rock-498@reddit
Georgi Gerganov has done more to improve the world than most if not all AI CEOs
mdziekon@reddit
It's not just Georgi, don't forget all the contributors who provide their support either in code or testing. This PR was created by Aman Gupta.
Comfortable-Rock-498@reddit
You're right. A large number of contributors never get recognition. Come to think of it, it's not a bad idea to create a recognition portal that automatically fetches all OSS contributions so people can get credit for them.
I think shit like Forbes 30 under 30 is coveted only because we do not have means to recognize and reward contributions to the community, and most people want to be recognized for and feel proud of their work
relmny@reddit
Taking into account that GG/llama.cpp probably gets almost no recognition from "big" companies/projects/media/etc because of crap like ollama, the contributors get below zero... it's like the 1% of the 1%...
Fuck ollama and the likes! long live llama.cpp!
Ok_Scientist_8803@reddit
Llama.cpp is complicated* but it gets you everything. Ollama is supposed to be the easiest, but LM Studio seems to be even easier, while not hiding the fact that it uses llama.cpp.
*Compared to GUI only methods that a non tech savvy person could use.
I believe ollama used to top Google search results for something like "run ai on my computer", but nowadays it's further down.
Imaginary-Unit-3267@reddit
You know you've become a true Linux nerd when "llama.cpp is complicated, GUIs are not" has you scratching your head and thinking "but you have to click all those buttons in a GUI! with llama.cpp you just type in a few options and flags! it's ez!" Sigh...
ShaneBowen@reddit
Llama.cpp is the ffmpeg of AI.
Comfortable-Rock-498@reddit
True about Ollama. I recently built a coding agent which started as a Cline fork. I removed Ollama support just on principle
tiffanytrashcan@reddit
https://i.redd.it/iezmjnzg9i1h1.gif
InsensitiveClown@reddit
You know what would really be helpful? Not a recognition portal, but a bounty portal. You want MTP or Turboquant, Rotoquant? Sure, that's great, but implementing features and doing fixes takes time. Unpaid. Away from your own family, your children. And in the end, most people just feel entitled and spam everything demanding, not asking, demanding, feature X, Y, Z. Software development costs time and effort, and we're all human. I'm not directing this at you in particular. It's a notorious problem. For example, the developers of curl or FFmpeg: their projects are used everywhere, and yet, as critical parts of infrastructure, what do they have, monetarily, that can stabilize them in order to work full time on it? These particular cases are relatively extreme and stabilized, but there are many cases where corporations, and in some cases users, just take, demand, and give nothing in return.
It is an unequal situation, that generates high attrition, burnout, and inevitably results in excellent developers with a good overview of a codebase, leaving because not leaving has a tremendous personal, professional, economical cost.
Imaginary-Unit-3267@reddit
I agree with the other commenters that this would lead to bad incentives. However, there is an alternative. You can donate money to maintainers of your favorite software projects. :)
raikounov@reddit
Your intentions are good, but that's how you get the maintainers burnt out reviewing hundreds of slop PRs trying to get the bounty.
Eisenstein@reddit
Sounds like a solution ripe with unintended consequences. Once you monetize something, you get a completely different incentive structure than the one you had. This leads to effects, some of which can be predicted and some of which cannot.
dnsod_si666@reddit
How would you ensure a bounty has been completed before paying out? Or would it be on the bounty-setter to decide and then if they have a bad reputation of not paying fairly, devs won’t take their bounties?
InsensitiveClown@reddit
If the bounty is "implement feature X in app Y for reward Z", then if someone accepts, they would need to submit a PR and go through the reviews; when it is merged it is done, or at least when there are no further reviews and it is pending acceptance. That's clearly done: the code is public, the PR submitted and reviewed. There are escrows and networks for FLOSS bounties, see: https://wiki.p2pfoundation.net/Crowdfunding_for_Free_Software_and_Free_Hardware_Projects
tmflynnt@reddit
Hmmm, interesting ideas.. Just spitballing here: I wonder if something like this could use a kind of escrow setup and allow others to join the bounty too?
Maybe below a certain threshold, established devs would get the benefit of the doubt when they claim completion (probably with a basic AI verification backing it) and it could get paid out automatically. And maybe above a certain $$ threshold, or where there is a verification problem or an appeal, an independent human reviewer could be involved (with a small cut going toward keeping the overall system running)?
am17an@reddit
Yup!
Comfortable-Rock-498@reddit
More people should read this comment ^
gh0stwriter1234@reddit
Actually, the opposite is true: just like Reddit karma farming, people farm contribution-list attribution on GitHub. In fact, the guy who did the MTP PR was also helping deal with this issue in llama.cpp itself, by blocking trivial commits as your first commit to llama.cpp. The idea is they'd rather have more substantial contributors than 1000 people "fixing" typos just to get on the list.
Plabbi@reddit
That makes no sense. Then we wouldn't have any models and llama.cpp would be useless.
Comfortable-Rock-498@reddit
Plabbi@reddit
No I'm not. Researchers have no authority to decide to publish the open weights.
AnOnlineHandle@reddit
Who do you think does 99.999% of the actual critical work, the person with the relevant education who shows up day after day and makes it happen, or the person who says they can release the weights after and gets the credit?
Plabbi@reddit
That has absolutely nothing to do with it. The CEO makes these decisions, or even the board.
You cannot independently decide to open source the software that your company is producing, no matter whether you do 50% or 100% of the work.
What has this got to do with anything?
AnOnlineHandle@reddit
Uh huh.
m7l5@reddit
I see your point. I actually give Zuckerberg one good point for being the one who triggered that initiative, even if the reason made sense for them economically at the time.
More-Curious816@reddit
It was Yann LeCun; he was the chief AI scientist at Facebook and the one who shaped Facebook's open source culture. He has left Facebook now, and you can see that Zuckerberg and his folks immediately decided not to release their new models.
LumpyWelds@reddit
Crap.. I was deluded. Thanks for clearing that up!
crantob@reddit
DingDingDing. Remember and respect the name Yann LeCun.
MuDotGen@reddit
Just tried it with an Intel Arc 140V on Windows Vulkan and Qwen3.6-35B-A3B-MTP IQ4_K_XS, and I was seeing worse speeds: n=2 was best, but still worse than a single pass. I know this is best for Nvidia GPUs but thought I'd try it nonetheless.
xjE4644Eyc@reddit
Yes, the t/s improved, but the prompt processing decreased. The overall time to token completion was slower with MTP enabled, at least with Strix Halo. I'll post my results in a bit.
ElementNumber6@reddit
MTP benefits are very situation-dependent.
Task-Dependent Efficiency: MTP thrives on low-entropy tasks with rigid syntax (e.g., coding) where the acceptance rate for multi-token guesses is high. For creative writing or complex tool calls, acceptance rates plummet, meaning the extra compute spent generating drafts is wasted.
Training Complexity: The model requires specialized pre-training or fine-tuning. Generating multiple tokens at once requires complex architectures to prevent grammatical mismatches.
KV Cache Overhead: Predicting deeper MTP steps (e.g., MTP > 3) requires storing and evaluating significantly larger parallel hidden states. This KV cache overhead can eat into your context capacity on smaller hardware setups.
Poor Fit for MoE: While MTP works beautifully on dense models, Mixture of Experts (MoE) architectures struggle with MTP heads, often yielding little to no inference speedup in practice.
spaceman3000@reddit
Yes, this is the issue with MTP right now, and Strix Halo is painfully slow at PP even without MTP due to low GTT bandwidth, so I'll pass for now.
Goldandsilverape99@reddit
Is support for mmproj/vision included when using mtp?
DoorStuckSickDuck@reddit
Also doesn't support parallel I think
Borkato@reddit
What’s the point of parallel if you’re the only one using it? I’ve never understood that
huzbum@reddit
So if there is more than one context it doesn't blow out your cache and have to re-process everything. This could be agents with sub-agents, or just a UI that does summaries or something.
I use multiple tools, and some of those tools use multiple requests/contexts. Like IntelliJ IDEA AI chat runs like 6+ parallel requests, so I set parallel to 8 and have it cached to system memory with `--cache-ram`.
Otherwise, if you have a super long conversation, it should only have to process the new message, but if any other request comes in, it blows out the cache and has to re-process the entire conversation. It's the difference between less than 1 second to first token and like 10+ seconds.
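For reference, the kind of invocation I mean looks roughly like this (the model path and the --cache-ram size are just placeholders; check llama-server --help on your build for the exact flags):

```
# Illustrative sketch: 8 server slots so parallel requests don't evict each other's
# KV cache, plus a host-RAM prompt cache for slots that do get swapped out.
llama-server \
  --model /path/to/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  --ctx-size 65536 \
  --parallel 8 \
  --cache-ram 16384
```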
Borkato@reddit
This is super helpful thank you!
nickm_27@reddit
Having multiple cache slots so different workloads all have their own cache.
Borkato@reddit
Uh what does that mean? Like agents?
BigPoppaK78@reddit
Yeah, that's one possibility. Agents and sub-agents - so, the ability to delegate tasks and run parallel/background processes that don't bloat up your main context with irrelevant text.
But, if you're also the type of person to have multiple chats going, like one main conversation and then multiple smaller ones where you ask quick one-off questions, this helps with responsiveness.
Borkato@reddit
Oh! I didn't know that, this makes a lot of sense actually, thank you!
nickm_27@reddit
That was also fixed
Endlesscrysis@reddit
One of the first lines is:
"Tip: MTP is compatible with Vision input."
audioen@reddit
I have tested it on the PR and it works just fine.
What didn't work is the combination of e.g. draft-mtp and ngram-simple, as it seems to disable some recurrent-state fallback gizmo. I want something like ngram-simple so that whenever the model is reciting code or its own reasoning, it can go ultra-fast.
iamapizza@reddit
Looking at the example they gave:
That seems to imply it does support it, and you'd pass the flag to disable it? Or am I reading too much into it?
alew3@reddit
Getting 105-110 tokens/s with unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-Q8_0.gguf on an RTX 5090.
dodiyeztr@reddit
Context window: 20k
Deep90@reddit
Why would the context window be 20k?
They are correct. I've been running it for like a week or so now at full context because I had the sense to just build the PR branch.
alew3@reddit
260K, spilling over to RAM. Prompt processing gets very slow though
fredandlunchbox@reddit
Add tq4?
coherentspoon@reddit
is the Q8 really worth it over like Q4 or Q5?
Icy_Butterscotch6661@reddit
So say the rumors.
UmpireBorn3719@reddit
what is your max ctx size tho?
alew3@reddit
260K, spilling over to RAM. Prompt processing gets very slow though .. going to try Q5
crantob@reddit
useful. ty. what were you getting without MTP on the same hardware?
RoutineProperty7061@reddit
The prefill degradation has been fixed?
spaceman3000@reddit
No
spacenavy90@reddit
LM Studio support when?
dave-tay@reddit
Sweet, 40 t/s with Qwen3.6-35B-A3B-UD-Q4_K_M
StephenSRMMartin@reddit
I went from 28 to 48 t/s on my AMD 6700xt on ROCm. Awesome.
Pjotrs@reddit (OP)
And before? On 16GB vram I get 45-50 on 4060 and 55-60 on 5070.
Without MTP.
dave-tay@reddit
Before was 23 t/s with Qwen3.6-35B-A3B-UD-Q6_K_XL.gguf on RTX 5060ti 16gb
whoisraiden@reddit
What server flags or INI keys do you run it with?
dave-tay@reddit
```
llama-server \
--model ~/llm/qwen/3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8081 \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--spec-draft-ngl 99 \
--ctx-size 131072
```
Qwen3.6-35B-A3B-UD-Q4_K_M: 47 t/s
Qwen3.6-35B-A3B-UD-Q6_K_XL: 29 t/s
whoisraiden@reddit
Okay thank you
Pjotrs@reddit (OP)
That is a crazy jump.
Limp_Classroom_2645@reddit
So 30% more performance on 3090
GlobalLadder9461@reddit
For the Vulkan backend on an AMD APU I am observing at most a 30% increase in speed. What are the results from other Vulkan folks?
StephenSRMMartin@reddit
I didn't see as big a speed up on Vulkan on my 6700xt. But on ROCm it was nearly 2x faster on qwen 3.6 moe. Just awesome.
u23043@reddit
Best result I had on Qwen3.6-27B was 2.2x decode (Vulkan, Strix Halo). 35B-A3B was more like 1.25x.
Icy-Roll-4044@reddit
Congratulations
No_Algae1753@reddit
Have they fixed the slow pp ?
JazzlikeLeave5530@reddit
My gf also asked me that the other night 😔
JayPSec@reddit
Thank you for this early Sunday laugh 😂️
runcertain@reddit
It’s so early that it’s Saturday
JayPSec@reddit
Sunday is whenever a man wants! 😏️
jeremyckahn@reddit
Man, AI people really do live in the future
Borkato@reddit
I know it’s fucking immature but these jokes about pp always get me
tomz17@reddit
The prefill with MTP is always going to be slower since it requires multiple forward passes. This is doubly true with multiple cards linked over a slow interface (e.g. PCIe).
Even with vLLM + NVLink, I still disable MTP for agentic workflows, as the gains from faster generation are almost immediately lost to the prompt processing penalty.
An_Original_ID@reddit
Instead of using --spec-type mtp, I get 2.5x the prompt processing speed when using a draft model.
Anecdotal results:
- MTP = 600 PP
- Draft model = 1400 PP
- Neither = 2200 PP

- MTP: less predictable text = 35 TKs
- Draft: less predictable text = 22 TKs
- MTP: predictable text = 45 TKs
- Draft: predictable text = 50-80 TKs (depending on n predict)

TLDR:
- Long prompt + predictable output = draft model
- Long prompt + less predictable output = MTP
- Short prompt + predictable output = either
- Short prompt + less predictable output = MTP
For my RAG setup, I use a draft model instead of MTP.
Qwen 3.6 27B Q8 on 2 x 3090 with a 250 W limit. YMMV, gl hf gg no re.
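For reference, the draft-model setup is roughly this kind of invocation (paths are placeholders and the exact flag names can differ between llama.cpp builds, so double-check llama-server --help):

```
# Rough sketch: classic speculative decoding with a small same-family draft model,
# instead of --spec-type mtp. The 2B drafts a few tokens per step and the 27B
# verifies them in a single batch.
llama-server \
  --model /path/to/Qwen3.6-27B-Q8_0.gguf \
  --model-draft /path/to/Qwen3.5-2B-Q8_0.gguf \
  --gpu-layers 99 \
  --gpu-layers-draft 99 \
  --draft-max 8 \
  --ctx-size 32768
```

The number of drafted tokens per step is the main knob: push it too high and the acceptance rate drops, so the extra forward passes get wasted.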
thirteen-bit@reddit
What draft model is compatible with Qwen 3.6 27B?
Qwen 3.5 0.8B?
Or do you mean ngram-mod or similar? These do not require a model.
An_Original_ID@reddit
Qwen 3.5 2B and 0.8B. 2B runs the same speed as 0.8B so that's the one I typically use at Q8.
I've also tried Qwen 2.5 Coder 1.5B and got similar results (speed and acceptance rate) as 3.5 2B.
thirteen-bit@reddit
I see, tried it on a single RTX 3090 and got no speedup, but I ran everything in Q4 and on a single card; that's probably the reason there was no improvement.
An_Original_ID@reddit
Could be a few different things. What's your acceptance rate and how many Tokens are you trying to predict? Also, high temp means lower acceptance rate.
thirteen-bit@reddit
Used the MTP test script from the PR itself, so just 192 tokens/request.
Will test with longer prompts and larger context, thank you for the idea. By the way, the combination of a draft model and ngram is working, even for 192 tokens; check the second and third run tps:
ANTONBORODA@reddit
Same question. What draft model do you use for Qwen 3.6 27B? I tried smaller MoE 3.6 models but they don't work.
An_Original_ID@reddit
Qwen 3.5 2B and 0.8B. 2B runs the same speed as 0.8B so that's the one I typically use at Q8.
I've also tried Qwen 2.5 Coder 1.5B and got similar results (speed and acceptance rate) as 3.5 2B.
ANTONBORODA@reddit
Thanks! I thought they were incompatible.
An_Original_ID@reddit
I think as long as the model is from the same family, i.e. has a similar vocabulary, then they should work okay.
Predictable text = code, summarizing text, etc.
Less predictable = chat, role play, etc.
thirteen-bit@reddit
I've just tried the small 3.5 Qwens as draft models for 3.6 27B (Q4_K_M with MTP by Unsloth, all tests with KV caches at Q8_0/Q8_0) on a single RTX 3090.
Downloaded 4-bit quants of Qwen-3.5-0.8B (bartowski Q4_K_M) and Qwen-3.5-2B (bartowski IQ4_XS).
These actually work as drafts for Qwen 3.6 27B, draft acceptance is around 75% in short test (https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090) but tk/s is slightly lower (34-37 tps for 2B and 34-38 for 0.8B) than running Qwen 3.6 27B without any spec-type at all (37-38 tps).
Compared to that, MTP is much better, 50-56 tps. But it was a pain to set up and VRAM requirements are high. Ran all tests with ctx-size 16384.
ngram-mod, if it hits the cache, is 70-75, otherwise the same 37-38 as no draft.
Probably will use ngram-mod for coding with large context, MTP for smaller contexts.
AppealSame4367@reddit
The last comments hint at: no.
ik_llama has a new commit where they introduce multi-speculative decoding, something like ngram + MTP, by the way. It has some performance problems as well, which kill off some of the MTP speedup. But last I checked I had more speed with 27B MTP on ik_llama than with the llama.cpp MTP PR.
No_Algae1753@reddit
What's the point then of mtp if the pp is half the speed?? The speed gain is basically useless if you have a lot of pp going on.
Borkato@reddit
Some of us don’t have very much pp man ☹️
AppealSame4367@reddit
I for one have more than average pp and I wouldn't like it if MTP decreased it. :-|
Borkato@reddit
Damn dude can you send me some weights? Sounds like a hefty model 👀
No_Algae1753@reddit
I swear, from now on I'll just write it out 😓
Borkato@reddit
No please it’s so funny lol
AppealSame4367@reddit
See the edit of my last comment please.
remeh@reddit
It's also the information I'm looking for, but I can't find anything really conclusive.
Alarmed_Wind_4035@reddit
how much vram will it take?
Pjotrs@reddit (OP)
Same. Ish.
It's the processing; check out the MTP model sizes.
dampflokfreund@reddit
No, MTP takes significantly more VRAM. I'm not sure why you would think it'd be same-ish.
SarcasticBaka@reddit
It doesn't at all for me, using Unsloth's Qwen3.6-27B-UD-Q4_K_XL quant with and without MTP the difference in vram usage is about 300mb.
TheTerrasque@reddit
A bit over 1gb extra for me. Same model, with 100k context q8.
DanInVirtualReality@reddit
Looks like about an extra 5GB for unsloth/Qwen3.6-35B-A3B-MTP-GGUF at 1M context according to https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4465608286
But 3GB should be saved by the demonstrated improvement, once that's been submitted and gone through a PR of its own, which I hope happens shortly.
Depends on your definition of same-ish tbh
eblanshey@reddit
Man, compare model sizes for 122B:
Original: https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF
With MTP: https://huggingface.co/unsloth/Qwen3.5-122B-A10B-MTP-GGUF
Q4 is literally double the size.
UmpireBorn3719@reddit
I tested today, It takes a lot vram.
DanInVirtualReality@reddit
A handful of GB - see this comment on the thread at https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4465608286 - I'm watching this user for their next contribution, since the VRAM usage improvement they've suggested was asked to be submitted separately.
I'm too close to the edge of my VRAM usage to worry about this until that's done, but the method is proven, so we shouldn't need to wait too long.
qfox337@reddit
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF says --mmproj (multimodal) is not supported, is that out of date?
alecmuffett@reddit
Is there a typo in the MTP documentation from Unsloth?
Here:
https://unsloth.ai/docs/models/qwen3.6#mtp-guide
MTP Qwen3.6-27B: Thinking mode: Please see Qwen3.6's new Preserved Thinking. General tasks:

```
export LLAMA_CACHE="unsloth/Qwen3.6-27B-MTP-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL \    <<<<<< no mention of MTP
    ...
```
Terminator857@reddit
I'm getting 12 t/s on strix halo. Was getting 4-5 tokens per second without mtp. Command line for server:
~/github/llama.cpp/build/bin/llama-server -m ~/llms/qwen3/6/mtp-27B-UD-Q8_K_XL.gguf --spec-type draft-mtp --spec-draft-n-max 3 -ngl 999 -c 256000 -fa on -ctk q8_0 -ctv q8_0 --no-mmap --temp 0
Prompt: origin of the word pot , similar to the word kettle
Stats:
prompt eval time = 854.09 ms / 21 tokens ( 40.67 ms per token, 24.59 tokens per second)
eval time = 175829.95 ms / 2093 tokens ( 84.01 ms per token, 11.90 tokens per second)
total time = 176684.04 ms / 2114 tokens
draft acceptance rate = 0.66097 ( 1392 accepted / 2106 generated)
4.02.870.898 I statistics draft-mtp: #calls(b,g,a) = 1 702 702, #gen drafts = 702, #acc drafts = 579, #gen tokens = 2106, #acc tokens = 1392, dur(b,g,a) = 0.008, 40884.759, 1.094 ms
u23043@reddit
On my Strix Halo I saw 7.6 -> 18.0 on the 27B Q8_0 with a coding prompt, and 11.0 -> 22.7 on the 27B Q5_K_M. Performance on both peaked with --spec-draft-n-max 4 for me, but this may be prompt dependent.
xxfatumxx@reddit
Can somebody make a PR for the Homebrew llama.cpp formula to bump the version to 9180, please? I know there's a person who does this on a regular basis but I don't know who. In the repo I see only BrewTestBot.
No_Block8640@reddit
Does it get automatically downloaded with LM Studio? Or, if I use LM Studio, do I have to wait until they update their app?
uncoolcat@reddit
If you are using LM Studio, at a minimum they will need to add support for the updated llama.cpp engine runtimes. The LM Studio application itself might need an update as well.
Runtimes can update automatically, but I think the application itself just prompts when a new version is available. Either way, you can check manually by going to settings > general for application updates, and settings > runtime for runtime updates. Beta channel updates might have the update faster, but as of now there isn't an update available for this. New runtimes in LM Studio aren't available immediately, and can take a bit (days to possibly even weeks in some cases).
relmny@reddit
Nice!
So MTP took a few days, while turboquant is still not there. I can't stop thinking that they don't really take turboquant too seriously.
sonicnerd14@reddit
Yeah, I don't get it. It's proven that the benefit is that we get more ctx with practically very little memory cost. Just as useful as MTP to get merged in, if you ask me. Seems there's just too much misinformation going on about it that's creating confusion about its usefulness.
dodiyeztr@reddit
Try weeks
JGeek00@reddit
What’s the trade off on using MTP?
RnRau@reddit
Slightly more vram usage from my understanding.
sonicnerd14@reddit
That and also there doesn't seem to be any fork that supports native image input yet with MTP on. Unless I've been setting something up wrong.
AppealSame4367@reddit
And slower prompt processing. Even slower tg on 35B (MoE), at least ik_llama had that problem with it.
Limp_Classroom_2645@reddit
That should be fixed with further optimizations.
coherentspoon@reddit
are you guys seeing better performance with spec-draft-n-max 2 or 3?
unjustifiably_angry@reddit
Where are things on dflash?
redaktid@reddit
Gguf now!
Pjotrs@reddit (OP)
They are waiting...
AppealSame4367@reddit
I don't get it. There are already Unsloth MTP GGUFs. They even hint at the new param syntax.
redaktid@reddit
Bad tense, meant like not gguf when gguf now lol
SkyFeistyLlama8@reddit
Confirmed, Unsloth Qwen 3.6 35B MOE MTP works. I built llama-server with all latest commits and I'm getting 1.8x token generation speedup, with the MXFP4 MOE GGUF.
soyalemujica@reddit
Gave this a try with Vulkan + 7900 XTX. The speed is nice, like 27B at 50 t/s up to 90 t/s; however, it did reduce my context size to 110k with the Q4 model and 50k with Q5.
DiscipleofDeceit666@reddit
I got this working on my RDNA2 setup: RX 6800 + RX 6700 XT, 28 GB VRAM. Prompt read and prompt write speeds both improved. The 27B finally reaches usable speeds too, from ~10 tok/s on Windows all the way to ~22 tok/s with MTP on ROCm. The gains are real!
Defiant-Morning4442@reddit
damn finally this is happening... let's goo
DiscipleofDeceit666@reddit
I built this from source and saw massive improvements. 3 days ago, Qwen3.6 MoE was happily running at 30-40 tok/s on Vulkan. Today, I hit 70 with this and ROCm.
Qwen 27B jumped from 10-15 tok/s on Vulkan to 20-25 tok/s with this.
Fucking stoked.
a_beautiful_rhind@reddit
So you guys didn't just merge the PR locally before? I mean, it's really easy to try things out.
Rikers88@reddit
Waiting for the turboquant one as well! Is this also related to the dflash?
dzedaj@reddit
now when turboquant
El_90@reddit
Sincere question, and I'm incredibly grateful for all the work everyone has done, thank you.
Am I misunderstanding why people refer to this as a game changer? PP is no faster, and TG is 1.5-2x faster, so overall this is a ~20% improvement? The difference between models is greater than this, so why is 20% a game changer? It's amazing and I love it, but it's not an order-of-magnitude change?? I think I'm overlooking an angle... :)
crantob@reddit
1.5x = 50% faster
2.0x = 100% faster
El_90@reddit
But if TG is only half the process (the rest being PP), you need to halve that gain for the overall improvement.
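Roughly, assuming a purely hypothetical 50/50 split of wall-clock time between PP and TG, and only TG speeding up:

```
overall speedup = 1 / (pp_share + tg_share / tg_speedup)
                = 1 / (0.5 + 0.5 / 1.5) ≈ 1.2x   (~20% for a 1.5x TG gain)
                = 1 / (0.5 + 0.5 / 2.0) ≈ 1.33x  (~33% for a 2x TG gain)
```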
nickm_27@reddit
It's not half, especially with prompt caching; it depends on what you're using the model for.
ethertype@reddit
For the math nerds: does 2x performance count as an order of magnitude when counting in binary?
ImpressiveSuperfluit@reddit
By definition, yes.
Maleficent-Ad5999@reddit
I’m no expert.. I’m speaking solely from my experience with Qwen 3.6 27B model, the best available local model which is now my daily driver.
Previously it used to generate at 50-55 tps with llama.cpp, which is usable for chat.. but when I heard about MTP, I tried vLLM for the first time and was blown away to see 110-120 tps, which is so good and enabled me to do proper agentic coding.
I even finished a complete project with just this setup and a single 5090.
The token generation speed is nearly 2x now.
I think the main reason why it is a game changer is the architecture.. generating a new token is costly, but predicting the next couple of tokens while generating the first token and then validating them is cheaper.. saves a lot of compute.. I came across this comment: "calculating the factors of a number is labour intensive, but verifying a factor is cheap and quick".
Xonzo@reddit
50% to 100% faster is a HUGE upgrade….? It could be an entirely different hardware class. Like my 5070 Ti / 5060 Ti going from 1100 PP / 30 TG to 1100 / 60 is huge on Qwen 27B. For agentic use I really start to feel it when PP is < 1,000 and TG < 40.
chocofoxy@reddit
now turboQuant please
dampflokfreund@reddit
Have they fixed the slower prompt processing?
soyalemujica@reddit
I don't think so, atm I am at 160t/s with a 7900XTX
erm_what_@reddit
For what model though
soyalemujica@reddit
Qwen 27B
Pristine-Tax4418@reddit
Is Qwen3.5-122B supported?
Pjotrs@reddit (OP)
Check Unsloth's GGUFs, it is there.
iamapizza@reddit
Any idea what this is implying, why wouldn't we use ngram + MTP together on NVidia GPUs?
CockBrother@reddit
Try it. You'll probably find out what the warning is about fairly quickly. He's saying it's not reliable, not accurate, or both.
Sounds pretty great to be able to use two strategies for token prediction at the same time though.
iamapizza@reddit
Ok will do, thanks. It does sound like a great combo if this could work. Interested to find out why they specifically called out CUDA here. Hope it isn't some huge limitation.
nickm_27@reddit
I believe I saw some comments about it. Just an issue with CUDA memory transfers so it might be slower with both enabled. Fixing this is on their post merge to-do list
gyzerok@reddit
It is implying that people like yourself should not use these combinations
iamapizza@reddit
Sorry, just trying to learn
AppealSame4367@reddit
ik_llama has a commit that already introduced this as well.
An0n_A55a551n@reddit
Now waiting for TurboQuant PR to get merged 😔
Goldandsilverape99@reddit
Hmm, the reason I asked about support for mmproj/vision when using MTP is that I tried a vibe question that failed when the mmproj was active. So it might be "supported", but the model was more stupid when using vision.
Goldandsilverape99@reddit
I got it to work, using a rebuilt mainline llama.cpp and a re-downloaded mmproj file... so it actually seems to work now...
markole@reddit
Is this Qwen3.6 only for now? Interested to see Gemma 4 MTP support.
nickm_27@reddit
Yes, Aman said Gemma will follow soon
OldComposerbruh@reddit
Super!! Let’s go
CreekyK@reddit
Letsgoo!
kwizzle@reddit
This is amazing news
Strilion@reddit
Seems like it just got merged
b0tm0de@reddit
It is merged. Waiting for it to compile. Now I'm starting to download my new models :)
BlackBeardAI@reddit
Bring it on!
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
initalSlide@reddit
Yahaaaa finally
Borkato@reddit
Bro is a Korok
IKerimI@reddit
Merge it!
rossimo@reddit
we are just all losing it over here
deaday@reddit
Ship it!