That's good news...
Posted by Pjotrs@reddit | LocalLLaMA | View on Reddit | 206 comments
Looks like it's finally happening... MTP is getting approved for llama.cpp.
Time to prepare for the update.
bernzyman@reddit
Has the vision fix been included? MTP is designed to be compatible with vision, but a bug had been preventing it from working. Don't know if the fix has been merged upstream as well?
sonicnerd14@reddit
I hope. Haven't heard enough people talk about this. Definitely something that should be rectified asap.
bernzyman@reddit
I won’t have a chance to compile this latest build today to check. But I have a custom build which includes the fix and so I know it works. Great if upstream incorporates MTP and the vision fix
StorageHungry8380@reddit
It was reported in the other thread that the PR says it's supported.
bernzyman@reddit
That note was there (in the notes of the MTP-enabled GGUFs on HF) when the bug was in effect. It's stating how vision should be compatible with MTP.
FatheredPuma81@reddit
So uhh... what am I doing wrong? Seems like MTP is only useful for under 30k context single slot inputs...
Qwen3.6 35B 1 slot fresh context: 185t/s (150t/s without MTP)
Qwen3.6 35B 1 slot 40k context: 100t/s (135t/s without MTP)
Qwen3.6 35B 2 slot 40k context: 50t/s (95t/s without MTP)
Qwen3.6 27B 1 slot fresh context: 90t/s (50-ish without MTP)
Qwen3.6 27B 1 slot 40k context: 50t/s (45t/s without MTP. 57t/s spec-draft-n-max = 2 instead of 3)
Qwen3.6 27B 2 slot 40k context: 30t/s (36t/s without MTP. 28t/s spec-draft-n-max = 2 instead of 3)
FullstackSensei@reddit
Link to the PR: https://github.com/ggml-org/llama.cpp/pull/22673
Limp_Classroom_2645@reddit
Merged
Glad_Claim_6287@reddit
u/yags-lms LM studio when
cafedude@reddit
LM Studio uses llama.cpp; however, I doubt they have the spec=mtp option in their GUI. That's something they'd need to add.
theUmo@reddit
Good luck figuring out which version they're running
FullstackSensei@reddit
Wrappers tend to be quite behind in adding features. There's also a non-zero chance they'll screw up adding it in a way that makes it useless for anything and anyone besides whatever happy flow they thought of.
Glad_Claim_6287@reddit
Yeah that's my question man.
Limp_Classroom_2645@reddit
Whenever
f4nt4@reddit
Build pipeline for that PR/release: https://github.com/ggml-org/llama.cpp/actions/runs/25961507493
FullstackSensei@reddit
Real men pull master and build from source! /s
robertpro01@reddit
Lol, I am not a real man anymore, I'm just another vibe coder at this point
tempedbyfate@reddit
It literally takes me 40 seconds to download and build the repo. Just put together a basic shell script that does the clone/pull from the master branch and builds with your desired args.
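Something along these lines works as a minimal sketch (the repo location and the CUDA flag are just examples; swap in whatever backend flags you normally build with):

```
#!/usr/bin/env bash
# Minimal sketch: clone llama.cpp once, then pull master and rebuild.
set -euo pipefail

REPO_DIR="$HOME/src/llama.cpp"   # example location, put it wherever you like

if [ ! -d "$REPO_DIR" ]; then
    git clone https://github.com/ggml-org/llama.cpp "$REPO_DIR"
fi

cd "$REPO_DIR"
git checkout master
git pull

# Pick the build flags for your backend; CUDA shown here as an example.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j "$(nproc)"
```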
Deep90@reddit
Plus you can literally use an llm to build it for you if you don't know how.
LocoMod@reddit
Builds fast on an M-series Mac. On a top tier PC it takes forever.
BigPoppaK78@reddit
Forever? I can build a whole new podman image (with CUDA) in a few minutes. Running Fedora on a 7950x3d.
FullstackSensei@reddit
But dude might have been comparing to that $500 laptop with an i3 from 15 years ago, because as we all know, technology only advances in fruity land
snmnky9490@reddit
And the "average" Apple laptop being compared even for like basic consumer use is often $2000-3000 vs that 2012 Windows one that was the cheapest thing Bestbuy had in stock.
robertpro01@reddit
Yeah, I have a script for compiling it as well
ionizing@reddit
I even added a menu entry to trigger the shell script from my app. "Server -> Rebuild llama.cpp" checks if there is a new release, then gets it and refreshes the local build.
Deep90@reddit
Real men pull the PR branch and build from source 😜
Borkato@reddit
Wait are those basically replacing git pull && git checkout latest-branch-thingy && cmake whatever && cmake cuda build?
Comfortable-Rock-498@reddit
Georgi Gerganov has done more to improve the world than most if not all AI CEOs
mdziekon@reddit
It's not just Georgi, don't forget all the contributors who provide their support either in code or testing. This PR was created by Aman Gupta.
Comfortable-Rock-498@reddit
You're right. A large number of contributors never get recognition. Come to think of it, it's not a bad idea to create a recognition portal that automatically fetches all OSS contributions so people can get credit for them.
I think shit like Forbes 30 under 30 is coveted only because we do not have means to recognize and reward contributions to the community, and most people want to be recognized for and feel proud of their work
relmny@reddit
Taking into account that GG/llama.cpp probably gets almost no recognition from "big" companies/projects/media/etc because of crap like ollama, the contributors get below zero... it's like the 1% of the 1%...
Fuck ollama and the likes! long live llama.cpp!
Ok_Scientist_8803@reddit
Llama.cpp is complicated* but it gets you everything. Ollama is supposed to be the easiest, but LM Studio seems to be even easier, while not hiding the fact that it uses llama.cpp.
*Compared to GUI only methods that a non tech savvy person could use.
I believe ollama used to top Google search results for something like "run ai on my computer", but nowadays it's further down.
Imaginary-Unit-3267@reddit
You know you've become a true Linux nerd when "llama.cpp is complicated, GUIs are not" has you scratching your head and thinking "but you have to click all those buttons in a GUI! with llama.cpp you just type in a few options and flags! it's ez!" Sigh...
ShaneBowen@reddit
Llama.cpp is the ffmpeg of AI.
Comfortable-Rock-498@reddit
True about Ollama. I recently built a coding agent which started as a Cline fork. I removed Ollama support just on principle
tiffanytrashcan@reddit
https://i.redd.it/iezmjnzg9i1h1.gif
InsensitiveClown@reddit
You know what would really be helpful? Not a recognition portal, but a bounty portal. You want MTP or Turboquant, Rotoquant? Sure, that's great, but implementing features and doing fixes takes time. Unpaid. Away from your own family, your children. And in the end, most people just feel entitled and spam everything demanding, not asking, demanding, feature X, Y, Z. Software development costs time and effort, and we're all human. I'm not directing this at you in particular. It's a notorious problem. For example, the developers of curl or FFmpeg: their projects are used everywhere, and yet, as critical parts of infrastructure, what do they have, monetarily, that can stabilize them in order to work full time on it? These particular cases are relatively extreme and stabilized, but there are many cases where corporations, and in some cases users, just take, demand, and give nothing in return.
It is an unequal situation, that generates high attrition, burnout, and inevitably results in excellent developers with a good overview of a codebase, leaving because not leaving has a tremendous personal, professional, economical cost.
Imaginary-Unit-3267@reddit
I agree with the other commenters that this would lead to bad incentives. However, there is an alternative. You can donate money to maintainers of your favorite software projects. :)
raikounov@reddit
Your intentions are good, but that's how you get the maintainers burnt out reviewing hundreds of slop PRs trying to get the bounty.
Eisenstein@reddit
Sounds like a solution ripe with unintended consequences. Once you monetize something, you get a completely different incentive structure than the one you had. This leads to effects, some of which can be predicted and some of which cannot.
dnsod_si666@reddit
How would you ensure a bounty has been completed before paying out? Or would it be on the bounty-setter to decide and then if they have a bad reputation of not paying fairly, devs won’t take their bounties?
InsensitiveClown@reddit
If the bounty is "implement feature X in app Y for reward Z", then if someone accepts, they would need to submit a PR and go through the reviews; when it is merged it is done, or at least when there are no further reviews and it is pending acceptance. That's clearly done: the code is public, the PR submitted and reviewed. There are escrows and networks for FLOSS bounties, see: https://wiki.p2pfoundation.net/Crowdfunding_for_Free_Software_and_Free_Hardware_Projects
tmflynnt@reddit
Hmmm, interesting ideas.. Just spitballing here: I wonder if something like this could use a kind of escrow setup and allow others to join the bounty too?
Maybe below a certain threshold, established devs would get the benefit of the doubt when they claim completion (probably with a basic AI verification backing it) and it could get paid out automatically. And maybe above a certain $$ threshold, or where there is a verification problem or an appeal, an independent human reviewer could be involved (with a small cut going toward keeping the overall system running)?
am17an@reddit
Yup!
Comfortable-Rock-498@reddit
More people should read this comment ^
gh0stwriter1234@reddit
Actually, the opposite is true: just like Reddit karma farming, people farm contribution-list attribution on GitHub. In fact, the guy who did the MTP PR was also helping deal with this issue in llama.cpp itself, by blocking trivial commits as your first commit to llama.cpp. The idea is they'd rather have more substantial contributors than 1000 people "fixing" typos just to get on the list.
Plabbi@reddit
That makes no sense. Then we wouldn't have any models and llama.cpp would be useless.
Comfortable-Rock-498@reddit
Plabbi@reddit
No I'm not. Researchers have no authority to decide to publish the open weights.
AnOnlineHandle@reddit
Who do you think does 99.999% of the actual critical work, the person with the relevant education who shows up day after day and makes it happen, or the person who says they can release the weights after and gets the credit?
Plabbi@reddit
That has absolutely nothing to do with it. The CEO makes these decisions, or even the board.
You cannot independently decide to open source the software that your company is producing, no matter whether you do 50% or 100% of the work.
What has this got to do with anything?
AnOnlineHandle@reddit
Uh huh.
m7l5@reddit
I see your point. I actually give Zuckerberg one good point for being the one who triggered that initiative, even if the reason made sense for them economically at the time.
More-Curious816@reddit
It was Yann LeCun; he was the chief AI scientist at Facebook and the one who shaped Facebook's open source culture. He has left Facebook now, and you can see that Zuckerberg and his folks immediately decided not to release their new models.
LumpyWelds@reddit
Crap.. I was deluded. Thanks for clearing that up!
crantob@reddit
DingDingDing. Remember and respect the name Yann LeCun.
MuDotGen@reddit
Just tried it with an Intel Arc 140V on Windows Vulkan and Qwen3.6-35B-A3B-MTP IQ4_K_XS, and I was seeing worse speeds: n=2 was best, but still worse than a single pass. I know this is best for Nvidia GPUs but thought I'd try it nonetheless.
xjE4644Eyc@reddit
Yes, the t/s improved, but the prompt processing decreased. The overall time to token completion was slower with MTP enabled, at least with Strix Halo. I'll post my results in a bit.
ElementNumber6@reddit
MTP benefits are very situation-dependent.
Task-Dependent Efficiency: MTP thrives on low-entropy tasks with rigid syntax (e.g., coding) where the acceptance rate for multi-token guesses is high. For creative writing or complex tool calls, acceptance rates plummet, meaning the extra compute spent generating drafts is wasted.
Training Complexity: The model requires specialized pre-training or fine-tuning. Generating multiple tokens at once requires complex architectures to prevent grammatical mismatches.
KV Cache Overhead: Predicting deeper MTP steps (e.g., MTP > 3) requires storing and evaluating significantly larger parallel hidden states. This KV cache overhead can eat into your context capacity on smaller hardware setups.
Poor Fit for MoE: While MTP works beautifully on dense models, Mixture of Experts (MoE) architectures struggle with MTP heads, often yielding little to no inference speedup in practice.
spaceman3000@reddit
Yes, this is the issue with MTP right now, and Strix Halo is painfully slow at PP even without MTP due to low GTT bandwidth, so I'll pass for now.
Goldandsilverape99@reddit
Is support for mmproj/vision included when using mtp?
DoorStuckSickDuck@reddit
Also doesn't support parallel I think
Borkato@reddit
What’s the point of parallel if you’re the only one using it? I’ve never understood that
huzbum@reddit
So if there is more than one context it doesn't blow out your cache and have to re-process everything. This could be agents with sub-agents, or just a UI that does summaries or something.
I use multiple tools, and some of those tools use multiple requests/contexts. Like IntelliJ IDEA AI chat runs like 6+ parallel requests, so I set parallel to 8 and have it cached to system memory with `--cache-ram`.
Otherwise, if you have a super long conversation, it should only have to process the new message, but if any other request comes in, it blows out the cache and has to re-process the entire conversation. It's the difference between less than 1 second to first token and like 10+ seconds.
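For reference, the kind of invocation I mean looks roughly like this (the model path and the --cache-ram size are just placeholders; check llama-server --help on your build for the exact flags):

```
# Illustrative sketch: 8 server slots so parallel requests don't evict each other's
# KV cache, plus a host-RAM prompt cache for slots that do get swapped out.
llama-server \
  --model /path/to/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  --ctx-size 65536 \
  --parallel 8 \
  --cache-ram 16384
```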
Borkato@reddit
This is super helpful thank you!
nickm_27@reddit
Having multiple cache slots so different workloads all have their own cache.
Borkato@reddit
Uh what does that mean? Like agents?
BigPoppaK78@reddit
Yeah, that's one possibility. Agents and sub-agents - so, the ability to delegate tasks and run parallel/background processes that don't bloat up your main context with irrelevant text.
But, if you're also the type of person to have multiple chats going, like one main conversation and then multiple smaller ones where you ask quick one-off questions, this helps with responsiveness.
Borkato@reddit
Oh! I didn't know that, this makes a lot of sense actually, thank you!
nickm_27@reddit
That was also fixed
Endlesscrysis@reddit
One of the first lines is:
"Tip: MTP is compatible with Vision input."
audioen@reddit
I have tested it on the PR and it works just fine.
What didn't work is the combination of e.g. draft-mtp and ngram-simple, as it seems to disable some recurrent-state fallback gizmo. I want something like ngram-simple so that whenever the model is reciting code or its own reasoning, it can go ultra-fast.
iamapizza@reddit
Looking at the example they gave:
That seems to imply it does support it, and you'd pass the flag to disable it? Or am I reading too much into it?
alew3@reddit
Getting 105-110 tokens/s with unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-Q8_0.gguf on an RTX 5090.
dodiyeztr@reddit
Context window: 20k
Deep90@reddit
Why would the context window be 20k?
They are correct. I've been running it for like a week or so now at full context because I had the sense to just build the PR branch.
alew3@reddit
260K, spilling over to RAM. Prompt processing gets very slow though
fredandlunchbox@reddit
Add tq4?
coherentspoon@reddit
is the Q8 really worth it over like Q4 or Q5?
Icy_Butterscotch6661@reddit
So say the rumors.
UmpireBorn3719@reddit
what is your max ctx size tho?
alew3@reddit
260K, spilling over to RAM. Prompt processing gets very slow though .. going to try Q5
crantob@reddit
useful. ty. what were you getting without MTP on the same hardware?
RoutineProperty7061@reddit
The prefill degradation has been fixed?
spaceman3000@reddit
No
spacenavy90@reddit
LM Studio support when?
dave-tay@reddit
Sweet, 40 t/s with Qwen3.6-35B-A3B-UD-Q4_K_M
StephenSRMMartin@reddit
I went from 28 to 48 t/s on my AMD 6700xt on ROCm. Awesome.
Pjotrs@reddit (OP)
And before? On 16GB vram I get 45-50 on 4060 and 55-60 on 5070.
Without MTP.
dave-tay@reddit
Before was 23 t/s with Qwen3.6-35B-A3B-UD-Q6_K_XL.gguf on RTX 5060ti 16gb
whoisraiden@reddit
What server flags or INI keys do you run it with?
dave-tay@reddit
```
llama-server \
--model ~/llm/qwen/3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8081 \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--spec-draft-ngl 99 \
--ctx-size 131072
```
Qwen3.6-35B-A3B-UD-Q4_K_M: 47 t/s
Qwen3.6-35B-A3B-UD-Q6_K_XL: 29 t/s
whoisraiden@reddit
Okay thank you
Pjotrs@reddit (OP)
That is a crazy jump.
Limp_Classroom_2645@reddit
So 30% more performance on 3090
GlobalLadder9461@reddit
For the Vulkan backend on an AMD APU I am observing at most a 30% increase in speed. What are the results from other Vulkan folks?
StephenSRMMartin@reddit
I didn't see as big a speed up on Vulkan on my 6700xt. But on ROCm it was nearly 2x faster on qwen 3.6 moe. Just awesome.
u23043@reddit
Best result I had on Qwen3.6-27B was 2.2x decode (Vulkan, Strix Halo). 35B-A3B was more like 1.25x.
Icy-Roll-4044@reddit
Congratulations
No_Algae1753@reddit
Have they fixed the slow pp ?
JazzlikeLeave5530@reddit
My gf also asked me that the other night 😔
JayPSec@reddit
Thank you for this early Sunday laugh 😂️
runcertain@reddit
It’s so early that it’s Saturday
JayPSec@reddit
Sunday is whenever a man wants! 😏️
jeremyckahn@reddit
Man, AI people really do live in the future
Borkato@reddit
I know it’s fucking immature but these jokes about pp always get me
tomz17@reddit
The prefill with MTP is always going to be slower since it requires multiple forward passes. This is doubly true with multiple cards linked over a slow interface (e.g. PCIe).
Even with vLLM + NVLink, I still disable MTP for agentic workflows, as the gains from faster generation are almost immediately lost to the prompt processing penalty.
An_Original_ID@reddit
Instead of using --spec-type mtp, I get 2.5x the prompt processing speed when using a draft model.
Anecdotal results:
- MTP = 600 PP
- Draft model = 1400 PP
- Neither = 2200 PP

- MTP: less predictable text = 35 TKs
- Draft: less predictable text = 22 TKs
- MTP: predictable text = 45 TKs
- Draft: predictable text = 50-80 TKs (depending on n predict)

TLDR:
- Long prompt + predictable output = draft model
- Long prompt + less predictable output = MTP
- Short prompt + predictable output = either
- Short prompt + less predictable output = MTP
For my RAG setup, I use a draft model instead of MTP.
Qwen 3.6 27B Q8 on 2 x 3090 with a 250 W limit. YMMV, gl hf gg no re.
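For reference, the draft-model setup is roughly this kind of invocation (paths are placeholders and the exact flag names can differ between llama.cpp builds, so double-check llama-server --help):

```
# Rough sketch: classic speculative decoding with a small same-family draft model,
# instead of --spec-type mtp. The 2B drafts a few tokens per step and the 27B
# verifies them in a single batch.
llama-server \
  --model /path/to/Qwen3.6-27B-Q8_0.gguf \
  --model-draft /path/to/Qwen3.5-2B-Q8_0.gguf \
  --gpu-layers 99 \
  --gpu-layers-draft 99 \
  --draft-max 8 \
  --ctx-size 32768
```

The number of drafted tokens per step is the main knob: push it too high and the acceptance rate drops, so the extra forward passes get wasted.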
thirteen-bit@reddit
What draft model is compatible with Qwen 3.6 27B?
Qwen 3.5 0.8B?
Or do you mean ngram-mod or similar? These do not require a model.
An_Original_ID@reddit
Qwen 3.5 2B and 0.8B. 2B runs the same speed as 0.8B so that's the one I typically use at Q8.
I've also tried Qwen 2.5 Coder 1.5B and got similar results (speed and acceptance rate) as 3.5 2B.
thirteen-bit@reddit
I see, tried it on a single RTX 3090 and got no speedup, but I ran everything in Q4 and on a single card; that's probably the reason there was no improvement.
An_Original_ID@reddit
Could be a few different things. What's your acceptance rate and how many Tokens are you trying to predict? Also, high temp means lower acceptance rate.
thirteen-bit@reddit
Used the MTP test script from the PR itself, so just 192 tokens/request.
Will test with longer prompts and larger context, thank you for the idea. By the way, the combination of a draft model and ngram is working, even for 192 tokens; check the second and third run tps:
ANTONBORODA@reddit
Same question. What draft model do you use for Qwen 3.6 27B? I tried smaller MoE 3.6 models but they don't work.
An_Original_ID@reddit
Qwen 3.5 2B and 0.8B. 2B runs the same speed as 0.8B so that's the one I typically use at Q8.
I've also tried Qwen 2.5 Coder 1.5B and got similar results (speed and acceptance rate) as 3.5 2B.
ANTONBORODA@reddit
Thanks! I thought they were incompatible.
An_Original_ID@reddit
I think as long as the model is from the same family, i.e. has a similar vocabulary, then they should work okay.
Predictable text = code, summarizing text, etc.
Less predictable = chat, role play, etc.
thirteen-bit@reddit
I've just tried the small 3.5 Qwens as draft models for 3.6 27B (Q4_K_M with MTP by Unsloth, all tests with KV caches at Q8_0/Q8_0) on a single RTX 3090.
Downloaded 4-bit quants of Qwen-3.5-0.8B (bartowski Q4_K_M) and Qwen-3.5-2B (bartowski IQ4_XS).
These actually work as drafts for Qwen 3.6 27B, draft acceptance is around 75% in short test (https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090) but tk/s is slightly lower (34-37 tps for 2B and 34-38 for 0.8B) than running Qwen 3.6 27B without any spec-type at all (37-38 tps).
Compared to that, MTP is much better, 50-56 tps. But it was a pain to set up and VRAM requirements are high. Ran all tests with ctx-size 16384.
ngram-mod, if it hits the cache, is 70-75, otherwise the same 37-38 as no draft.
Probably will use ngram-mod for coding with large context, MTP for smaller contexts.
AppealSame4367@reddit
The last comments hint at: no.
ik_llama has a new commit where they introduce multi-speculative decoding, something like ngram + MTP, by the way. It has some performance problems as well, which kill off some of the MTP speedup. But last I checked I had more speed with 27B MTP on ik_llama than with the llama.cpp MTP PR.
No_Algae1753@reddit
What's the point then of mtp if the pp is half the speed?? The speed gain is basically useless if you have a lot of pp going on.
Borkato@reddit
Some of us don’t have very much pp man ☹️
AppealSame4367@reddit
I for one have more than average pp and I wouldn't like it if MTP decreased it. :-|
Borkato@reddit
Damn dude can you send me some weights? Sounds like a hefty model 👀
No_Algae1753@reddit
I swear, from now on I'll just write it out 😓
Borkato@reddit
No please it’s so funny lol
AppealSame4367@reddit
See the edit of my last comment please.
remeh@reddit
It's also the information I'm looking for, but I can't find anything really conclusive.
Alarmed_Wind_4035@reddit
how much vram will it take?
Pjotrs@reddit (OP)
Same. Ish.
It's the processing; check out the MTP model sizes.
dampflokfreund@reddit
No, MTP takes significantly more VRAM. I'm not sure why you would think it'd be same-ish.
SarcasticBaka@reddit
It doesn't at all for me, using Unsloth's Qwen3.6-27B-UD-Q4_K_XL quant with and without MTP the difference in vram usage is about 300mb.
TheTerrasque@reddit
A bit over 1gb extra for me. Same model, with 100k context q8.
DanInVirtualReality@reddit
Looks like about an extra 5GB for unsloth/Qwen3.6-35B-A3B-MTP-GGUF at 1M context according to https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4465608286
But 3GB should be saved by the demonstrated improvement, once that's been submitted and gone through a PR of its own, which I hope happens shortly.
Depends on your definition of same-ish tbh
eblanshey@reddit
Man, compare model sizes for 122B:
Original: https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF
With MTP: https://huggingface.co/unsloth/Qwen3.5-122B-A10B-MTP-GGUF
Q4 is literally double the size.
UmpireBorn3719@reddit
I tested today, It takes a lot vram.
DanInVirtualReality@reddit
A handful of GB - see this comment on the thread at https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4465608286 - I'm watching this user for their next contribution, since the VRAM usage improvement they've suggested was asked to be submitted separately.
I'm too close to the edge of my VRAM usage to worry about this until that's done, but the method is proven, so we shouldn't need to wait too long.
qfox337@reddit
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF says --mmproj (multimodal) is not supported, is that out of date?
alecmuffett@reddit
Is there a typo in the MTP documentation from Unsloth?
Here:
https://unsloth.ai/docs/models/qwen3.6#mtp-guide
MTP Qwen3.6-27B: Thinking mode: Please see Qwen3.6's new Preserved Thinking. General tasks:

```
export LLAMA_CACHE="unsloth/Qwen3.6-27B-MTP-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL \    <<<<<< no mention of MTP
    ...
```
Terminator857@reddit
I'm getting 12 t/s on strix halo. Was getting 4-5 tokens per second without mtp. Command line for server:
~/github/llama.cpp/build/bin/llama-server -m ~/llms/qwen3/6/mtp-27B-UD-Q8_K_XL.gguf --spec-type draft-mtp --spec-draft-n-max 3 -ngl 999 -c 256000 -fa on -ctk q8_0 -ctv q8_0 --no-mmap --temp 0
Prompt: origin of the word pot , similar to the word kettle
Stats:
prompt eval time = 854.09 ms / 21 tokens ( 40.67 ms per token, 24.59 tokens per second)
eval time = 175829.95 ms / 2093 tokens ( 84.01 ms per token, 11.90 tokens per second)
total time = 176684.04 ms / 2114 tokens
draft acceptance rate = 0.66097 ( 1392 accepted / 2106 generated)
4.02.870.898 I statistics draft-mtp: #calls(b,g,a) = 1 702 702, #gen drafts = 702, #acc drafts = 579, #gen tokens = 2106, #acc tokens = 1392, dur(b,g,a) = 0.008, 40884.759, 1.094 ms
u23043@reddit
On my Strix Halo I saw 7.6 -> 18.0 on the 27B Q8_0 with a coding prompt, and 11.0 -> 22.7 on the 27B Q5_K_M. Performance on both peaked with --spec-draft-n-max 4 for me, but this may be prompt dependent.
xxfatumxx@reddit
Can somebody make a PR for the Homebrew llama.cpp formula to bump the version to 9180, please? I know there's a person who does this on a regular basis but I don't know who. In the repo I see only BrewTestBot.
No_Block8640@reddit
Does it get automatically downloaded with LM Studio? Or, if I use LM Studio, do I have to wait until they update their app?
uncoolcat@reddit
If you are using LM Studio, at a minimum they will need to add support for the updated llama.cpp engine runtimes. The LM Studio application itself might need an update as well.
Runtimes can update automatically, but I think the application itself just prompts when a new version is available. Either way, you can check manually by going to settings > general for application updates, and settings > runtime for runtime updates. Beta channel updates might have the update faster, but as of now there isn't an update available for this. New runtimes in LM Studio aren't available immediately, and can take a bit (days to possibly even weeks in some cases).
relmny@reddit
Nice!
So MTP took a few days, while turboquant is still not there. I can't stop thinking that they don't really take turboquant too seriously.
sonicnerd14@reddit
Yeah, I don't get it. It's proven that the benefit is that we get more ctx with practically very little memory cost. Just as useful as MTP to get merged in, if you ask me. Seems there's just too much misinformation going on about it that's creating confusion about its usefulness.
dodiyeztr@reddit
Try weeks
JGeek00@reddit
What’s the trade off on using MTP?
RnRau@reddit
Slightly more vram usage from my understanding.
sonicnerd14@reddit
That and also there doesn't seem to be any fork that supports native image input yet with MTP on. Unless I've been setting something up wrong.
AppealSame4367@reddit
And slower prompt processing. Even slower tg on 35B (MoE), at least ik_llama had that problem with it.
Limp_Classroom_2645@reddit
That should be fixed with further optimizations.
coherentspoon@reddit
are you guys seeing better performance with spec-draft-n-max 2 or 3?
unjustifiably_angry@reddit
Where are things on dflash?
redaktid@reddit
Gguf now!
Pjotrs@reddit (OP)
They are waiting...
AppealSame4367@reddit
I don't get it. There are already Unsloth MTP GGUFs. They even hint at the new param syntax.
redaktid@reddit
Bad tense, meant like not gguf when gguf now lol
SkyFeistyLlama8@reddit
Confirmed, Unsloth Qwen 3.6 35B MOE MTP works. I built llama-server with all latest commits and I'm getting 1.8x token generation speedup, with the MXFP4 MOE GGUF.
soyalemujica@reddit
Gave this a try with Vulkan + 7900 XTX. The speed is nice, like 27B at 50 t/s up to 90 t/s; however, it did reduce my context size to 110k with the Q4 model and 50k with Q5.
DiscipleofDeceit666@reddit
I got this working on my RDNA2 setup: RX 6800 + RX 6700 XT, 28 GB VRAM. Prompt read and prompt write speeds both improved. The 27B finally reaches usable speeds too, from ~10 tok/s on Windows all the way to ~22 tok/s with MTP on ROCm. The gains are real!
Defiant-Morning4442@reddit
damn finally this is happening... let's goo
DiscipleofDeceit666@reddit
I built this from source and saw massive improvements. 3 days ago, Qwen3.6 MoE was happily running at 30-40 tok/s on Vulkan. Today, I hit 70 with this and ROCm.
Qwen 27B jumped from 10-15 tok/s on Vulkan to 20-25 tok/s with this.
Fucking stoked.
a_beautiful_rhind@reddit
So you guys didn't just merge the PR locally before? I mean, it's really easy to try things out.
Rikers88@reddit
Waiting for the turboquant one as well! Is this also related to the dflash?
dzedaj@reddit
now when turboquant
El_90@reddit
Sincere question, and I'm incredibly grateful for all the work everyone has done, thank you.
Am I misunderstanding why people refer to this as a game changer? PP is no faster, and TG is 1.5-2x faster, so overall this is a ~20% improvement? The difference between models is greater than this, so why is 20% a game changer? It's amazing and I love it, but it's not an order-of-magnitude change?? I think I'm overlooking an angle... :)
crantob@reddit
1.5x = 50% faster
2.0x = 100% faster
El_90@reddit
But if TG is only half the process (the rest being PP), you need to halve that gain for the overall improvement.
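Roughly, assuming a purely hypothetical 50/50 split of wall-clock time between PP and TG, and only TG speeding up:

```
overall speedup = 1 / (pp_share + tg_share / tg_speedup)
                = 1 / (0.5 + 0.5 / 1.5) ≈ 1.2x   (~20% for a 1.5x TG gain)
                = 1 / (0.5 + 0.5 / 2.0) ≈ 1.33x  (~33% for a 2x TG gain)
```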
nickm_27@reddit
It's not half, especially with prompt caching; it depends on what you're using the model for.
ethertype@reddit
For the math nerds: does 2x performance count as an order of magnitude when counting in binary?
ImpressiveSuperfluit@reddit
By definition, yes.
Maleficent-Ad5999@reddit
I’m no expert.. I’m speaking solely from my experience with Qwen 3.6 27B model, the best available local model which is now my daily driver.
Previously it used to generate at 50-55 tps with llama.cpp, which is usable for chat.. but when I heard about MTP, I tried vLLM for the first time and was blown away to see 110-120 tps, which is so good and enabled me to do proper agentic coding.
I even finished a complete project with just this setup and a single 5090.
The token generation speed is nearly 2x now.
I think the main reason why it is a game changer is the architecture.. generating a new token is costly, but predicting the next couple of tokens while generating the first token and then validating them is cheaper.. saves a lot of compute.. I came across this comment: "calculating the factors of a number is labour intensive, but verifying a factor is cheap and quick".
Xonzo@reddit
50% to 100% faster is a HUGE upgrade….? It could be an entirely different hardware class. Like my 5070 Ti / 5060 Ti going from 1100 PP / 30 TG to 1100 / 60 is huge on Qwen 27B. For agentic use I really start to feel it when PP is < 1,000 and TG < 40.
chocofoxy@reddit
now turboQuant please
dampflokfreund@reddit
Have they fixed the slower prompt processing?
soyalemujica@reddit
I don't think so, atm I am at 160t/s with a 7900XTX
erm_what_@reddit
For what model though
soyalemujica@reddit
Qwen 27B
Pristine-Tax4418@reddit
Is Qwen3.5-122B supported?
Pjotrs@reddit (OP)
Check Unsloth's GGUFs, it is there.
iamapizza@reddit
Any idea what this is implying, why wouldn't we use ngram + MTP together on NVidia GPUs?
CockBrother@reddit
Try it. You'll probably find out what the warning is about fairly quickly. He's saying it's not reliable, not accurate, or both.
Sounds pretty great to be able to use two strategies for token prediction at the same time though.
iamapizza@reddit
Ok will do, thanks. It does sound like a great combo if this could work. Interested to find out why they specifically called out CUDA here. Hope it isn't some huge limitation.
nickm_27@reddit
I believe I saw some comments about it. Just an issue with CUDA memory transfers so it might be slower with both enabled. Fixing this is on their post merge to-do list
gyzerok@reddit
It is implying that people like yourself should not use these combinations
iamapizza@reddit
Sorry, just trying to learn
AppealSame4367@reddit
ik_llama has a commit that already introduced this as well.
An0n_A55a551n@reddit
Now waiting for TurboQuant PR to get merged 😔
Goldandsilverape99@reddit
Hmm, the reason I asked about support for mmproj/vision when using MTP is that I tried a vibe question that failed when the mmproj was active. So it might be "supported", but the model was more stupid when using vision.
Goldandsilverape99@reddit
I got it to work, using a rebuilt mainline llama.cpp and a re-downloaded mmproj file... so it actually seems to work now...
markole@reddit
Is this Qwen3.6 only for now? Interested to see Gemma 4 MTP support.
nickm_27@reddit
Yes, Aman said Gemma will follow soon
OldComposerbruh@reddit
Super!! Let’s go
CreekyK@reddit
Letsgoo!
kwizzle@reddit
This is amazing news
Strilion@reddit
Seems like it just got merged
b0tm0de@reddit
It is merged. Waiting for it to compile. Now I'm starting to download my new models :)
BlackBeardAI@reddit
Bring it on!
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
initalSlide@reddit
Yahaaaa finally
Borkato@reddit
Bro is a Korok
IKerimI@reddit
Merge it!
rossimo@reddit
we are just all losing it over here
deaday@reddit
Ship it!