Time to update llama.cpp to get some MTP improvements!
Posted by PixelatedCaffeine@reddit | LocalLLaMA | View on Reddit | 74 comments
sultan_papagani@reddit
GUYS what if instead of everyone running LLMs themselves and struggling with hardware, we all just agreed to run the best open-source SOTA model and, like Bitcoin mining, all our computers worked together in harmony to serve us one local SOTA model :p
it would free us from updating llama.cpp every day too!!!
...besides the joke, can we run the MTP model on the iGPU so the CPU + GPU can work on the bigger model?
sultan_papagani@reddit
lmao im getting downvoted for a joke. i didnt know openclaw slop agents could downvote posts wow guys really creative
caetydid@reddit
I actually like your idea. This would finally be a serious use case for crypto.
Queasy-Contract9753@reddit
Horde
Internal_Werewolf_48@reddit
> we all just agreed to run the best open-source SOTA model and, like Bitcoin mining, all our computers worked together in harmony to serve us one local SOTA model
You'd have to implement payments or transactional accounting and fairness somehow, otherwise it would end up the same as bittorrent where a few bad leechers would kill any public instance. I also trust a random LLM compute participant even less than openrouter providers to not inject nefarious instructions mid-stream. It's too juicy of a target for malicious actors and the only alternative would be wasteful parallel consensus being computed. Single node consumer GPUs or [mostly] trusted centralized paid services are just easier.
Charming-Author4877@reddit
Only Qwen and Gemma are supported I think.
Also you need to get a fresh GGUF file with MTP support, the older ones do not have the tensors included.
Willing-Toe1942@reddit
do you have link for gemma gguf with mtp ?
Charming-Author4877@reddit
Nope, I do not find gemma4 a good model. Qwen is so much better.
I'm not sure if gemma support is included by now.
I recommend taking a look at Hugging Face; unsloth has updated their models with a new release adding "-MTP".
macboller@reddit
not sure why you are downvoted.
Gemma 4’s hybrid attention makes long-context weaker than a Qwen. It can use long context, but distant information would be less robust.
Borkato@reddit
Gemma blows qwen out of the water for humanlike prose and rp, but for code and agents qwen blows Gemma out of the water lol. It’s kinda crazy how much they complement each other
Stastez@reddit
I'm still wondering whether I'm using Qwen wrong. In Roo Code, Gemma 4 is much more stable with regards to tool calling and using the inbuilt features (to-do lists, sub-tasks) in my experience. I'm using both at Q6_K from unsloth.
DonkeyBonked@reddit
It's very easy to get bad results with Qwen. They're great models, but sensitive in certain aspects of their architecture. Like with kv, the key is really sensitive to quantization, but the value isn't, so you can set the kv to fp16 for k and turbo-3 for v and it's still better than Q8/Q8. When I used quantized k, I had so many tool call errors I actually started getting sick of seeing the Cline message telling me how it used advanced commands and worked better with advanced models like Claude.
Gemma models are made for this stuff, so Gemma + MTP + TurboQuant screams and doesn't seem to have any problems. Which considering both technologies were released by the company that made the model, seems reasonable.
In an equal environment though, I would definitely take Qwen over Gemma for coding. Qwen does have a special niche though in that it can be so awkward with how it tries to communicate emotional stuff that it's kind of funny and a little cute.
Gemma definitely handles interpersonal better though, I don't even think that's comparable.
LikeSaw@reddit
Thats what I thought also when I used Roo Code and then switched to Pi dev and Opencode. I had an Eureka moment ngl. You should try it and see how crazy good qwen 3.6 27b really is
Stastez@reddit
I have installed Pi as a separate user yesterday. The biggest boon to Roo in my opinion was the baked in approval system for any action it wanted to take. Plus I like GUIs.
Borkato@reddit
Very strange
DonkeyBonked@reddit
It's not surprising, I think Gemini is the most human-like of the bigger AI models, so it makes sense their open models based on that same architecture would also be this way.
They have a lot of emotional training in their models, which makes them great to talk to, even if less reliable in technical aspects.
miversen33@reddit
Inversely, it's incredible how well they work together
macboller@reddit
Imagine a MoE with both included that could dynamically select models based on context length or something
Borkato@reddit
Qwemma 3.6.4 when
DonkeyBonked@reddit
I've had pretty reasonable results with Qwen3.6 35B even at 75% full on 1M context.
macboller@reddit
Qwen is really amazing for long context.
coder543@reddit
Qwen3.5 is using hybrid attention... what are you talking about?
macboller@reddit
Gemma 4 uses local sliding-window attention in many layers, with as little as 512-token windows on smaller models and 1024-token windows on larger ones, plus periodic global attention. Its MTP support is valuable for latency, but it does not fix any long-context retrieval or reasoning weakness.
The recent Qwen style models use "Gated DeltaNet" / linear-attention layers plus full attention, which is better optimized for efficient long-context work.
That makes MTP more strategically useful on Qwen if the base model is already stronger at long-context workloads.
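To make the sliding-window point concrete, here is a minimal sketch of the two mask shapes being compared. It is a toy illustration, not Gemma's or llama.cpp's actual code; the function name `attention_mask` and the tiny window size are made up for the example.

```python
def attention_mask(seq_len, window=None):
    """Return a causal mask as a list of lists of bools.

    mask[q][k] is True when query position q may attend to key position k.
    window=None means full (global) causal attention; otherwise each query
    sees only the most recent `window` positions, itself included.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q                      # never look at future tokens
            in_window = window is None or q - k < window
            row.append(causal and in_window)
        mask.append(row)
    return mask


full = attention_mask(8)             # global layer
local = attention_mask(8, window=3)  # sliding-window layer (toy window size)

# How many positions the last token can see directly in each layer type.
print(sum(full[7]), sum(local[7]))
```

In a sliding-window layer the last token cannot read position 0 directly; that information only reaches it through the periodic global layers, which is the long-context trade-off described above.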
GreenPastures2845@reddit
Gemma4 MTP is not supported yet
Far_Course2496@reddit
Google released small assistant models for Gemma 4. I think you use spec decoding using the assistant as a draft model, not mtp.
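For anyone unfamiliar with the draft-model approach, this is roughly how greedy speculative decoding works in general. It is a toy sketch with made-up stand-in "models" (`target_next`, `good_draft`, `flawed_draft` are hypothetical), not how llama.cpp or Gemma's assistant models implement it.

```python
def speculative_step(target_next, draft_next, context, k=4):
    """One round of greedy speculative decoding.

    target_next / draft_next map a list of tokens to the next token.
    Returns the tokens accepted this round (always at least one).
    """
    # 1. The small draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    # 2. The big target model checks each proposed token. Real engines score
    #    all k positions in one batched forward pass; this loop is for clarity.
    accepted = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # keep the target's token, drop the rest
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target adds one bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


# Toy stand-ins: the "target" counts upward mod 10; one draft is perfect,
# the other guesses wrong whenever the last token is 3.
def target_next(tokens):
    return (tokens[-1] + 1) % 10

def good_draft(tokens):
    return (tokens[-1] + 1) % 10

def flawed_draft(tokens):
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


print(speculative_step(target_next, good_draft, [0]))
print(speculative_step(target_next, flawed_draft, [0]))
```

Either way the output matches what the target model would have produced on its own; the draft only changes how many tokens are confirmed per target pass.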
GreenPastures2845@reddit
There is no support yet for that: https://github.com/ggml-org/llama.cpp/issues/23161
Far_Course2496@reddit
Thanks, I saw someone bench it but they must have been using a fork
endlass_imo@reddit
Wonder if this will work with OpenVino on intel HW.
JIGARAYS@reddit
its amazing! went from 41 tps to 100+ tps on 5090. qwen 3.6 27b dense model.
AnticitizenPrime@reddit
The Google Edge Gallery app for Android has also received an update to support MTP. It requires a re-download of the models.
philmarcracken@reddit
google has edging support? (¬‿¬)
cleversmoke@reddit
MTP has been solid for me, went from 27 tok/s to 50 tok/s. Any improvements on top of this is a blessing 🤩
Borkato@reddit
MTP is amazing. I genuinely thought it would be a nothingburger
DR4G0NH3ART@reddit
I am going from single digit tps to double digit. Never have been happier. Slapped my old 1660 ti to sit with my 5070 ti today, now I am at 22 gb having fun with qwen. Huge thanks to the community.
Plasmx@reddit
I could have the exact same setup but my second pcie x16 is at the bottom of the board allowing only for single slot cards…. Feels sad because I’m running out of VRAM with 16 GB. :(
yc22ovmanicom@reddit
oculink, cheap and fast
DR4G0NH3ART@reddit
Buy a riser cable, that's what I bought. Couple of things: buy a good one to avoid signal problems, fire risk, and driver compatibility issues. Depending on which bracket you want to install your card in, plan the length of the cable and the angles; there are 90 degree and normal cables. I bought a thicker one and the problem I faced was that it was hard to route because the cable was thick. But it all works well and I might eventually buy a 5060 ti to spoil myself and replace that 1660 ti for 10 gig extra sweet sweet vram.
netherreddit@reddit
You can get a riser for that situation
DonkeyBonked@reddit
I felt like both MTP and TurboQuant were much needed improvements, and I have been waiting impatiently for both to be available stable together. They really do drastically change what is viable to run.
Borkato@reddit
Ugh I still need to try turboquant. Does it work in mainline now?
DonkeyBonked@reddit
No, they keep talking about it, but it's not in yet. I'm running Tom's build that I fast forward with main to get MTP.
annodomini@reddit
TurboQuant as a whole is not in mainline.
One fairly simple part of it was implemented, as a way to improve quantization performance without making any significant complicated changes: https://github.com/ggml-org/llama.cpp/pull/21038
DonkeyBonked@reddit
So far, I've managed to get Qwen3.6 27B into the mid 60s~ for tokens/s to start, with the best I've seen around 40s~ at 100k and 20s~ at 200k context on 4x 3090s.
It depends on the models, but I'm getting very mixed results using MTP with TurboQuant.
Like just TurboQuant or just MTP seem to be better than both TurboQuant and MTP. I really wish the official fork supported both.
I spent more time than I'm proud of yesterday fast forwarding Tom's fork with the main to get TQ and MTP together, and maybe I screwed something up but the results were not impressive.
etaoin314@reddit
I am on 3 3090s and can get 27B at q8 running in the 80s tps with MTP support, but when I try to add turboquant it seems to lobotomize it and everything starts to go to shit. Maybe I'm doing something wrong, but I gave up on turboquant for now; MTP is plenty for my needs. I'll get back to turbo when it is part of mainline.
DonkeyBonked@reddit
My best results with TurboQuant have been keeping k at fp16 (or whatever the base is) and v at turbo-3, there is a known issue with TQ on k. I've done Q8, but honestly, I find it's just best not to mess with it, but on v it doesn't seem to make any noticeable difference.
What settings are you using to get 80s tps?
If I could get that on Q8 that's what I'd be using.
AdamDhahabi@reddit
80 t/s should now be reachable on 2x 3090 because this commit allows for -sm tensor to be combined with MTP!!
DonkeyBonked@reddit
Without a NVLink?
etaoin314@reddit
i was on vllm as backend and had it with 100k context and I think both the k and v cache were at q8 but now I am doubting myself and mtp was active. I think that is about it, If I remember I can paste in my compose when I get home.
DonkeyBonked@reddit
I'm going to do some testing with it tonight, I would really like to get the speed higher, but I've only broken into the 60s at best and that was with Q4_K_M and MTP on llama.cpp
etaoin314@reddit
yeah I think that is what I was getting on llama.cpp as well, vllm is better in my testing, it was worth ~20tps
Ok-Measurement-1575@reddit
Why would you bother with turboquant with 4 x 3090?
-sm tensor and mtp will see you into the 80t/s.
StardockEngineer@reddit
As of right now, it has been released. Merged 4 hrs ago, last release 16 hrs ago.
cafedude@reddit
I'll wait for it to be in a tag.
cptbeard@reddit
CI pipeline failed jobs are being rerun now https://github.com/ggml-org/llama.cpp/actions/runs/26097391103/job/76816720480#logs
jeekp@reddit
heck yeah! Ran a quick comparison:
GPU: RTX 5090 (400W Power Limited)
Context: 40K Token Prompt
Model: Qwen 3.6 27B Unsloth Q6_K
llama.cpp version: 9237
Results:
Prompt Processing: 1922 t/s -> 1653 t/s (0.86x, about 14% slower)
Token Output: went from 41.11 t/s -> 78.15 t/s (1.9x faster)
Total Duration: 3m31s -> 2m03s (1.72x faster)
Is PP meant to be slower with MTP, or is this a GGUF / llama.cpp issue?
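For reference, the ratios can be recomputed from the raw numbers like this. It is a quick sketch using only the figures posted above; the helper names are made up for the example.

```python
def throughput_ratio(before_tps, after_tps):
    """Ratio of new to old throughput; above 1 means faster."""
    return after_tps / before_tps

def duration_speedup(before_s, after_s):
    """Ratio of old to new wall-clock time; above 1 means faster."""
    return before_s / after_s


pp = throughput_ratio(1922, 1653)                      # prompt processing
tg = throughput_ratio(41.11, 78.15)                    # token output
total = duration_speedup(3 * 60 + 31, 2 * 60 + 3)      # 3m31s -> 2m03s

print(f"PP {pp:.2f}x, output {tg:.2f}x, total {total:.2f}x")
```

The total speedup sits below the output speedup because the 40K-token prompt processing phase got slightly slower and is included in the wall-clock time.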
Sisaroth@reddit
i'm new to local models and agentic coding. I was trying Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL MTP with llama.cpp and cline but it kept looping over very basic things. Like tests failed and it kept trying to run the tests again with no changes.
ollama with default qwen3.6 however was working very well on the other hand, just much less tokens/s.
Ok-Measurement-1575@reddit
This quant no longer calls tools for me on the latest builds.
Dunno if it's chat template or the quant itself.
Pretty annoying.
Sisaroth@reddit
I changed too many variables at once I think. I'm now trying llama.cpp with the exact same quant that worked so well in ollama (Q4_K_M), will see tomorrow if it works.
Just a bit frustrating that I see many people say not to use ollama, but it just works. While I have struggled a whole day now to get anything good out of llama.cpp.
Borkato@reddit
Are you using the official jinja template?
zkkzkk32312@reddit
Might need to find a proper Jinja template to use
higglesworth@reddit
Trying to run Qwen3.6 27b (unsloth MTP gguf) with MTP enabled from latest pull and it's just giving me a line of 'thinking' (which appears to be Chinese?) and no actual output. I see in the llama-server logs "forcing full prompt re-processing due to lack of cache data" over and over. Does anyone have any idea of what this thing is doing?
StardockEngineer@reddit
Do any of these look like your problem? https://github.com/ggml-org/llama.cpp/issues?q=is%3Aissue%20state%3Aopen%20forcing%20full%20prompt%20re-processing%20due%20to%20lack%20of%20cache%20data
higglesworth@reddit
Yeah they could be…I’m running a sycl built locally. Haven’t had a lot of time to mess with it today but I’ll try a vulkan build later and also with removing the mtp draft args from the server launch
Borkato@reddit
I’ve had that warning message for weeks and mostly ignored it and it’s been fine, double check other settings?
StardockEngineer@reddit
Eh that isn't fine. It means it's reprocessing the whole conversation from scratch.
Borkato@reddit
Oh, I meant fine as in “I don’t think that’s related to it thinking in Chinese and not outputting” haha.
If you do find a fix for that warning though I would love that!
Queasy-Contract9753@reddit
I got that a lot but more so when using their webUI. I'm not sure if I'm imagining things or if a UI alone can do that but when I use other clients it doesn't happen nearly as much. Qwen 3.5 0.8b and 2b in my case.
quasoft@reddit
Was going to make a post about it, but will instead just ask here.
Is there some list/collection of what models are actually supported by the new llama.cpp MTP implementation right now?
What I figured is the newer Qwen models are already working and have compatible quants from unsloth and bartowski.
What else?
Didn't see anyone using it with Gemma 4 yet.
miversen33@reddit
Gemma 4 MTP is different (MTP heads are in a separate model) than Qwen 3.6 MTP
https://github.com/ggml-org/llama.cpp/issues/23161
xoovs@reddit
Has anyone managed to utilise MTP with SYCL?
blackhawk00001@reddit
I have to benchmark AGAIN?
our_sole@reddit
Does this mean the gh llama.cpp releases page has the binary with mtp support?
PixelatedCaffeine@reddit (OP)
that one was already there! what got merged now is a PR with some MTP cleanups. the binary is not on the releases page yet, but we can pull from the repo and build from source to get it already
Low-Alarm272@reddit
How to do it? I have llama.cpp shall I just update it?
Also, I want to know if the prebuilt builds already have MTP or not? (Like Ubuntu Vulkan)