Time to update llama.cpp to get some MTP improvements!

[-]

sultan_papagani@reddit

GUYS what if instead of everyone running LLMs themselves and struggling with hardware, we all just agreed to run the best open-source SOTA model and, like Bitcoin mining, all our computers worked together in harmony to serve us one local SOTA model :p

it would free us from updating llama.cpp every day too!!!

...besides the joke, can we run the MTP model on the iGPU so the CPU + GPU can work on the bigger model?

[-]

sultan_papagani@reddit

lmao I'm getting downvoted for a joke. I didn't know openclaw slop agents could downvote posts, wow guys, really creative

[-]

caetydid@reddit

I actually like your idea. This would finally be a serious use case for crypto.

[-]

Queasy-Contract9753@reddit

Horde

[-]

Internal_Werewolf_48@reddit

> we all just agreed to run the best open-source SOTA model and, like Bitcoin mining, all our computers worked together in harmony to serve us one local SOTA model

You'd have to implement payments or transactional accounting and fairness somehow, otherwise it would end up the same as bittorrent where a few bad leechers would kill any public instance. I also trust a random LLM compute participant even less than openrouter providers to not inject nefarious instructions mid-stream. It's too juicy of a target for malicious actors and the only alternative would be wasteful parallel consensus being computed. Single node consumer GPUs or [mostly] trusted centralized paid services are just easier.

[-]

Charming-Author4877@reddit

Only Qwen and Gemma are supported I think.
Also you need to get a fresh GGUF file with MTP support, the older ones do not have the tensors included.

[-]

Willing-Toe1942@reddit

Do you have a link for a Gemma GGUF with MTP?

[-]

Charming-Author4877@reddit

Nope, I do not find gemma4 a good model. Qwen is so much better.
I'm not sure if gemma support is included by now.
I recommend taking a look at Hugging Face; unsloth has updated their models with a new release adding "-MTP".

[-]

macboller@reddit

not sure why you are downvoted.

Gemma 4’s hybrid attention makes its long-context performance weaker than Qwen's. It can use long context, but retrieval of distant information is less robust.

[-]

Borkato@reddit

Gemma blows qwen out of the water for humanlike prose and rp, but for code and agents qwen blows Gemma out of the water lol. It’s kinda crazy how much they complement each other

[-]

Stastez@reddit

I'm still wondering whether I'm using Qwen wrong. In Roo Code, Gemma 4 is much more stable with regards to tool calling and using the inbuilt features (to-do lists, sub-tasks) in my experience. I'm using both at Q6_K from unsloth.

[-]

DonkeyBonked@reddit

It's very easy to get bad results with Qwen. They're great models, but sensitive in certain aspects of their architecture. Like with kv, the key is really sensitive to quantization, but the value isn't, so you can set the kv to fp16 for k and turbo-3 for v and it's still better than Q8/Q8. When I used quantized k, I had so many tool call errors I actually started getting sick of seeing the Cline message telling me how it used advanced commands and worked better with advanced models like Claude.

Gemma models are made for this stuff, so Gemma + MTP + TurboQuant screams and doesn't seem to have any problems. Which considering both technologies were released by the company that made the model, seems reasonable.

In an equal environment though, I would definitely take Qwen over Gemma for coding. Qwen does have a special niche though, in that it can be so awkward in how it tries to communicate emotional stuff that it's kind of funny and a little cute.

Gemma definitely handles interpersonal better though, I don't even think that's comparable.

[-]

LikeSaw@reddit

That's what I thought too when I used Roo Code, and then I switched to Pi dev and Opencode. I had a eureka moment ngl. You should try it and see how crazy good qwen 3.6 27b really is

[-]

Stastez@reddit

I have installed Pi as a separate user yesterday. The biggest boon to Roo in my opinion was the baked in approval system for any action it wanted to take. Plus I like GUIs.

[-]

Borkato@reddit

Very strange

[-]

DonkeyBonked@reddit

It's not surprising, I think Gemini is the most human-like of the bigger AI models, so it makes sense their open models based on that same architecture would also be this way.

They have a lot of emotional training in their models, which makes them great to talk to, even if less reliable in technical aspects.

[-]

miversen33@reddit

Inversely, it's incredible how well they work together

[-]

macboller@reddit

Imagine a MoE with both included that could dynamically select models based on context length or something

[-]

Borkato@reddit

Qwemma 3.6.4 when

[-]

DonkeyBonked@reddit

I've had pretty reasonable results with Qwen3.6 35B even at 75% full on 1M context.

[-]

macboller@reddit

Qwen is really amazing for long context.

[-]

coder543@reddit

Qwen3.5 is using hybrid attention... what are you talking about?

[-]

macboller@reddit

Gemma 4 uses local sliding-window attention in many layers, with as little as 512-token windows on smaller models and 1024-token windows on larger ones, plus periodic global attention. Its MTP support is valuable for latency, but it does not fix any long-context retrieval or reasoning weakness.

The recent Qwen style models use "Gated DeltaNet" / linear-attention layers plus full attention, which is better optimized for efficient long-context work.

That makes MTP more strategically useful on Qwen if the base model is already stronger at long-context workloads.
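To make the sliding-window point concrete, here is a minimal sketch of how a local layer and a global layer differ in what each token can see. It is an illustration only: the function names, the tiny window of 4, and the "every third layer is global" layout are assumptions chosen for readability, not Gemma 4's actual configuration.

```python
def causal_mask(seq_len, window=None):
    """Return mask[i][j] == True when query i may attend to key j.

    window=None means full (global) causal attention; otherwise only the
    last `window` positions (including position i itself) are visible.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i and (window is None or i - j < window)
            row.append(visible)
        mask.append(row)
    return mask


def layer_windows(num_layers, window, global_every):
    """Hypothetical layout: every `global_every`-th layer is global."""
    return [None if (layer + 1) % global_every == 0 else window
            for layer in range(num_layers)]


if __name__ == "__main__":
    print(layer_windows(num_layers=6, window=4, global_every=3))
    # [4, 4, None, 4, 4, None]

    local = causal_mask(8, window=4)
    full = causal_mask(8)
    # The last token sees 4 keys in a local layer and all 8 in a global one.
    print(sum(local[7]), sum(full[7]))  # 4 8
```

In a local layer, anything older than the window is invisible to the current token, so distant information can only reach it through the periodic global layers.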

[-]

GreenPastures2845@reddit

Gemma4 MTP is not supported yet

[-]

Far_Course2496@reddit

Google released small assistant models for Gemma 4. I think you use spec decoding using the assistant as a draft model, not mtp.
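For anyone unfamiliar with the draft-model idea, here is a toy sketch of greedy draft-and-verify speculative decoding. It is not llama.cpp's implementation: the "models" are plain Python functions made up for the example, and a real engine verifies all drafted positions in one batched forward pass instead of a loop.

```python
def speculative_step(target_next, draft_next, context, k):
    """One greedy draft-and-verify step.

    target_next / draft_next map a list of tokens to the next token.
    Returns the tokens appended in this step (always at least one).
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    # 2. The target model checks each proposed position in order.
    accepted = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong guess
            return accepted
        accepted.append(tok)

    # 3. Every guess matched, so the target contributes one bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(target_next, draft_next, prompt, n_tokens, k=3):
    """Generate n_tokens and count how many verify steps were needed."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target_next, draft_next, out, k))
        steps += 1
    return out[len(prompt):len(prompt) + n_tokens], steps


# Toy "models": the target counts upward mod 10; the draft agrees except
# that it guesses wrong whenever the previous token is 4.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

if __name__ == "__main__":
    tokens, steps = generate(target, draft, prompt=[0], n_tokens=12, k=3)
    plain = [(i + 1) % 10 for i in range(12)]
    print(tokens == plain, steps)  # True 4
```

The output matches what the target alone would produce; the gain is that 12 tokens take 4 verify steps instead of 12, and every wrong draft guess shrinks that gain.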

[-]

GreenPastures2845@reddit

There is no support yet for that: https://github.com/ggml-org/llama.cpp/issues/23161

[-]

Far_Course2496@reddit

Thanks, I saw someone bench it but they must have been using a fork

[-]

endlass_imo@reddit

Wonder if this will work with OpenVino on intel HW.

[-]

JIGARAYS@reddit

It's amazing! Went from 41 tps to 100+ tps on a 5090 with the qwen 3.6 27b dense model.

[-]

AnticitizenPrime@reddit

The Google Edge Gallery app for Android has also received an update to support MTP. It requires a re-download of the models.

[-]

philmarcracken@reddit

google has edging support? (¬‿¬)

[-]

cleversmoke@reddit

MTP has been solid for me, went from 27 tok/s to 50 tok/s. Any improvements on top of this are a blessing 🤩

[-]

Borkato@reddit

MTP is amazing. I genuinely thought it would be a nothingburger

[-]

DR4G0NH3ART@reddit

I am going from single digit tps to double digit. Never have been happier. Slapped my old 1660 ti to sit with my 5070 ti today, now I am at 22 gb having fun with qwen. Huge thanks to the community.

[-]

Plasmx@reddit

I could have the exact same setup but my second pcie x16 is at the bottom of the board allowing only for single slot cards…. Feels sad because I’m running out of VRAM with 16 GB. :(

[-]

yc22ovmanicom@reddit

oculink, cheap and fast

[-]

DR4G0NH3ART@reddit

Buy a riser cable, that's what I bought. A couple of things: buy a good one to avoid signal problems, fire risk, and driver compatibility issues. Plan the cable length and angles around the bracket where you want to install the card; there are 90-degree and normal cables. I bought a thicker one, and the problem I faced was that it was hard to route. But it all works well, and I might eventually buy a 5060 ti to spoil myself and replace that 1660 ti for 10 gigs of extra sweet, sweet vram.

[-]

netherreddit@reddit

You can get a riser for that situation

[-]

DonkeyBonked@reddit

I felt like both MTP and TurboQuant were much needed improvements, and I have been waiting impatiently for both to be available stable together. They really do drastically change what is viable to run.

[-]

Borkato@reddit

Ugh I still need to try turboquant. Does it work in mainline now?

[-]

DonkeyBonked@reddit

No, they keep talking about it, but it's not in yet. I'm running Tom's build that I fast forward with main to get MTP.

[-]

annodomini@reddit

TurboQuant as a whole is not in mainline.

One fairly simple part of it was implemented, as a way to improve quantization performance without making any significant complicated changes: https://github.com/ggml-org/llama.cpp/pull/21038

[-]

DonkeyBonked@reddit

So far, I've managed to get Qwen3.6 27B into the mid 60s~ for tokens/s to start, with the best I've seen around 40s~ at 100k and 20s~ at 200k context on 4x 3090s.

It depends on the models, but I'm getting very mixed results using MTP with TurboQuant.

Like just TurboQuant or just MTP seem to be better than both TurboQuant and MTP. I really wish the official fork supported both.

I spent more time than I'm proud of yesterday fast forwarding Tom's fork with the main to get TQ and MTP together, and maybe I screwed something up but the results were not impressive.

[-]

etaoin314@reddit

I am on 3 3090s and can get 27B at q8 running in the 80s tps with MTP support, but when I try to add turboquant it seems to lobotomize the model and everything starts to go to shit. Maybe I'm doing something wrong, but I gave up on turboquant for now; MTP is plenty for my needs. I'll get back to turbo when it is part of mainline.

[-]

DonkeyBonked@reddit

My best results with TurboQuant have been keeping k at fp16 (or whatever the base is) and v at turbo-3; there is a known issue with TQ on k. I've tried Q8 on k, but honestly I find it's best not to mess with k at all, whereas on v it doesn't seem to make any noticeable difference.

What settings are you using to get 80s tps?

If I could get that on Q8 that's what I'd be using.

[-]

AdamDhahabi@reddit

80 t/s should now be reachable on 2x 3090 because this commit allows for -sm tensor to be combined with MTP!!

[-]

DonkeyBonked@reddit

Without NVLink?

[-]

etaoin314@reddit

I was on vllm as the backend with 100k context, and I think both the k and v cache were at q8 (though now I am doubting myself), with mtp active. I think that is about it. If I remember, I can paste in my compose when I get home.

[-]

DonkeyBonked@reddit

I'm going to do some testing with it tonight, I would really like to get the speed higher, but I've only broken into the 60s at best and that was with Q4_K_M and MTP on llama.cpp

[-]

etaoin314@reddit

yeah I think that is what I was getting on llama.cpp as well, vllm is better in my testing, it was worth ~20 tps

[-]

Ok-Measurement-1575@reddit

Why would you bother with turboquant with 4 x 3090?

-sm tensor and mtp will see you into the 80t/s.

[-]

StardockEngineer@reddit

As of right now, it has been released. Merged 4 hrs ago, last release 16 hrs ago.

[-]

cafedude@reddit

I'll wait for it to be in a tag.

[-]

cptbeard@reddit

The failed CI pipeline jobs are being rerun now: https://github.com/ggml-org/llama.cpp/actions/runs/26097391103/job/76816720480#logs

[-]

jeekp@reddit

heck yeah! Ran a quick comparison:

GPU: RTX 5090 (400W Power Limited)
Context: 40K Token Prompt
Model: Qwen 3.6 27B Unsloth Q6_K
llama.cpp version: 9237

Results:
Prompt Processing: 1922 t/s -> 1653 t/s (0.86x, about 14% slower)
Token Output: 41.11 t/s -> 78.15 t/s (1.9x faster)
Total Duration: 3m31s -> 2m03s (1.72x faster)

Is PP meant to be slower with MTP, or is this a GGUF / llama.cpp issue?
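For anyone who wants to redo the ratios on their own runs, here is a small sketch that reproduces the arithmetic from the numbers above. The helper names are made up for the example, and treating "40K" as exactly 40,000 tokens is an assumption.

```python
def parse_duration(text):
    """Convert a string like '3m31s' into seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


def compare(before, after):
    """Ratio after/before: above 1 means the number went up."""
    return after / before


# Numbers copied from the comparison above.
pp = compare(1922, 1653)      # prompt processing, t/s
tg = compare(41.11, 78.15)    # token output, t/s
total = parse_duration("3m31s") / parse_duration("2m03s")  # wall-clock speedup

# Assumption: "40K" is treated as exactly 40,000 prompt tokens.
prompt_tokens = 40_000
extra_pp_seconds = prompt_tokens / 1653 - prompt_tokens / 1922

print(f"PP: {pp:.2f}x throughput ({(1 - pp) * 100:.0f}% slower)")
print(f"TG: {tg:.2f}x throughput")
print(f"Total: {total:.2f}x faster")
print(f"Extra prompt time: {extra_pp_seconds:.1f} s")
```

Under that assumption the slower prompt processing costs roughly 3.4 seconds on this prompt, while the total run is 88 seconds shorter, so generation speed dominates the result.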

[-]

Sisaroth@reddit

I'm new to local models and agentic coding. I was trying Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL MTP with llama.cpp and cline, but it kept looping over very basic things. For example, when tests failed, it kept trying to run them again with no changes.

ollama with the default qwen3.6, on the other hand, was working very well, just with far fewer tokens/s.

[-]

Ok-Measurement-1575@reddit

This quant no longer calls tools for me on the latest builds.

Dunno if it's chat template or the quant itself.

Pretty annoying.

[-]

Sisaroth@reddit

I changed too many variables at once, I think. I'm now trying llama.cpp with the exact same quant that worked so well in ollama (Q4_K_M), will see tomorrow if it works.

It's just a bit frustrating that I see many people say not to use ollama when it just works, while I have struggled a whole day now to get anything good out of llama.cpp.

[-]

Borkato@reddit

Are you using the official jinja template?

[-]

zkkzkk32312@reddit

Might need to find a proper Jinja template to use

[-]

higglesworth@reddit

Trying to run Qwen3.6 27b (unsloth MTP gguf) with MTP enabled from the latest pull, and it's just giving me a line of 'thinking' (which appears to be Chinese?) and no actual output. I see in the llama-server logs "forcing full prompt re-processing due to lack of cache data" over and over. Does anyone have any idea what this thing is doing?

[-]

StardockEngineer@reddit

Do any of these look like your problem? https://github.com/ggml-org/llama.cpp/issues?q=is%3Aissue%20state%3Aopen%20forcing%20full%20prompt%20re-processing%20due%20to%20lack%20of%20cache%20data

[-]

higglesworth@reddit

Yeah they could be…I’m running a sycl built locally. Haven’t had a lot of time to mess with it today but I’ll try a vulkan build later and also with removing the mtp draft args from the server launch

[-]

Borkato@reddit

I’ve had that warning message for weeks and mostly ignored it, and it’s been fine. Double check other settings?

[-]

StardockEngineer@reddit

Eh that isn't fine. It means it's reprocessing the whole conversation from scratch.
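As a rough picture of why that matters, here is a minimal sketch of prefix reuse: if the cached tokens are still a prefix of the new prompt, only the new tail needs a forward pass. This is a simplified model of the idea with made-up helper names, not llama-server's actual cache logic.

```python
def common_prefix_len(cached, new):
    """Number of leading tokens shared by the cached and the new prompt."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def tokens_to_process(cached, new):
    """Tokens that still need a forward pass if the shared prefix is reused."""
    return len(new) - common_prefix_len(cached, new)


if __name__ == "__main__":
    history = list(range(1000))        # stand-in for a long conversation
    next_turn = history + [7, 8, 9]    # same conversation plus a new message

    print(tokens_to_process(history, next_turn))  # 3 with a usable cache
    print(tokens_to_process([], next_turn))       # 1003 with no cache data
```

With a long agentic conversation, the second case means paying full prompt-processing time on every turn.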

[-]

Borkato@reddit

Oh, I meant fine as in “I don’t think that’s related to it thinking in Chinese and not outputting” haha.

If you do find a fix for that warning though I would love that!

[-]

Queasy-Contract9753@reddit

I got that a lot but more so when using their webUI. I'm not sure if I'm imagining things or if a UI alone can do that but when I use other clients it doesn't happen nearly as much. Qwen 3.5 0.8b and 2b in my case.

[-]

quasoft@reddit

Was going to make a post about it, but will instead just ask here.

Is there some list/collection of what models are actually supported by the new llama.cpp MTP implementation right now?

What I figured is the newer Qwen models are already working and have compatible quants from unsloth and bartowski.

What else?

Didn't see anyone using it with Gemma 4 yet.

[-]

miversen33@reddit

Gemma 4 MTP is different from Qwen 3.6 MTP (the MTP heads are in a separate model):

https://github.com/ggml-org/llama.cpp/issues/23161

[-]

PixelatedCaffeine@reddit (OP)

that one was already there! what got merged now is a PR with some MTP cleanups. the binary is not on the releases page yet, but we can pull from the repo and build from source to get it already

[-]

Low-Alarm272@reddit

How do I do it? I have llama.cpp, shall I just update it?

Also, I want to know whether the pre-built binaries already have MTP or not (like Ubuntu Vulkan)?