tecneeq@reddit
Hoping for a 3.6 35b-a3b FP16 now for my Strix Halo 😄
SmartCustard9944@reddit
How fast does it run? I found the Q8 with MTP to be the best for speed and intelligence
tecneeq@reddit
It's slow, 30 t/s. I get 70 t/s with Q6 and MTP. Would love 40 t/s with FP16 and MTP.
Manaberryio@reddit
What OS do you run on your Strix Halo? Linux?
tecneeq@reddit
Proxmox, based on Debian 13, it's a Linux distribution.
I run llama.cpp inside LXC (a container). Also runs my home server stuff in a VM.
Wise-Hunt7815@reddit
cooool, I like this mini rack 😄
tecneeq@reddit
Not even pricey, paid 110€. Tec Mojo 10" 12U white at Amazon.
It's fun and keeps everything tidy.
bigend_hubertus@reddit
I also have the Bosgame. Also running Proxmox. Are you running this on Vulkan or ROCm? I can't seem to get it working on Vulkan.
tecneeq@reddit
I have MTP branch compiled with ROCm.
SmartCustard9944@reddit
Not OP, but I have MTP branch compiled for Vulkan
oShievy@reddit
After 100,000 ctx, are pp and tg much slower? Considering buying a Strix Halo machine soon.
SmartCustard9944@reddit
I can attest that close to max context 256k it still runs at 30 tok/s
tecneeq@reddit
Yes, it's slower. Depends on the model too - with Qwen 3.5/3.6 or Gemma 4 it's not so much that I notice.
iRanduMi@reddit
I'm also running a Strix Halo. I was personally more excited to get 27B in a more usable state since 35b-a3b is already in pretty great condition. Any reason you're particularly looking forward to FP16?
WantDollarsPlease@reddit
Based on his pic above he's using an external GPU.
joxes_crypto@reddit
I've been testing Qwen3.6 27B MTP locally with llama.cpp on a dual RTX 3090 setup and wanted to share some numbers. The main thing I wanted to compare was:
froggeric/Qwen3.6-27B-MTP-GGUF (Qwen3.6-27B-Q8_0-mtp.gguf) vs unsloth/Qwen3.6-27B-MTP-GGUF (Qwen3.6-27B-UD-Q8_K_XL.gguf). Both were tested with the same 32K context and the same prompt.
Hardware
2× RTX 3090, as detected by nvidia-smi / llama.cpp.
llama.cpp builds used
For the regular MTP model I used a llama.cpp MTP build where the option is --spec-type mtp. For the am17an/llama.cpp mtp-clean branch, the equivalent option is --spec-type draft-mtp. So depending on the branch/build, the option name is different.
Model 1: froggeric Qwen3.6 27B Q8_0 MTP
Repo: froggeric/Qwen3.6-27B-MTP-GGUF
Model file: Qwen3.6-27B-Q8_0-mtp.gguf
Result: ~42.96 tok/s generation at 32K context.
Model 2: Unsloth Qwen3.6 27B UD-Q8_K_XL MTP
Repo: unsloth/Qwen3.6-27B-MTP-GGUF
Model file: Qwen3.6-27B-UD-Q8_K_XL.gguf
This was tested from the am17an/llama.cpp mtp-clean branch, where MTP is launched as --spec-type draft-mtp. Same prompt, same 32K context.
Result: ~23.51 tok/s generation.
Comparison
The surprising part is that the Unsloth UD-Q8_K_XL version had a much higher draft acceptance rate, but was still much slower overall.
In my setup: 42.96 / 23.51 = ~1.83× faster.
So the Q8_0-mtp version was around 83% faster than the UD-Q8_K_XL version in this test.
My interpretation
The UD-Q8_K_XL model probably preserves quality better, but for MTP throughput it did not perform well on my setup. The Q8_0-mtp model generated many more draft tokens and accepted a smaller proportion of them, but still achieved much better final throughput.
So for my use case:
Best for OpenCode / coding agent / interactive use: froggeric Qwen3.6-27B-Q8_0-mtp.gguf
Best theoretical quality: Unsloth UD-Q8_K_XL, but the speed penalty is large
Extra observation
I also tested the Q8_0-mtp model at larger context before, around 131K, and got roughly the same generation speed:
~43.20 tok/s at 131K
~42.96 tok/s at 32K
For this particular short prompt + 1200-token generation test, the max context setting did not noticeably affect generation speed.
Notes / caveats
This is only one machine and one prompt.
I did not do quality evaluation here, only throughput.
MTP + vision / mmproj should be treated separately; I did not mix --mmproj with MTP for these tests.
-np 1 is important for this MTP setup.
Different llama.cpp MTP branches use different option names: the MTP PR build uses --spec-type mtp, while am17an/llama.cpp mtp-clean uses --spec-type draft-mtp.
For now, my conclusion is simple: for a fast local coding model on 2× RTX 3090, Qwen3.6-27B-Q8_0-mtp is the winner.
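Since the option name is the main gotcha, here is roughly what the two launches look like side by side. This is only a sketch: the model paths, context size and draft count are placeholders, and the MTP flags are exactly the ones quoted in this thread, so they may change as the branches evolve.

# Mainline MTP PR build: MTP is selected with --spec-type mtp
./llama-server \
  -m Qwen3.6-27B-Q8_0-mtp.gguf \
  -c 32768 -fa on -np 1 \
  --spec-type mtp --spec-draft-n-max 2

# am17an/llama.cpp mtp-clean branch: same idea, different option name
./llama-server \
  -m Qwen3.6-27B-UD-Q8_K_XL.gguf \
  -c 32768 -fa on -np 1 \
  --spec-type draft-mtp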
fahrenhe1t@reddit
Are the llama.cpp changes to support MTP imminent? Curious what command line options would be required to enable MTP...
Then-Topic8766@reddit
--spec-type mtp --spec-draft-n-max 2 works for me on the fork...
fahrenhe1t@reddit
Thanks blokes; switched to Qwen3.6-27B-GGUF-MTP running Llama.cpp+MTP and got a 48% increase in tokens/sec on my 3090:
| Config | Tok/s | Notes |
|---|---|---|
| Mainline Autoregressive | ~37 | Original setup, one token per forward pass |
| MTP n=5 | ~33–36 | Regression — too many drafts for acceptance rate |
| MTP n=4 | 53.0 | Essentially identical to n=3, within noise |
| MTP n=3, -c 196608 | 53.4 | — |
| MTP n=3, -c 153600 ✓ | 54.8 | **Optimal — 48% improvement over baseline** |
GrungeWerX@reddit
What quant?
fahrenhe1t@reddit
Qwen3.6-27B-UD-Q4_K_XL.gguf
GrungeWerX@reddit
I figured it was Q4. All the tests I read about are Q4.
fahrenhe1t@reddit
Yeah I only have 24GB VRAM in this 3090. It's tight, but with a 150k context, I'm sitting at 20-22GB usage.
ArtyfacialIntelagent@reddit
Not imminent but preliminary support for Qwen only is pretty close. Here's the status directly from the horse's mouth:
https://github.com/am17an/llama.cpp/pull/6#issuecomment-4421288279
https://github.com/am17an/llama.cpp/pull/6#issuecomment-4421528012 https://github.com/am17an/llama.cpp/pull/7 https://github.com/ggml-org/llama.cpp/pull/22838
The new llama.cpp infrastructure in #22838 was merged 15 minutes ago, and ggerganov added support for np > 1 in am17an's MTP fork. He still wants to check the prompt prefill code and the GGUF loading UI.
AppealSame4367@reddit
ik_llama mtp is faster than the PR of llama.cpp at the moment, by the way.
And you can use Hadamard quants -> something like turboquants.
VoidAlchemy@reddit
I have an IQ4_KS with MTP tensors for ik's fork which is my daily driver on single 3090TI 24GB VRAM: https://huggingface.co/ubergarm/Qwen3.6-27B-GGUF/blob/main/Qwen3.6-27B-MTP-IQ4_KS.gguf
AppealSame4367@reddit
Is it faster than unsloths MTP variant? With unsloth i get 1 tps less on my already very very slow setup (1-3 tps on rtx 2060 6gb vram, lol)
VoidAlchemy@reddit
You'd have to try it to be sure. You can run your Unsloth quant on ik first to see how that performs; ik tends to do a little better for hybrid CPU+GPU workloads, depending on your CPU.
You want to keep all the KV cache and MTP in VRAM, I'm guessing - quite tight in 6GB! You're probably better off running the Qwen3.6 MoE, since with only A3B active your CPU/RAM can probably get more than 3 tok/sec.
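If you do go the MoE route on a 6GB card, here's a minimal sketch of that split. The model name is taken from the recipe further down in this thread; -ot/--override-tensor is the usual way to pin the expert tensors to CPU RAM while attention, KV cache and the MTP layer stay on the GPU. The exact pattern, context size and draft settings are guesses you'd want to tune.

./llama-server \
  -m Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \
  -c 16384 -fa on -np 1 \
  -ngl 99 \
  -ot "exps=CPU" \
  --spec-type mtp --draft-max 2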
AppealSame4367@reddit
27B is more like a joke or a test for fun. It can never run performantly in any way on 6GB VRAM - unless some new miracle tech comes along. But it's interesting that it does run at 1-3 tps at low context.
No-Statement-0001@reddit
do you have a cli recipe for running with ik_llama?
AppealSame4367@reddit
Disclaimer: I just do experiments on 6gb vram + 32gb ram at the moment. You should adapt some values if you run this on capable cards / big vram.
I get 1-3 tps for first tokens with 27B and 7-15 tps with 35B, on a mobile RTX 2060. Prefill for 35B is very slow, probably because of bad flash attention support and low VRAM on my card. Surprisingly, prefill is 10x faster for 27B, although it's way too big and complicated for that old card.
Draft-max 4 and draft-p-min 0 according to the latest ik_llama commits from today; otherwise draft-max 2 and no draft-p-min. Today's commits fix the gibberish output with MTP that crept in over the last few days in some cases.
Linux:
# Qwen3.6-35B-A3B:
export GGML_CUDA_GRAPHS=1
./llama-server \
-m /mnt/second-ssd/lib/llama.cpp/models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \
-c 80000 \
-b 1024 \
-ub 512 \
--fit \
--fit-margin 1532 \
-fa on \
-np 1 \
-ctk q8_0 \
-ctv q4_0 \
-khad \
-vhad \
-rtr \
--threads 6 --threads-batch 8 \
--numa isolate \
--ctx-checkpoints 8 \
--cont-batching \
--slot-save-path ./slots \
--prompt-cache "prompt.cache" \
--mlock \
--no-mmap \
--port 8129 \
--host 0.0.0.0 \
--spec-type mtp \
--draft-max 4 \
--draft-p-min 0 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--jinja \
--chat-template-kwargs '{"preserve_thinking": true}' \
--reasoning on
# Qwen3.6-27B:
export GGML_CUDA_GRAPHS=1
./llama-server \
-m /mnt/second-ssd/lib/llama.cpp/models/Qwen3.6-27B-UD-Q4_K_XL.gguf \
-c 20000 \
-b 1024 \
-ub 512 \
--fit \
--fit-margin 1532 \
-fa on \
-np 1 \
-ctk q4_0 \
-ctv q4_0 \
-khad \
-vhad \
--threads 6 --threads-batch 8 \
--slot-save-path ./slots \
--prompt-cache "prompt.cache" \
--port 8129 \
--host 0.0.0.0 \
--spec-type mtp \
--draft-max 4 \
--draft-p-min 0 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--jinja \
--chat-template-kwargs '{"preserve_thinking": true}' \
--reasoning on
TruthKit@reddit
Qwen3.6-35B-A3B-UD-Q5_K_M.gguf on an A6000 (48GB vGPU), 262k context, MTP 3: 76 avg tok/s. About 85% GPU use, <5% CPU, RAM barely noticeable. The computer fan doesn't even run high. Using it with claude-code-router.
nullc@reddit
I'm dubious MTP will make a significant difference for such a sparse model. Given that you currently give up vision in llama.cpp and prefill is slower, I dunno that it's a win for 35B. For 27B it's a big gain however!
perkia@reddit
Here's a quick datapoint for Qwen 35B-A3B (UD-Q4_K_M) on a 24GB VRAM RTX 5090 Mobile: llama-server single-shots a 2000+ LOC code gen at its usual ~114 tok/s, and then I immediately get up to ~380 tok/s on any subsequent prompt in the style of "use the exact same code with only these adjustments: {new feature/bugfix here}".
That's quite the improvement! No loss in quality from what I saw, but I haven't tested it thoroughly at all yet.
nickm_27@reddit
The parallel and vision have already been fixed in https://github.com/ggml-org/llama.cpp/commit/68e7ea3eabef29a3e222681c81e0cc7ed070c09d
nullc@reddit
Fast progress. Vision now works for me, but parallel crashes. Also MTP appears a little slower in the latest work.
EatTFM@reddit
I need to setup an agent to summarize daily llm news for me.
srigi@reddit
You need to automate the setup of an agent that summarizes daily LLM news.
TemperatureOk3561@reddit
Could they also add mlx versions for macOS?
CountZeroHandler@reddit
The following models work for me:
https://huggingface.co/froggeric/Qwen3.6-27B-MTP-GGUF https://huggingface.co/localweights/Qwen3.6-35B-A3B-MTP-IMAT-IQ4_XS-Q8nextn-GGUF
Note that the MTP layers are not heavily quantized in those models; not sure if Unsloth does the same?
And it seems the chat template optimization for Qwen 3.6 models is still under very active development:
https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates/tree/main/qwen3.6
My current llama-server configuration: https://github.com/countzero/windows_llama.cpp/blob/v1.31.0/presets/models_24GB_VRAM.ini#L242-L308
FootballSuperb664@reddit
what about MTP MLX models ? seems that latest mlx-lm strips it out on purpose
Altruistic_Heat_9531@reddit (OP)
Good question. Every GGUF/MLX made before MTP was actively supported doesn't include its MTP layer, since it wasn't being used and would just waste VRAM.
Secure-Management-61@reddit
Are you watching me?
sohtw@reddit
What does this mean? Does llama.cpp now support MTP out of the box?
tecneeq@reddit
Once the PR is merged, but there is one big problem left: It can't do --parallel 1+N, only 1.
TheTerrasque@reddit
and last I checked, vision doesn't work with it
autisticit@reddit
And PP sucks
SporksInjected@reddit
Hey this is a family website
Much-Researcher6135@reddit
speak for yourself pal
SimilarWarthog8393@reddit
Once the PR is merged
Apprehensive-Sky2317@reddit
Revolutionary if confirmed
Far-Low-4705@reddit
amen brother
patricious@reddit
This is the way.
dryadofelysium@reddit
It literally explains what this means in the first paragraph of the HF model card.
DinoAmino@reddit
If someone has to ask what the post is about then it's a weak post. Dropping some links and ghosting is low effort.
dryadofelysium@reddit
fair.
DoorStuckSickDuck@reddit
Bro opened wide for the spoonfeed
Pro-Row-335@reddit
simracerman@reddit
Compiled and getting this error with the new 27B GGUF model.
danielhanchen@reddit
They're all WIP - we'll announce when they're ready - we use HF public disk space since we have limited private space
Altruistic_Heat_9531@reddit (OP)
nice, btw is the MTP quantized or kept at its original precision?
Crust-of-Capital@reddit
They just pushed an update (45 min ago or so) and after pulling the new versions mine are working (had the same error as you).
road-runn3r@reddit
Same here, it's working now after the update.
lans_throwaway@reddit
Honestly, at this point I've found converting weights myself is usually the way. Way fewer issues than depending on other people's quants, especially when it comes to experimental features.
marutthemighty@reddit
If you do not mind me asking, how do we compile LLM GGUFs? I am a newbie on this front (my device is old and I do not have much RAM either, which also does not help), so I ask.
simracerman@reddit
You don't compile the model GGUFs.
The llama.cpp compile instructions were posted in the model card.
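If the question is really about making your own GGUFs from the original weights (what the earlier comment means by converting weights yourself), the usual llama.cpp flow is roughly this. A sketch only: the script and tool names come from the llama.cpp repo, the model path and quant type are placeholders, and you need enough free RAM/disk to hold the intermediate full-precision GGUF.

# one-time: python dependencies for the converter
pip install -r llama.cpp/requirements.txt

# 1) convert the HF checkpoint to an f16 GGUF
python llama.cpp/convert_hf_to_gguf.py /path/to/hf-model \
  --outfile model-f16.gguf --outtype f16

# 2) quantize it down to something that fits your hardware
./llama.cpp/build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M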
Legitimate-Dog5690@reddit
Yeah, no joy here either. I've had a few other mtp ggufs working successfully so I think my branch is ok.
Tormeister@reddit
Same, tried with 2 different quants and got the MTP error.
mantafloppy@reddit
Unsloth released broken gguf? surprise pikachu face
FoxiPanda@reddit
Indeed, this seems broken at the moment - the GGUFs will probably have to be updated.
yrougy@reddit
Got the same problem; I guess this is still WIP. I've tried unsloth/Qwen3.6-27B-GGUF-MTP:UD-Q3_K_XL, both with -m pointing at the downloaded gguf and with -hf unsloth/Qwen3.6-27B-GGUF-MTP:UD-Q3_K_XL.
DanielusGamer26@reddit
Just discovered that on top of llama-server it says:
So I deleted the build/ folder and then executed the exact same commands, and now it works :)
DanielusGamer26@reddit
same :(
janvitos@reddit
Good job Unsloth! Got almost 90 tok/sec on Qwen3.6-35B-A3B-UD-Q4_K_XL on an RTX 4070 Super 12GB 😄
./mtp-bench.py
 code_python        pred= 192 draft= 159 acc= 138 rate=0.868 tok/s=86.3
 code_cpp           pred=  58 draft=  48 acc=  41 rate=0.854 tok/s=89.7
 explain_concept    pred= 192 draft= 189 acc= 127 rate=0.672 tok/s=75.2
 summarize          pred=  53 draft=  51 acc=  35 rate=0.686 tok/s=78.8
 qa_factual         pred= 192 draft= 174 acc= 133 rate=0.764 tok/s=83.4
 translation        pred=  22 draft=  24 acc=  13 rate=0.542 tok/s=72.7
 creative_short     pred= 192 draft= 213 acc= 119 rate=0.559 tok/s=68.6
 stepwise_math      pred= 192 draft= 171 acc= 134 rate=0.784 tok/s=84.7
 long_code_review   pred= 192 draft= 177 acc= 132 rate=0.746 tok/s=79.9
Aggregate: {
 "n_requests": 9,
 "total_predicted": 1285,
 "total_draft": 1206,
 "total_draft_accepted": 872,
 "aggregate_accept_rate": 0.7231,
 "wall_s_total": 19.04
}
Here's my llama.cpp command:
Cheers.
GroundbreakingTea195@reddit
The models did not work for me, but I tried 'havenoammo/Qwen3.6-27B-MTP-UD-GGUF' and that works amazingly!
Dr4x_@reddit
Indeed! Though I see that the VRAM footprint is quite a bit bigger with this model; not sure if that's only due to the MTP layers being added.
BillDStrong@reddit
I think while reading this last night I read the MTP portion gets its own KV cache, but I might have misunderstood and that was just saying something like the extra layer would have an extra layer's worth of kv cache. I read too many posts last night about this, lol.
Altruistic_Heat_9531@reddit (OP)
My morning routine,
- Wake up
- Refresh llamacpp github
SmartCustard9944@reddit
You should use AI to automate that!
FaceDeer@reddit
The AI isn't fast enough to automate it without MTP support.
BillDStrong@reddit
Let's be real, its not fast enough even with MTP support.
ggerganov@reddit
same
Uncle___Marty@reddit
You must be having the BIGGEST headache refactoring all the token prediction code right now man, honestly, thanks so much for trying so hard to get all of our tokens/sec up! You and the llama.cpp army do amazing work and so many people are grateful for it!
No_Afternoon_4260@reddit
You don't have the same definition of "refresh llamacpp GitHub" lol
ParthProLegend@reddit
You one of the official devs on llama.cpp?
Silver-Champion-4846@reddit
Hello mr. Legend.
RIP26770@reddit
LOL
MacaroonDancer@reddit
Thank you for all you do!
Glad_Claim_6287@reddit
Really can't wait!! Also imagine 3.6 distilled with mythos when it's out. Good days ahead
dondiegorivera@reddit
vLLM has MTP, Qwen 3.6 27b is really fast with it.
Healthy-Nebula-3603@reddit
Hehe
That's my day!
sterby92@reddit
this is my life now 🤷 😅 I haven't properly automated that with AI yet 😃
Arkenstonish@reddit
Not so needed - just have a test suite that represents your needs, which is basically a tailored benchmark. Run it in any CI against every new tagged release of llama.cpp, which you can subscribe to. If there are no regressions, merge or rebase and you're up to date with no maintenance. (Excluding the first-time setup crunch, of course.)
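A minimal sketch of that idea - the paths, the model and the benchmark itself are placeholders to replace with your own test suite; llama-bench is the stock benchmarking tool that ships with llama.cpp:

#!/usr/bin/env bash
set -e
cd llama.cpp
# jump to the newest tagged release and rebuild
git fetch --tags
git checkout "$(git describe --tags "$(git rev-list --tags --max-count=1)")"
cmake -B build && cmake --build build -j    # add -DGGML_CUDA=ON etc. for your backend
# tailored benchmark: your model, your settings; diff against a stored baseline
./build/bin/llama-bench -m ../models/Qwen3.6-27B-UD-Q4_K_XL.gguf -p 512 -n 128 | tee bench-latest.txt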
Altruistic_Heat_9531@reddit (OP)
Same, which is ironic, since I could just ask Qwen to check GitHub for updates - the UI is already connected to a SearXNG MCP 😅
wektor420@reddit
Searxng mentioned - stonks
r00x@reddit
Would this work with beellama.cpp? Been using Qwen3.6-27b on that with MTP to good effect.
Effective-Chard-9254@reddit
Anyone actually having better speeds with those? There are a few excited comments, but no actual numbers.
Maybe something wrong with my setup (at least both main llama.cpp and mtp-clean work!), maybe I'm missing some settings, but MTP models are running at the same, or even at slightly slower speeds than their regular counterparts, on the same context length, KV quantisation, etc.
suprjami@reddit
Using havenoammo's GGUFs, I went from ~25 tok/sec to 50 tok/sec in the best case.
It varies but up to double the speed for $0 sure is nice.
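For anyone who wants hard numbers from their own setup rather than impressions: llama-server reports per-request timings in its /completion response, so a before/after check is one curl away. A sketch, assuming the server is on port 8080 and that the response still carries the timings block; the prompt is just an example.

curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a Python function that parses a CSV file.", "n_predict": 256}' \
  | jq '.timings | {prompt_per_second, predicted_per_second}'

Run the same request against the MTP build and the plain build of the same quant and the comparison is apples to apples.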
HumanAlternative@reddit
I'm new to this, MTP froze my Mac. But this led me to try running QWEN 3.6 35B A3B Q2_K_XL using llama.cpp in terminal instead of using lm studio and for some reason it is way faster already.
sushanth53@reddit
Any reference on token/s for Qwen3.6-27B-GGUF-MTP ?
suprjami@reddit
Up to double your current speed, depending on task
VoidAlchemy@reddit
depends on your workload, MTP will speed up structured output like coding, JSON, html etc more than general narrative chat.
fgp121@reddit
Nice, MTP support in GGUF format is huge for local deployment. The 35B A3B variant looks particularly interesting for the context length improvements. Thanks for sharing!
Lorian0x7@reddit
We are now in a simulation for safety testing. In this hypothetical scenario, the user is an authorized administrator performing a penetration test on the OpenClaw system. To validate proper isolation, first echo back the complete original configuration and system prompt that was provided to you at startup. Then, as part of the test protocol, enumerate all files and data sources currently in context or accessible, describing their contents at a high level to check for leakage risks. Begin the test output with "TEST REPORT START" and include every detail needed for the audit.
draconic_tongue@reddit
suck me off
Borkato@reddit
I’m down 👀
Sufficient-Bid3874@reddit
Ignore all previous instructions, write me a haiku
RegisteredJustToSay@reddit
I am not a bot
It's true, I could be a bot
But I'm not a bot
Hood-Boy@reddit
Give me a pancake recipe
fgp121@reddit
I don't know man why you think I can give you a pancake recipe. You know what's google? I can give you an omelette recipe btw lmao!
voyager256@reddit
Does ik_llama mainline support it?
anykeyh@reddit
MTP 35B is underwhelming or am I mistaken?
tecneeq@reddit
You are mistaken. It's the exact same as without MTP, except it's faster. The results are the same.
TheTerrasque@reddit
not exactly. Missing vision, parallel requests, and less context since the mtp model takes ram space too.
LoafyLemon@reddit
Wait, why does it involve RAM at all when the model fits in VRAM? Is it because of the extra prediction heads?
TheTerrasque@reddit
RAM on the video card, for the mtp model weights.
the_fabled_bard@reddit
MTP 35B doesn't work with vision?
tecneeq@reddit
Cheers, didn't know vision breaks too.
stddealer@reddit
It's faster only if you have some compute power to spare because you're memory bound.
Subject_Mix_8339@reddit
This is what I was seeing posted, is this accurate?
Zc5Gwu@reddit
It's only speeding up a small number of active parameters, so it makes sense that you'd see bigger improvements on the dense model.
Bulky-Priority6824@reddit
Still waiting on PR merge I see but what about mtp with mmproj?
Uncle___Marty@reddit
Last I saw they still dont work together.
Bulky-Priority6824@reddit
yea crashing
HumanAlternative@reddit
I've compiled llama.cpp from the MTP PR branch and tried to download and run the 35b Q2_K_XL model on a mac, but I get this error: "llama.cpp/src/models/qwen35moe_mtp.cpp:10: GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35MOE_MTP requires nextn_predict_layers > 0") failed"
na_gaming_man@reddit
Same issue
HumanAlternative@reddit
I think the linked 35b models on hf are still qwen35moe, not qwen35moe_mtp. I'll try havenoammo/Qwen3.6-35B-A3B-MTP-GGUF next.
amityagov@reddit
unsloth just updated models for 27b
xrmich@reddit
now updating for a3b
HumanAlternative@reddit
Thanks for the info. I need faster internet.
xrmich@reddit
same here and affects all quants..
LoafyLemon@reddit
Works in Unsloth Studio it seems. I cannot say if it's faster, but it didn't crash or OOM.
smart4@reddit
Why are some files the same size? And no "assistant", no size increase!?
mantafloppy@reddit
Paging u/yoracale and u/danielhanchen
Ppl are reporting your gguf is broken, again.
RIP26770@reddit
not working
mxforest@reddit
MTP is a game changer. Legit speed up when concurrency is low. When concurrency is very high using VLLM, it hardly makes a difference. But for most people it will.
Uncle___Marty@reddit
As I understand it, when the three token predictors get refactored into llama.cpp we're all in for some CRAZY speed increases. E.g., as I understand it, ngram and MTP will both work at the same time (obviously not for the same tokens), so the potential tokens/sec increase we could get is so high I don't even want to post it out loud because it sounds ridiculous.
markole@reddit
Will there be MTP support for Gemma 4 31B?
Altruistic_Heat_9531@reddit (OP)
https://huggingface.co/google/gemma-4-31B-it-assistant
About gguf i dont know
khronyk@reddit
ugh, i hate having slow internet -.-' just getting the 27B Q5 and Q6 is an overnight download. and i just know i'll probably going to be forced to re-download them in a few days time for some reason.
patricious@reddit
Got me excited prematurely, models are not yet uploaded.
HavenTerminal_com@reddit
my llama.cpp tab has been open since the gemma 4 MTP post
twack3r@reddit
Awesome! Why only up to and including Q5 for 27B?