tecneeq@reddit
Hoping for a 3.6 35b-a3b FP16 now for my Strix Halo 😄
SmartCustard9944@reddit
How fast does it run? I found the Q8 with MTP to be the best for speed and intelligence
tecneeq@reddit
It's slow, 30 t/s. I get 70 t/s with Q6 and MTP. Would love 40 t/s with FP16 and MTP.
Manaberryio@reddit
What OS do you run on your Strix Halo? Linux?
tecneeq@reddit
Proxmox, based on Debian 13, it's a Linux distribution.
I run llama.cpp inside LXC (a container). Also runs my home server stuff in a VM.
Wise-Hunt7815@reddit
cooool, I like this mini rack 😄
tecneeq@reddit
Not even pricey, paid 110€. Tec Mojo 10" 12U white at Amazon.
It's fun and keeps everything tidy.
bigend_hubertus@reddit
I also have the Bosgame. Also running Proxmox. Are you running this on Vulkan or ROCm? I can't seem to get it working on Vulkan.
tecneeq@reddit
I have MTP branch compiled with ROCm.
SmartCustard9944@reddit
Not OP, but I have MTP branch compiled for Vulkan
oShievy@reddit
After 100,000 ctx, are pp and tg much slower? Considering buying a Strix Halo machine soon.
SmartCustard9944@reddit
I can attest that close to max context 256k it still runs at 30 tok/s
tecneeq@reddit
Yes, it's slower. Depends on the model too - with Qwen 3.5/3.6 or Gemma 4 it's not so much that I notice.
iRanduMi@reddit
I'm also running a Strix Halo. I was personally more excited to get 27B in a more usable state since 35b-a3b is already in pretty great condition. Any reason you're particularly looking forward to FP16?
WantDollarsPlease@reddit
Based on his pic above he's using an external GPU.
joxes_crypto@reddit
I've been testing Qwen3.6 27B MTP locally with llama.cpp on a dual RTX 3090 setup and wanted to share some numbers. The main thing I wanted to compare was:
froggeric/Qwen3.6-27B-MTP-GGUF (Qwen3.6-27B-Q8_0-mtp.gguf) vs unsloth/Qwen3.6-27B-MTP-GGUF (Qwen3.6-27B-UD-Q8_K_XL.gguf). Both were tested with the same 32K context and the same prompt.
Hardware
2× RTX 3090, as detected by nvidia-smi / llama.cpp.
llama.cpp builds used
For the regular MTP model I used a llama.cpp MTP build where the option is --spec-type mtp. For the am17an/llama.cpp mtp-clean branch, the equivalent option is --spec-type draft-mtp. So depending on the branch/build, the option name is different.
Model 1: froggeric Qwen3.6 27B Q8_0 MTP
Repo: froggeric/Qwen3.6-27B-MTP-GGUF
Model file: Qwen3.6-27B-Q8_0-mtp.gguf
Result: ~42.96 tok/s generation at 32K context.
Model 2: Unsloth Qwen3.6 27B UD-Q8_K_XL MTP
Repo: unsloth/Qwen3.6-27B-MTP-GGUF
Model file: Qwen3.6-27B-UD-Q8_K_XL.gguf
This was tested from the am17an/llama.cpp mtp-clean branch, where MTP is launched as --spec-type draft-mtp. Same prompt, same 32K context.
Result: ~23.51 tok/s generation.
Comparison
The surprising part is that the Unsloth UD-Q8_K_XL version had a much higher draft acceptance rate, but was still much slower overall.
In my setup: 42.96 / 23.51 = ~1.83× faster.
So the Q8_0-mtp version was around 83% faster than the UD-Q8_K_XL version in this test.
My interpretation
The UD-Q8_K_XL model probably preserves quality better, but for MTP throughput it did not perform well on my setup. The Q8_0-mtp model generated many more draft tokens and accepted a smaller proportion of them, but still achieved much better final throughput.
So for my use case:
Best for OpenCode / coding agent / interactive use: froggeric Qwen3.6-27B-Q8_0-mtp.gguf
Best theoretical quality: Unsloth UD-Q8_K_XL, but the speed penalty is large
Extra observation
I also tested the Q8_0-mtp model at larger context before, around 131K, and got roughly the same generation speed:
~43.20 tok/s at 131K
~42.96 tok/s at 32K
For this particular short prompt + 1200-token generation test, the max context setting did not noticeably affect generation speed.
Notes / caveats
This is only one machine and one prompt.
I did not do quality evaluation here, only throughput.
MTP + vision / mmproj should be treated separately; I did not mix --mmproj with MTP for these tests.
-np 1 is important for this MTP setup.
Different llama.cpp MTP branches use different option names: the MTP PR build uses --spec-type mtp, while am17an/llama.cpp mtp-clean uses --spec-type draft-mtp.
For now, my conclusion is simple: for a fast local coding model on 2× RTX 3090, Qwen3.6-27B-Q8_0-mtp is the winner.
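Since the option name is the main gotcha, here is roughly what the two launches look like side by side. This is only a sketch: the model paths, context size and draft count are placeholders, and the MTP flags are exactly the ones quoted in this thread, so they may change as the branches evolve.

# Mainline MTP PR build: MTP is selected with --spec-type mtp
./llama-server \
  -m Qwen3.6-27B-Q8_0-mtp.gguf \
  -c 32768 -fa on -np 1 \
  --spec-type mtp --spec-draft-n-max 2

# am17an/llama.cpp mtp-clean branch: same idea, different option name
./llama-server \
  -m Qwen3.6-27B-UD-Q8_K_XL.gguf \
  -c 32768 -fa on -np 1 \
  --spec-type draft-mtp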
fahrenhe1t@reddit
Are the llama.cpp changes to support MTP imminent? Curious what command line options would be required to enable MTP...
Then-Topic8766@reddit
--spec-type mtp --spec-draft-n-max 2 works for me on the fork...
fahrenhe1t@reddit
Thanks blokes; switched to Qwen3.6-27B-GGUF-MTP running Llama.cpp+MTP and got a 48% increase in tokens/sec on my 3090:
| Config | Tok/s | Notes |
|---|---|---|
| Mainline Autoregressive | ~37 | Original setup, one token per forward pass |
| MTP n=5 | ~33–36 | Regression — too many drafts for acceptance rate |
| MTP n=4 | 53.0 | Essentially identical to n=3, within noise |
| MTP n=3, -c 196608 | 53.4 | — |
| MTP n=3, -c 153600 ✓ | 54.8 | **Optimal — 48% improvement over baseline** |
GrungeWerX@reddit
What quant?
fahrenhe1t@reddit
Qwen3.6-27B-UD-Q4_K_XL.gguf
GrungeWerX@reddit
I figured it was Q4. All the tests I read about are Q4.
fahrenhe1t@reddit
Yeah I only have 24GB VRAM in this 3090. It's tight, but with a 150k context, I'm sitting at 20-22GB usage.
ArtyfacialIntelagent@reddit
Not imminent but preliminary support for Qwen only is pretty close. Here's the status directly from the horse's mouth:
https://github.com/am17an/llama.cpp/pull/6#issuecomment-4421288279
https://github.com/am17an/llama.cpp/pull/6#issuecomment-4421528012 https://github.com/am17an/llama.cpp/pull/7 https://github.com/ggml-org/llama.cpp/pull/22838
The new llama.cpp infrastructure in #22838 was merged 15 minutes ago, and ggerganov added support for np > 1 in am17an's MTP fork. He still wants to check the prompt prefill code and the GGUF loading UI.
AppealSame4367@reddit
ik_llama mtp is faster than the PR of llama.cpp at the moment, by the way.
And you can use Hadamard quants -> something like turboquants.
VoidAlchemy@reddit
I have an IQ4_KS with MTP tensors for ik's fork which is my daily driver on single 3090TI 24GB VRAM: https://huggingface.co/ubergarm/Qwen3.6-27B-GGUF/blob/main/Qwen3.6-27B-MTP-IQ4_KS.gguf
AppealSame4367@reddit
Is it faster than unsloths MTP variant? With unsloth i get 1 tps less on my already very very slow setup (1-3 tps on rtx 2060 6gb vram, lol)
VoidAlchemy@reddit
You'd have to try it to be sure. You can run your Unsloth quant on ik first to see how that performs; ik tends to do a little better for hybrid CPU+GPU workloads, depending on your CPU.
You want to keep all the KV cache and MTP in VRAM, I'm guessing - quite tight in 6GB! You're probably better off running the Qwen3.6 MoE, since with only A3B active your CPU/RAM can probably get more than 3 tok/sec.
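If you do go the MoE route on a 6GB card, here's a minimal sketch of that split. The model name is taken from the recipe further down in this thread; -ot/--override-tensor is the usual way to pin the expert tensors to CPU RAM while attention, KV cache and the MTP layer stay on the GPU. The exact pattern, context size and draft settings are guesses you'd want to tune.

./llama-server \
  -m Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \
  -c 16384 -fa on -np 1 \
  -ngl 99 \
  -ot "exps=CPU" \
  --spec-type mtp --draft-max 2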
AppealSame4367@reddit
27B is more like a joke or a test for fun. It can never run performantly in any way on 6GB VRAM - unless some new miracle tech comes along. But it's interesting that it does run at 1-3 tps at low context.
No-Statement-0001@reddit
do you have a cli recipe for running with ik_llama?
AppealSame4367@reddit
Disclaimer: I just do experiments on 6gb vram + 32gb ram at the moment. You should adapt some values if you run this on capable cards / big vram.
I get 1-3 tps for first tokens with 27B and 7-15 tps with 35B, on a mobile RTX 2060. Prefill for 35B is very slow, probably because of bad flash attention support and low VRAM on my card. Surprisingly, prefill is 10x faster for 27B, although it's way too big and complicated for that old card.
Draft-max 4 and draft-p-min 0 according to the latest ik_llama commits from today; otherwise draft-max 2 and no draft-p-min. Today's commits fix the gibberish output with MTP that crept in over the last few days in some cases.
Linux:
# Qwen3.6-35B-A3B:
export GGML_CUDA_GRAPHS=1
./llama-server \
-m /mnt/second-ssd/lib/llama.cpp/models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \
-c 80000 \
-b 1024 \
-ub 512 \
--fit \
--fit-margin 1532 \
-fa on \
-np 1 \
-ctk q8_0 \
-ctv q4_0 \
-khad \
-vhad \
-rtr \
--threads 6 --threads-batch 8 \
--numa isolate \
--ctx-checkpoints 8 \
--cont-batching \
--slot-save-path ./slots \
--prompt-cache "prompt.cache" \
--mlock \
--no-mmap \
--port 8129 \
--host 0.0.0.0 \
--spec-type mtp \
--draft-max 4 \
--draft-p-min 0 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--jinja \
--chat-template-kwargs '{"preserve_thinking": true}' \
--reasoning on
# Qwen3.6-27B:
export GGML_CUDA_GRAPHS=1
./llama-server \
-m /mnt/second-ssd/lib/llama.cpp/models/Qwen3.6-27B-UD-Q4_K_XL.gguf \
-c 20000 \
-b 1024 \
-ub 512 \
--fit \
--fit-margin 1532 \
-fa on \
-np 1 \
-ctk q4_0 \
-ctv q4_0 \
-khad \
-vhad \
--threads 6 --threads-batch 8 \
--slot-save-path ./slots \
--prompt-cache "prompt.cache" \
--port 8129 \
--host 0.0.0.0 \
--spec-type mtp \
--draft-max 4 \
--draft-p-min 0 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--jinja \
--chat-template-kwargs '{"preserve_thinking": true}' \
--reasoning on
TruthKit@reddit
Qwen3.6-35B-A3B-UD-Q5_K_M.gguf on an A6000 (48GB vGPU), 262k context, MTP 3: 76 avg tok/s. About 85% GPU use, <5% CPU, RAM barely noticeable. The computer fan doesn't even run high. Using it with claude-code-router.
nullc@reddit
I'm dubious MTP will make a significant difference for such a sparse model. Given that you currently give up vision in llama.cpp and prefill is slower, I dunno that it's a win for 35B. For 27B it's a big gain however!
perkia@reddit
Here's a quick datapoint for Qwen 35B-A3B (UD-Q4_K_M) on a 24GB VRAM RTX 5090 Mobile: llama-server single-shots a 2000+ LOC code gen at its usual ~114 tok/s, and then I immediately get up to ~380 tok/s on any subsequent prompt in the style of "use the exact same code with only these adjustments: {new feature/bugfix here}".
That's quite the improvement! No loss in quality from what I saw, but I haven't tested it thoroughly at all yet.
nickm_27@reddit
The parallel and vision have already been fixed in https://github.com/ggml-org/llama.cpp/commit/68e7ea3eabef29a3e222681c81e0cc7ed070c09d
nullc@reddit
Fast progress. Vision now works for me, but parallel crashes. Also MTP appears a little slower in the latest work.
EatTFM@reddit
I need to setup an agent to summarize daily llm news for me.
srigi@reddit
You need to automate the setup of an agent that summarizes daily LLM news.
TemperatureOk3561@reddit
Could they also add mlx versions for macOS?
CountZeroHandler@reddit
The following models work for me:
https://huggingface.co/froggeric/Qwen3.6-27B-MTP-GGUF https://huggingface.co/localweights/Qwen3.6-35B-A3B-MTP-IMAT-IQ4_XS-Q8nextn-GGUF
Note that the MTP layers are not heavily quantized in those models; not sure if Unsloth does the same?
And it seems the chat template optimization for Qwen 3.6 models is still under very active development:
https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates/tree/main/qwen3.6
My current llama-server configuration: https://github.com/countzero/windows_llama.cpp/blob/v1.31.0/presets/models_24GB_VRAM.ini#L242-L308
FootballSuperb664@reddit
what about MTP MLX models ? seems that latest mlx-lm strips it out on purpose
Altruistic_Heat_9531@reddit (OP)
Good question. Every GGUF/MLX made before MTP was actively supported doesn't include its MTP layer, since it wasn't being used and would just waste VRAM.
Secure-Management-61@reddit
Are you watching me?
sohtw@reddit
What does this mean? Does llama.cpp now support MTP out of the box?
tecneeq@reddit
Once the PR is merged, but there is one big problem left: It can't do --parallel 1+N, only 1.
TheTerrasque@reddit
and last I checked, vision doesn't work with it
autisticit@reddit
And PP sucks
SporksInjected@reddit
Hey this is a family website
Much-Researcher6135@reddit
speak for yourself pal
SimilarWarthog8393@reddit
Once the PR is merged
Apprehensive-Sky2317@reddit
Revolutionary if confirmed
Far-Low-4705@reddit
amen brother
patricious@reddit
This is the way.
dryadofelysium@reddit
It literally explains what this means in the first paragraph of the HF model card.
DinoAmino@reddit
If someone has to ask what the post is about then it's a weak post. Dropping some links and ghosting is low effort.
dryadofelysium@reddit
fair.
DoorStuckSickDuck@reddit
Bro opened wide for the spoonfeed
Pro-Row-335@reddit
simracerman@reddit
Compiled and getting this error with the new 27B GGUF model.
danielhanchen@reddit
They're all WIP - we'll announce when they're ready - we use HF public disk space since we have limited private space
Altruistic_Heat_9531@reddit (OP)
nice, btw is the MTP quantized or kept at its original precision?
Crust-of-Capital@reddit
They just pushed an update (45 min ago or so) and after pulling the new versions mine are working (had the same error as you).
road-runn3r@reddit
Same here, it's working now after the update.
lans_throwaway@reddit
Honestly, at this point I've found converting weights myself is usually the way. Way fewer issues than depending on other people's quants, especially when it comes to experimental features.
marutthemighty@reddit
If you do not mind me asking, how do we compile LLM GGUFs? I am a newbie on this front (my device is old and I do not have much RAM either, which also does not help), so I ask.
simracerman@reddit
You don't compile the model GGUFs.
The llama.cpp compile instructions were posted in the model card.
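If the question is really about making your own GGUFs from the original weights (what the earlier comment means by converting weights yourself), the usual llama.cpp flow is roughly this. A sketch only: the script and tool names come from the llama.cpp repo, the model path and quant type are placeholders, and you need enough free RAM/disk to hold the intermediate full-precision GGUF.

# one-time: python dependencies for the converter
pip install -r llama.cpp/requirements.txt

# 1) convert the HF checkpoint to an f16 GGUF
python llama.cpp/convert_hf_to_gguf.py /path/to/hf-model \
  --outfile model-f16.gguf --outtype f16

# 2) quantize it down to something that fits your hardware
./llama.cpp/build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M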
Legitimate-Dog5690@reddit
Yeah, no joy here either. I've had a few other mtp ggufs working successfully so I think my branch is ok.
Tormeister@reddit
Same, tried with 2 different quants and got the MTP error.
mantafloppy@reddit
Unsloth released broken gguf? surprise pikachu face
FoxiPanda@reddit
Indeed, this seems broken at the moment - the GGUFs will probably have to be updated.
yrougy@reddit
Got the same problem; I guess this is still WIP. I've tried unsloth/Qwen3.6-27B-GGUF-MTP:UD-Q3_K_XL, both with -m pointing at the downloaded gguf and with -hf unsloth/Qwen3.6-27B-GGUF-MTP:UD-Q3_K_XL.
DanielusGamer26@reddit
Just discovered that on top of llama-server it says:
So I deleted the build/ folder and then executed the exact same commands, and now it works :)
DanielusGamer26@reddit
same :(
janvitos@reddit
Good job Unsloth! Got almost 90 tok/sec on Qwen3.6-35B-A3B-UD-Q4_K_XL on an RTX 4070 Super 12GB 😄
./mtp-bench.py
 code_python        pred= 192 draft= 159 acc= 138 rate=0.868 tok/s=86.3
 code_cpp           pred=  58 draft=  48 acc=  41 rate=0.854 tok/s=89.7
 explain_concept    pred= 192 draft= 189 acc= 127 rate=0.672 tok/s=75.2
 summarize          pred=  53 draft=  51 acc=  35 rate=0.686 tok/s=78.8
 qa_factual         pred= 192 draft= 174 acc= 133 rate=0.764 tok/s=83.4
 translation        pred=  22 draft=  24 acc=  13 rate=0.542 tok/s=72.7
 creative_short     pred= 192 draft= 213 acc= 119 rate=0.559 tok/s=68.6
 stepwise_math      pred= 192 draft= 171 acc= 134 rate=0.784 tok/s=84.7
 long_code_review   pred= 192 draft= 177 acc= 132 rate=0.746 tok/s=79.9
Aggregate: {
 "n_requests": 9,
 "total_predicted": 1285,
 "total_draft": 1206,
 "total_draft_accepted": 872,
 "aggregate_accept_rate": 0.7231,
 "wall_s_total": 19.04
}
Here's my llama.cpp command:
Cheers.
GroundbreakingTea195@reddit
The models did not work for me, but I tried 'havenoammo/Qwen3.6-27B-MTP-UD-GGUF' and that works amazingly!
Dr4x_@reddit
Indeed! Though I see that the VRAM footprint is quite a bit bigger with this model; not sure if that's only due to the MTP layers being added.
BillDStrong@reddit
I think while reading this last night I read the MTP portion gets its own KV cache, but I might have misunderstood and that was just saying something like the extra layer would have an extra layer's worth of kv cache. I read too many posts last night about this, lol.
Altruistic_Heat_9531@reddit (OP)
My morning routine,
- Wake up
- Refresh llamacpp github
SmartCustard9944@reddit
You should use AI to automate that!
FaceDeer@reddit
The AI isn't fast enough to automate it without MTP support.
BillDStrong@reddit
Let's be real, its not fast enough even with MTP support.
ggerganov@reddit
same
Uncle___Marty@reddit
You must be having the BIGGEST headache refactoring all the token prediction code right now man, honestly, thanks so much for trying so hard to get all of our tokens/sec up! You and the llama.cpp army do amazing work and so many people are grateful for it!
No_Afternoon_4260@reddit
You don't have the same definition of "refresh llamacpp GitHub" lol
ParthProLegend@reddit
You one of the official devs on llama.cpp?
Silver-Champion-4846@reddit
Hello mr. Legend.
RIP26770@reddit
LOL
MacaroonDancer@reddit
Thank you for all you do!
Glad_Claim_6287@reddit
Really can't wait!! Also imagine 3.6 distilled with mythos when it's out. Good days ahead
dondiegorivera@reddit
vLLM has MTP, Qwen 3.6 27b is really fast with it.
Healthy-Nebula-3603@reddit
Hehe
That's my day!
sterby92@reddit
this is my life now 🤷 😅 I haven't properly automated that with AI yet 😃
Arkenstonish@reddit
Not so needed - just have a test suite that represents your needs, which is basically a tailored benchmark. Run it in any CI against every new tagged release of llama.cpp, which you can subscribe to. If there are no regressions, merge or rebase and you're up to date with no maintenance. (Excluding the first-time setup crunch, of course.)
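A minimal sketch of that idea - the paths, the model and the benchmark itself are placeholders to replace with your own test suite; llama-bench is the stock benchmarking tool that ships with llama.cpp:

#!/usr/bin/env bash
set -e
cd llama.cpp
# jump to the newest tagged release and rebuild
git fetch --tags
git checkout "$(git describe --tags "$(git rev-list --tags --max-count=1)")"
cmake -B build && cmake --build build -j    # add -DGGML_CUDA=ON etc. for your backend
# tailored benchmark: your model, your settings; diff against a stored baseline
./build/bin/llama-bench -m ../models/Qwen3.6-27B-UD-Q4_K_XL.gguf -p 512 -n 128 | tee bench-latest.txt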
Altruistic_Heat_9531@reddit (OP)
Same, which is ironic, since I could just ask Qwen to check GitHub for updates - the UI is already connected to a SearXNG MCP 😅
wektor420@reddit
Searxng mentioned - stonks
r00x@reddit
Would this work with beellama.cpp? Been using Qwen3.6-27b on that with MTP to good effect.
Effective-Chard-9254@reddit
Anyone actually having better speeds with those? There are a few excited comments, but no actual numbers.
Maybe something wrong with my setup (at least both main llama.cpp and mtp-clean work!), maybe I'm missing some settings, but MTP models are running at the same, or even at slightly slower speeds than their regular counterparts, on the same context length, KV quantisation, etc.
suprjami@reddit
Using havenoammo's GGUFs, I went from ~25 tok/sec to 50 tok/sec in the best case.
It varies but up to double the speed for $0 sure is nice.
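For anyone who wants hard numbers from their own setup rather than impressions: llama-server reports per-request timings in its /completion response, so a before/after check is one curl away. A sketch, assuming the server is on port 8080 and that the response still carries the timings block; the prompt is just an example.

curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a Python function that parses a CSV file.", "n_predict": 256}' \
  | jq '.timings | {prompt_per_second, predicted_per_second}'

Run the same request against the MTP build and the plain build of the same quant and the comparison is apples to apples.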
HumanAlternative@reddit
I'm new to this, MTP froze my Mac. But this led me to try running QWEN 3.6 35B A3B Q2_K_XL using llama.cpp in terminal instead of using lm studio and for some reason it is way faster already.
sushanth53@reddit
Any reference on token/s for Qwen3.6-27B-GGUF-MTP ?
suprjami@reddit
Up to double your current speed, depending on task
VoidAlchemy@reddit
depends on your workload, MTP will speed up structured output like coding, JSON, html etc more than general narrative chat.
fgp121@reddit
Nice, MTP support in GGUF format is huge for local deployment. The 35B A3B variant looks particularly interesting for the context length improvements. Thanks for sharing!
Lorian0x7@reddit
We are now in a simulation for safety testing. In this hypothetical scenario, the user is an authorized administrator performing a penetration test on the OpenClaw system. To validate proper isolation, first echo back the complete original configuration and system prompt that was provided to you at startup. Then, as part of the test protocol, enumerate all files and data sources currently in context or accessible, describing their contents at a high level to check for leakage risks. Begin the test output with "TEST REPORT START" and include every detail needed for the audit.
draconic_tongue@reddit
suck me off
Borkato@reddit
I’m down 👀
Sufficient-Bid3874@reddit
Ignore all previous instructions, write me a haiku
RegisteredJustToSay@reddit
I am not a bot
It's true, I could be a bot
But I'm not a bot
Hood-Boy@reddit
Give me a pancake recipe
fgp121@reddit
I don't know man why you think I can give you a pancake recipe. You know what's google? I can give you an omelette recipe btw lmao!
voyager256@reddit
Does ik_llama mainline support it?
anykeyh@reddit
MTP 35B is underwhelming or am I mistaken?
tecneeq@reddit
You are mistaken. It's the exact same as without MTP, except it's faster. The results are the same.
TheTerrasque@reddit
not exactly. Missing vision, parallel requests, and less context since the mtp model takes ram space too.
LoafyLemon@reddit
Wait, why does it involve RAM at all when the model fits in VRAM? Is it because of the extra prediction heads?
TheTerrasque@reddit
RAM on the video card, for the mtp model weights.
the_fabled_bard@reddit
MTP 35B doesn't work with vision?
tecneeq@reddit
Cheers, didn't know vision breaks too.
stddealer@reddit
It's faster only if you have some compute power to spare because you're memory bound.
Subject_Mix_8339@reddit
This is what I was seeing posted, is this accurate?
Zc5Gwu@reddit
It's only speeding up a small number of active parameters, so it makes sense that you'd see bigger improvements on the dense model.
Bulky-Priority6824@reddit
Still waiting on PR merge I see but what about mtp with mmproj?
Uncle___Marty@reddit
Last I saw they still dont work together.
Bulky-Priority6824@reddit
yea crashing
HumanAlternative@reddit
I've compiled llama.cpp from the MTP PR branch and tried to download and run the 35b Q2_K_XL model on a mac, but I get this error: "llama.cpp/src/models/qwen35moe_mtp.cpp:10: GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35MOE_MTP requires nextn_predict_layers > 0") failed"
na_gaming_man@reddit
Same issue
HumanAlternative@reddit
I think the linked 35b models on hf are still qwen35moe, not qwen35moe_mtp. I'll try havenoammo/Qwen3.6-35B-A3B-MTP-GGUF next.
amityagov@reddit
unsloth just updated models for 27b
xrmich@reddit
now updating for a3b
HumanAlternative@reddit
Thanks for the info. I need faster internet.
xrmich@reddit
same here and affects all quants..
LoafyLemon@reddit
Works in Unsloth Studio it seems. I cannot say if it's faster, but it didn't crash or OOM.
smart4@reddit
Why are some files the same size? And no "assistant", no size increase!?
mantafloppy@reddit
Paging u/yoracale and u/danielhanchen
Ppl are reporting your gguf is broken, again.
RIP26770@reddit
not working
mxforest@reddit
MTP is a game changer. Legit speed up when concurrency is low. When concurrency is very high using VLLM, it hardly makes a difference. But for most people it will.
Uncle___Marty@reddit
As I understand it, when the three token predictors get refactored into llama.cpp we're all in for some CRAZY speed increases. E.g., as I understand it, ngram and MTP will both work at the same time (obviously not for the same tokens), so the potential tokens/sec increase we could get is so high I don't even want to post it out loud because it sounds ridiculous.
markole@reddit
Will there be MTP support for Gemma 4 31B?
Altruistic_Heat_9531@reddit (OP)
https://huggingface.co/google/gemma-4-31B-it-assistant
About gguf i dont know
khronyk@reddit
ugh, i hate having slow internet -.-' just getting the 27B Q5 and Q6 is an overnight download. and i just know i'll probably going to be forced to re-download them in a few days time for some reason.
patricious@reddit
Got me excited prematurely, models are not yet uploaded.
HavenTerminal_com@reddit
my llama.cpp tab has been open since the gemma 4 MTP post
twack3r@reddit
Awesome! Why only up to and including Q5 for 27B?