Qwen 3.6 27B MTP on v100 32GB: 54 t/s
Posted by m94301@reddit | LocalLLaMA | View on Reddit | 15 comments
Just a quick note that I got a nice result using am17an's MTP branch of llama.cpp on a V100 32GB SXM module, using one of those PCIe card adapters. It pulled and built in one shot, and llama-server ran without a hitch.
Tested using am17an's MTP GGUF, with a q8_0 KV cache and a 200k context limit, acting as a VS Code copilot.
29-30 t/s without MTP
54-55 t/s with MTP, with a 150 W power limit on the card.
It falls to 40-45 t/s after choking down 50k tokens, but it does great with tool calls and subagents, and it has made some very insightful code reviews and refactors.
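For anyone who wants to try it, the invocation is roughly the shape below. This is a rough sketch rather than my exact command: mainline llama-server flag names are assumed, the model filename is a placeholder, and the branch's own flag for the MTP draft count (3) isn't shown since it isn't upstream yet.

```bash
# optional: cap the card at 150 W (needs root; -i picks the GPU index)
sudo nvidia-smi -i 0 -pl 150

# rough server invocation -- mainline llama-server flag names assumed,
# model filename is a placeholder, and the MTP branch's own flag for the
# number of draft tokens is omitted since it's not upstream yet.
./llama-server \
  -m Qwen3.6-27B-MTP.gguf \
  -ngl 99 \
  --ctx-size 200000 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --batch-size 2048 \
  --host 127.0.0.1 \
  --port 8080
```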
Thank you am17an! Can't wait to see this branch mature, this is great stuff.
m94301@reddit (OP)
Just for reference, I get 105-110 t/s on the 35B MoE with the same basic setup (MTP 3) and an identical card.
I do like the MoE, but it is not as good at coding, and it trapped itself building async calls, bouncing back and forth in an endless loop. So mostly I use the 27B for code and the 35B for quick reviews or junior-level patches; it is fine at those, and very quick.
burger4d@reddit
When the MTP features get merged into the main branch, will any flags need to be added to the llama.cpp command to turn it on?
ArtfulGenie69@reddit
More than flags: odds are the GGUF you are using has the MTP layers stripped out. That means you'll need a model with them intact, or maybe they will come up with a solution for Qwen where you can side-load them. You probably want to leave them at full BF16 or similar, though.
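If you want to check what a given GGUF actually kept, dumping the tensor names is the quickest way. Rough sketch only: gguf-dump comes from llama.cpp's gguf-py package, and the "mtp"/"nextn" patterns are my guesses at how the branch might name those tensors.

```bash
# list the tensors in the GGUF and look for MTP / next-token-prediction layers.
# gguf-dump ships with llama.cpp's gguf-py package (pip install gguf); the
# "mtp" / "nextn" patterns are guesses -- eyeball the full list if grep is empty.
gguf-dump Qwen3.6-27B-MTP.gguf | grep -iE 'mtp|nextn'
```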
rog-uk@reddit
Correct me if I am wrong, but wouldn't one expect Qwen3-Coder-Next 80B-A3B to run about as quickly? Did you use TurboQuant, and if so, was it useful to you (or did it make things worse)? Thanks for any reply :-)
QuackerEnte@reddit
Quick question: if one has low VRAM and the DENSE model spills into RAM, does MTP even speed anything up? Or would it rather slow things down here, since it still needs to verify a batch of 4 tokens using the WHOLE model anyway? I never really got the intuition for it. Speculative decoding is more or less the same, no?
ixdx@reddit
I ran several tests and I see a noticeable drop in prompt processing (pp) with MTP.
RTX 5070 Ti + RTX 5060 Ti
I created the GGUF using an imatrix from Bartowski (rough command sketch below).
For a long-context test, I asked it to analyze approximately 10,000 lines from /var/log/syslog.
Tested GGUFs: bartowski/Qwen3.6-27B-Q4_K_L and Qwen3.6-27B-Q4_K_L-MTP.
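For reference, the quant step is just llama-quantize pointed at the imatrix. Rough sketch with placeholder filenames; Q4_K_M is shown, and as I understand it the "_L" variants additionally hold the output and embedding tensors at q8_0.

```bash
# placeholder filenames; the imatrix file is the one Bartowski publishes.
# Q4_K_M shown -- for an "_L"-style quant you would also pass
# --output-tensor-type q8_0 --token-embedding-type q8_0 (as I understand it).
./llama-quantize --imatrix imatrix.dat \
  Qwen3.6-27B-F16.gguf Qwen3.6-27B-Q4_K_M.gguf Q4_K_M
```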
NoBee9598@reddit
Which HF repo do you use for this speed?
Enough_Big4191@reddit
54 t/s on a V100 still feels kind of absurd, honestly. The bigger thing for me is whether these setups stay reliable once agents start chaining tool calls and chewing through messy context windows for hours. Have you noticed any weird degradation in output quality at longer contexts, or mostly just the throughput drop?
tomByrer@reddit
Unlike DFlash, where you use a smaller model for the draft predictions, MTP uses the same model file. So if anything it should decrease degradation, since it is 'exploring' more of the same model.
Karyo_Ten@reddit
DFlash, and speculative decoding in general, should be lossless.
LeatherRub7248@reddit
40 t/s sounds very doable.
Could you share PP / TTFT times as well, please?
rm-rf-rm@reddit
PP/TTFT should stay unchanged, no?
Daniel_H212@reddit
How's the performance at longer context?
orionstein@reddit
Are there any other settings to note? Context limit? What quant are you using?
m94301@reddit (OP)
Hi,
I set the K and V caches to q8_0 so I could bump the context up to 200,000. For me it reasoned well, although the excessive thinking of this model should hide a lot of quantization warts.
I used am17an's GGUF; I believe it is Q4-based.
I am using MTP 3 just as in the example. I didn't try more or fewer guesses, but I will try that tomorrow.
Other than that, kind of stock settings. Batch 2048.
I did try mixed f16/q8 for the cache and that locked up, but that's a pretty narrow corner case and not a good idea with beta-stage stuff anyway.
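If anyone wants to point an editor or agent at it: llama-server exposes the standard OpenAI-compatible API, so any copilot-style client that lets you set a custom base URL will work. Quick sketch below; the port is whatever you configured, and the model name is mostly cosmetic since the server loads a single model.

```bash
# llama-server serves an OpenAI-compatible endpoint; the "model" value is
# mostly cosmetic here since the server loads a single model.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3.6-27b",
        "messages": [{"role": "user", "content": "Review this function for bugs."}]
      }'
```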