Qwen 3.6 27B MTP on v100 32GB: 54 t/s
Posted by m94301@reddit | LocalLLaMA | View on Reddit | 15 comments
Just a quick note that I got a nice result using am17an's MTP branch of llama.cpp on a V100 32GB SXM module, using one of those PCIe card adapters. It pulled and built in one shot, and llama-server ran without a hitch.
Tested using am17an's MTP GGUF, with a q8_0 KV cache and a 200k context limit, acting as a VS Code copilot.
29-30 t/s without MTP
54-55 t/s with MTP, with a 150 W power limit on the card.
It falls to 40-45 t/s after choking down 50k tokens, but it does great with tool calls and subagents, and it has made some very insightful code reviews and refactors.
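For anyone who wants to try it, the invocation is roughly the shape below. This is a rough sketch rather than my exact command: mainline llama-server flag names are assumed, the model filename is a placeholder, and the branch's own flag for the MTP draft count (3) isn't shown since it isn't upstream yet.

```bash
# optional: cap the card at 150 W (needs root; -i picks the GPU index)
sudo nvidia-smi -i 0 -pl 150

# rough server invocation -- mainline llama-server flag names assumed,
# model filename is a placeholder, and the MTP branch's own flag for the
# number of draft tokens is omitted since it's not upstream yet.
./llama-server \
  -m Qwen3.6-27B-MTP.gguf \
  -ngl 99 \
  --ctx-size 200000 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --batch-size 2048 \
  --host 127.0.0.1 \
  --port 8080
```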
Thank you am17an! Can't wait to see this branch mature, this is great stuff.
m94301@reddit (OP)
Just for reference, I get 105-110 t/s on the 35B MoE with the same basic setup (MTP 3) and an identical card.
I do like the MoE, but it is not as good at coding, and it trapped itself building async calls, bouncing back and forth in an endless loop. So mostly I use the 27B for code and the 35B for quick reviews or junior-level patches; it is fine at those, and very quick.
burger4d@reddit
When the MTP features get merged into the main branch, will any flags need to be added to the llama.cpp command to turn it on?
ArtfulGenie69@reddit
More than flags: odds are the GGUF you are using has the MTP layers stripped out. That means you'll need a model with them intact, or maybe they will come up with a solution for Qwen where you can side-load them. You probably want to leave them at full BF16 or similar, though.
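If you want to check what a given GGUF actually kept, dumping the tensor names is the quickest way. Rough sketch only: gguf-dump comes from llama.cpp's gguf-py package, and the "mtp"/"nextn" patterns are my guesses at how the branch might name those tensors.

```bash
# list the tensors in the GGUF and look for MTP / next-token-prediction layers.
# gguf-dump ships with llama.cpp's gguf-py package (pip install gguf); the
# "mtp" / "nextn" patterns are guesses -- eyeball the full list if grep is empty.
gguf-dump Qwen3.6-27B-MTP.gguf | grep -iE 'mtp|nextn'
```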
rog-uk@reddit
Correct me if I am wrong, but wouldn't one expect Qwen3-Coder-Next 80B-A3B to run about as quickly? Did you use TurboQuant, and if so, was it useful to you (or did it make things worse)? Thanks for any reply :-)
QuackerEnte@reddit
Quick question: if one has low VRAM and the DENSE model spills into RAM, does MTP even speed anything up? Or would it rather slow things down here, since it still needs to verify a batch of 4 tokens using the WHOLE model anyway? I never really got the intuition for it. Speculative decoding is more or less the same, no?
ixdx@reddit
I ran several tests and I see a noticeable drop in prompt processing (pp) with MTP.
RTX 5070 Ti + RTX 5060 Ti
I created the GGUF using an imatrix from Bartowski (rough command sketch below).
For a long-context test, I asked it to analyze approximately 10,000 lines from /var/log/syslog.
Tested GGUFs: bartowski/Qwen3.6-27B-Q4_K_L and Qwen3.6-27B-Q4_K_L-MTP.
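For reference, the quant step is just llama-quantize pointed at the imatrix. Rough sketch with placeholder filenames; Q4_K_M is shown, and as I understand it the "_L" variants additionally hold the output and embedding tensors at q8_0.

```bash
# placeholder filenames; the imatrix file is the one Bartowski publishes.
# Q4_K_M shown -- for an "_L"-style quant you would also pass
# --output-tensor-type q8_0 --token-embedding-type q8_0 (as I understand it).
./llama-quantize --imatrix imatrix.dat \
  Qwen3.6-27B-F16.gguf Qwen3.6-27B-Q4_K_M.gguf Q4_K_M
```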
NoBee9598@reddit
Which HF repo do you use for this speed?
Enough_Big4191@reddit
54 t/s on a V100 still feels kind of absurd, honestly. The bigger thing for me is whether these setups stay reliable once agents start chaining tool calls and chewing through messy context windows for hours. Have you noticed any weird degradation in output quality at longer contexts, or mostly just the throughput drop?
tomByrer@reddit
Unlike DFlash, where you use a smaller model for the draft predictions, MTP uses the same model file. So if anything it should decrease degradation, since it is 'exploring' more of the same model.
Karyo_Ten@reddit
DFlash, and speculative decoding in general, should be lossless.
LeatherRub7248@reddit
40 t/s sounds very doable.
Could you share PP / TTFT times as well, please?
rm-rf-rm@reddit
PP/TTFT should stay unchanged, no?
Daniel_H212@reddit
How's the performance at longer context?
orionstein@reddit
Are there any other settings to note? Context limit? What quant are you using?
m94301@reddit (OP)
Hi,
I set the K and V caches to q8_0 so I could bump the context up to 200,000. For me it reasoned well, although the excessive thinking of this model should hide a lot of quantization warts.
I used am17an's GGUF; I believe it is Q4-based.
I am using MTP 3 just as in the example. I didn't try more or fewer guesses, but I will try that tomorrow.
Other than that, kind of stock settings. Batch 2048.
I did try mixed f16/q8 for the cache and that locked up, but that's a pretty narrow corner case and not a good idea with beta-stage stuff anyway.
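If anyone wants to point an editor or agent at it: llama-server exposes the standard OpenAI-compatible API, so any copilot-style client that lets you set a custom base URL will work. Quick sketch below; the port is whatever you configured, and the model name is mostly cosmetic since the server loads a single model.

```bash
# llama-server serves an OpenAI-compatible endpoint; the "model" value is
# mostly cosmetic here since the server loads a single model.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3.6-27b",
        "messages": [{"role": "user", "content": "Review this function for bugs."}]
      }'
```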