Move to backend sampling for MTP draft path by gaugarg-nv · Pull Request #23287 · ggml-org/llama.cpp
Posted by jacek2023@reddit | LocalLLaMA | 36 comments
faster mtp
libregrape@reddit
I am recompiling llama.cpp for the third time today.
What a time to be alive!
unbannedfornothing@reddit
Compile time is only half of the pain, the other half is model loading (for me loading of 397b qwen with no-mmap and mlock takes like 20min+)
segmond@reddit
buy a good NVME drive.
Bulky-Priority6824@reddit
be careful. just because something is added doesn't mean something else won't break. observed a serious regression in TG on the latest MTP cleanup build 9235
Terrible-Detail-1364@reddit
same here, initial ctx would fit (e.g. 128k) but now I have to make it smaller and quantize the draft cache to load the model, and it's about 10-15 t/s slower
Bulky-Priority6824@reddit
Grab b9253
Terrible-Detail-1364@reddit
ty, the initial merge is the one working well, version: 9180 (255582687). will try this.
anthonyg45157@reddit
I noticed this as well. I think it has something to do with draft p min: it defaulted to 0 and wasn't used in early builds, but now it is, so if you have that set it could be the issue... I'm still noticing some slowdown compared to the original merge on top of that, it seems.
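For readers unsure why a draft p-min setting could change speed, here is a minimal sketch of the general idea: the drafter keeps proposing tokens until its confidence falls below the threshold. This is an illustration only, not llama.cpp's actual code; `draft_tokens`, `toy_propose`, and the probabilities are all made up for the example.

```python
from typing import Callable, List, Tuple

def draft_tokens(
    propose: Callable[[List[int]], Tuple[int, float]],
    context: List[int],
    n_max: int,
    p_min: float,
) -> List[int]:
    """Greedily draft up to n_max tokens, stopping once confidence drops below p_min."""
    drafted: List[int] = []
    for _ in range(n_max):
        token, prob = propose(context + drafted)
        if prob < p_min:
            break  # drafter is not confident enough, hand over to the main model
        drafted.append(token)
    return drafted

def toy_propose(tokens: List[int]) -> Tuple[int, float]:
    # Hypothetical drafter whose confidence decays with each extra token.
    probs = [0.95, 0.80, 0.60, 0.30]
    i = len(tokens)
    return 100 + i, probs[min(i, len(probs) - 1)]

print(draft_tokens(toy_propose, [], n_max=4, p_min=0.0))   # [100, 101, 102, 103]
print(draft_tokens(toy_propose, [], n_max=4, p_min=0.75))  # [100, 101]
```

With a threshold of 0 every draft is kept, so the main model verifies more (possibly wrong) tokens per step; a higher threshold gives shorter drafts. Which one is faster depends on how often the drafts are accepted.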
Bulky-Priority6824@reddit
yea im staying on b9202
Fabulous_Fact_606@reddit
i've been going back and forth on llama and vllm... llama 27B UD-Q8_K_XL at 20-30t/s
or vllm Qwen3.6-27B INT8 AutoRound at 50-70 t/s on 3090x2. I need precise math and coding. vllm is winning...
ArtfulGenie69@reddit
It would win on a 3090. Don't have nice things like int8 in gguf land. I've got the same setup to try. Vllm is gonna be so much better. How does your pp speed compare? Like twice as fast, just like tg in vllm?
Fabulous_Fact_606@reddit
philmarcracken@reddit
'I'm doing millions of calculations a second and they're all wrong!'
EbbNorth7735@reddit
Do i need a special build of vLLM? Currently on Windows
lemondrops9@reddit
I've been debating on going vllm for coding as well. What is your PCIe bus speed for the 2nd card?
Fabulous_Fact_606@reddit
x8/x8, on an X870E chipset.
No_Lingonberry1201@reddit
Price of living on the edge 😎
ML-Future@reddit
Only three?
Valuable_Touch5670@reddit
I think the rapid development + the vibrancy of its developer community really beats the crap out of other inferencing engines. THIS is a prime example.
Anbeeld@reddit
...except llama.cpp was behind other engines by like a month, as they had MTP for quite some time already?
jacek2023@reddit (OP)
I still don't have the skills to run vllm with better performance than llama.cpp on my setup (3x3090). Could you give me some tips on how to run Gemma or qwen with 200000 context?
ohhi23021@reddit
i just use club-3090 with a few adjustments. DFlash is faster for coding than MTP. last i tried, 2 days ago, llama.cpp crashed with mtp + tensor parallel.
jacek2023@reddit (OP)
what is your context length?
New_Comfortable7240@reddit
Well, the difference is that in llama.cpp they are more careful about long-term stability. So the idea is that, once implemented, this feature would be more stable than in other projects. Also, llama.cpp has wider support; for example, my P40s are not supported in other projects. So for a project that big, with so much reach in support, it's normal to take their time adding features.
Anbeeld@reddit
All of this might be true, but the original claim was "rapid development ... beats the crap out of other inferencing engines". Meanwhile I was trying out Qwen 3.6 27B MTP literally a month ago with vLLM, and I'd guess they advanced their implementation quite a bit since then too.
Besides, currently folks here rebuild their llama.cpp 3 times per day to get the latest fixes, so it's not like they shipped MTP in a finished "long term stability" form.
LetsGoBrandon4256@reddit
llama.cpp bros would rather wait months for a buggy new feature than admit their precious inferencing engine is falling behind the forks.
Now ask them if they have TurboQuants yet lmao.
Anbeeld@reddit
Obviously no one ever in the history of inference might need cache quants below 4 bit, so why would they need it? Proceeds to quote what GG wrote in some random PR like it's a fucking Bible
Mount_Gamer@reddit
Has context got better with 16GB vram cards?
I can see the speedup but the context means dropping to really low quants.
For instance, the 27B qwen 3.6, I seem to only get 50k at a Q2 Quant... Of course this could be user error, but I did follow the flags recommended by unsloth. I think at Q3 ~ 12.8k ctx.
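A rough way to reason about the quant-versus-context trade-off is to budget VRAM as weights + fixed overhead + KV cache. The sketch below uses the standard KV-cache size formula for plain attention layers; every number in it (layer count, head sizes, weight size, overhead) is a hypothetical placeholder, not the real Qwen configuration, and models with non-standard attention layers, a draft/MTP cache, or larger compute buffers will come out differently.

```python
# Approximate bytes per cached element for common cache types.
# q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, cache_type: str) -> float:
    # K and V each hold n_kv_heads * head_dim values per layer per token.
    return 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_ELEM[cache_type]

def max_context(vram_gib: float, weights_gib: float, overhead_gib: float, per_token_bytes: float) -> int:
    # Whatever VRAM is left after weights and overhead goes to the KV cache.
    free_bytes = (vram_gib - weights_gib - overhead_gib) * 1024**3
    return max(0, int(free_bytes // per_token_bytes))

# Hypothetical model: 48 attention layers, 4 KV heads, head_dim 128.
per_tok_f16 = kv_bytes_per_token(48, 4, 128, "f16")
per_tok_q8 = kv_bytes_per_token(48, 4, 128, "q8_0")

# Hypothetical 16 GiB card, 10 GiB of weights, 1.5 GiB of buffers and overhead.
print(max_context(16, 10, 1.5, per_tok_f16))  # 49152
print(max_context(16, 10, 1.5, per_tok_q8))   # more tokens, since each one costs fewer bytes
```

The point of the arithmetic: every GiB taken by a larger weight quant comes straight out of the context budget, and quantizing the cache stretches what is left.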
ea_man@reddit
OMG do I have to run benchmarks again to re-optimize settings?
:D
bonobomaster@reddit
Oh come on, just admit you love it! ;D
czktcx@reddit
backend sampling will increase compute buffer usage (main model and mtp)...
cleversmoke@reddit
Another 6-7% performance boost?? I shall rebuild. Thank you!
cleversmoke@reddit
Just tested, got a ~5-6% performance boost on my RTX 3090 24G. Averaging 22 mins on an 85k context process, vs 23 mins prior. Thanks!
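As a quick sanity check on the numbers above, here is the arithmetic for turning two wall-clock times into a percentage. The helper names are made up for the example, and the inputs are the rounded minute values from the comment, so the result is only approximate.

```python
def time_saved_pct(before_min: float, after_min: float) -> float:
    # Share of the original run time that was saved.
    return (before_min - after_min) / before_min * 100

def throughput_gain_pct(before_min: float, after_min: float) -> float:
    # Same work done in less time, expressed as a speed increase.
    return (before_min / after_min - 1) * 100

print(round(time_saved_pct(23, 22), 1))       # 4.3
print(round(throughput_gain_pct(23, 22), 1))  # 4.5
```

With whole-minute rounding on both times, roughly 4-5% is in the same ballpark as the quoted boost.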
yami_no_ko@reddit
With all those frequent changes in the mtp-flags of llama-server I went over to generally load its entire help page into an LLM context just to generate a valid startup command. :D
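The workflow described above could be scripted roughly like this. It is a sketch under assumptions: it presumes a `llama-server` binary on the PATH that prints its help with `--help`, and the function names and prompt wording are made up for the example. Sending the prompt to a model is left out.

```python
import subprocess

def get_help_text(binary: str = "llama-server") -> str:
    # Capture the help page of whatever build is currently installed.
    result = subprocess.run([binary, "--help"], capture_output=True, text=True, check=False)
    return result.stdout or result.stderr

def build_prompt(help_text: str, goal: str) -> str:
    # Ground the model in the real help output so it does not invent flags.
    return (
        "Below is the current --help output of llama-server.\n"
        "Use only flags that appear in it.\n\n"
        f"{help_text.strip()}\n\n"
        f"Task: write one startup command that does the following: {goal}\n"
    )

# Placeholder text stands in for real help output here.
demo = build_prompt("(help output goes here)", "load a model with MTP drafting enabled")
print(demo)
```

In real use you would call `build_prompt(get_help_text(), "...")` after each rebuild, so the prompt always reflects the flags of the build you actually have.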
Sisuuu@reddit
What does this mean in practice?
iportnov@reddit
That gives a ridiculous 200 t/s for Qwen 35B A3B on a mobile 5090.
The only thing, with MTP enabled I'm getting CUDA timeouts from time to time :/ Probably for this card the load becomes too high...