Move to backend sampling for MTP draft path by gaugarg-nv · Pull Request #23287 · ggml-org/llama.cpp

[-]

libregrape@reddit

I am recompiling llama cpp for third time today.

What a time to be alive!

[-]

unbannedfornothing@reddit

Compile time is only half of the pain, the other half is model loading (for me loading of 397b qwen with no-mmap and mlock takes like 20min+)

[-]

Bulky-Priority6824@reddit

be careful. just because something is added doesn't mean something else won't break. observed a serious regression in TG on the latest MTP cleanup build, 9235

[-]

Terrible-Detail-1364@reddit

same here. the initial ctx (e.g. 128k) used to fit, but now I have to make it smaller and quantize the draft cache to load the model, and it's about 10-15 t/s slower

[-]

Terrible-Detail-1364@reddit

ty, the initial merge is the one that works well: version 9180 (255582687). will try this.

[-]

I noticed this as well. I think it has something to do with draft p min defaulting to 0 and not being used in early builds, whereas now it is used, so if you have it set, that could be the issue... On top of that, I'm still noticing some slowdown compared to the original merge.
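For readers unfamiliar with the setting discussed above, here is a minimal sketch of how a minimum draft probability could act as a cutoff during speculative drafting. It assumes the threshold is compared against the draft model's top-token probability at each step. The function name `draft_tokens` and the toy probability tables are hypothetical and are not taken from llama.cpp.

```python
def draft_tokens(step_probs, p_min, max_draft):
    """Greedily draft tokens, stopping once confidence drops below p_min.

    step_probs: list of dicts mapping token -> probability, one dict per
                draft step (a hypothetical stand-in for draft-model output).
    p_min:      minimum probability the top token must have to be drafted.
    max_draft:  upper bound on the number of drafted tokens.
    """
    drafted = []
    for probs in step_probs[:max_draft]:
        token = max(probs, key=probs.get)  # greedy pick of the top token
        if probs[token] < p_min:
            break  # draft is not confident enough, stop drafting here
        drafted.append(token)
    return drafted


# Toy example: the second step is a low-confidence guess.
steps = [
    {"a": 0.9, "b": 0.1},
    {"c": 0.6, "d": 0.4},
    {"e": 0.95, "f": 0.05},
]
all_drafts = draft_tokens(steps, p_min=0.0, max_draft=3)
cautious_drafts = draft_tokens(steps, p_min=0.75, max_draft=3)
```

With a threshold of 0 nothing is ever cut off, so every step is drafted up to `max_draft`. A higher threshold drafts fewer tokens but wastes less verification work on unlikely guesses, which is one way a changed default could shift throughput.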

[-]

Bulky-Priority6824@reddit

yea im staying on b9202

[-]

Fabulous_Fact_606@reddit

i've been going back and forth on llama and vllm... llama 27B UD-Q8_K_XL at 20-30 t/s

or vllm Qwen3.6-27B INT8 AutoRound at 50-70 t/s on 3090x2. I need precise math and coding. vllm is winning...

[-]

ArtfulGenie69@reddit

It would win on a 3090. We don't have nice things like int8 in gguf land. I've got the same setup to try. Vllm is gonna be so much better. How does your pp speed compare? Is it twice as fast, just like tg in vllm?

[-]

Fabulous_Fact_606@reddit

| backend | cold prefill (pp), rough estimate | decode (tg) | max context |
| --- | --- | --- | --- |
| INT8 AutoRound (vLLM) | ~1520 tok/s @ 20K | ~70-80 t/s | ~130K (still evaluating, looks good so far) |
| GGUF Q8_K_XL (llama.cpp, MTP) | ~943 tok/s @ 64K | ~40 t/s | 200K (best) |
| INT4 AWQ-BF16 (vLLM+FlashInfer) | ~2303 tok/s | 139 t/s peak | 128K (too much buggy code) |

[-]

philmarcracken@reddit

INT4 AWQ

'I'm doing millions of calculations a second and they're all wrong!'

[-]

EbbNorth7735@reddit

Do i need a special build of vLLM? Currently on Windows

[-]

lemondrops9@reddit

I've been debating on going vllm for coding as well. What is your PCIe bus speed for the 2nd card?

[-]

Fabulous_Fact_606@reddit

x8/x8. on a x870E chipset.

[-]

No_Lingonberry1201@reddit

Price of living on the edge 😎

[-]

ML-Future@reddit

Only three?

[-]

Valuable_Touch5670@reddit

I think the rapid development + the vibrancy of its developer community really beats the crap out of other inferencing engines. THIS is a prime example.

[-]

Anbeeld@reddit

...except llama.cpp was behind other engines by like a month, as they had MTP for quite some time already?

[-]

jacek2023@reddit (OP)

I still don't have the skills to run vllm with better performance than llama.cpp on my setup (3x3090). Could you give me some tips on how to run Gemma or qwen with 200000 context?

[-]

ohhi23021@reddit

i just use club-3090 with a few adjustments. DFlash is faster for coding than MTP. last i tried 2 days ago the llama.cpp crashed with mtp + tensor parallel.

[-]

jacek2023@reddit (OP)

what is your context length?

[-]

New_Comfortable7240@reddit

Well, the difference is that llama.cpp is more careful about long-term stability. The idea is that once a feature is implemented, it should be more stable than in other projects. Also, llama.cpp has wider hardware support: for example, my P40s are not supported in other projects. For a project that big, with that much reach, it's normal to take time adding features.

[-]

Anbeeld@reddit

All of this might be true, but the original claim was "rapid development ... beats the crap out of other inferencing engines". Meanwhile I was trying out Qwen 3.6 27B MTP literally a month ago with vLLM, and I'd guess they advanced their implementation quite a bit since then too.

Besides, folks here currently rebuild their llama.cpp 3 times per day to get the latest fixes, so it's not like they shipped MTP in a finished "long term stability" form.

[-]

LetsGoBrandon4256@reddit

llama.cpp bros would rather wait months for a buggy new feature than admit their precious inferencing engine is falling behind the forks.

Now ask them if they have TurboQuants yet lmao.

[-]

Anbeeld@reddit

Obviously no one ever in the history of inference might need cache quants below 4 bit, so why would they need it? Proceeds to quote what GG wrote in some random PR like it's a fucking Bible

[-]

Mount_Gamer@reddit

Has context got better with 16GB vram cards?

I can see the speedup but the context means dropping to really low quants.

For instance, the 27B qwen 3.6, I seem to only get 50k at a Q2 Quant... Of course this could be user error, but I did follow the flags recommended by unsloth. I think at Q3 ~ 12.8k ctx.
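To reason about why context shrinks so quickly on a 16GB card, a rough back-of-the-envelope sketch of KV cache size may help. It assumes a plain transformer where every layer keeps full-attention keys and values. The layer count, head count, and head size below are hypothetical placeholders, not the real dimensions of any Qwen model, and models with hybrid or linear-attention layers would need less than this estimate.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2.0):
    """Estimate KV cache size: keys + values for every layer and position.

    bytes_per_elem is 2.0 for f16; a quantized cache uses a smaller value.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def max_context(free_bytes, n_layers, n_kv_heads, head_dim, bytes_per_elem=2.0):
    """Largest context whose KV cache fits in the VRAM left after weights."""
    per_token = kv_cache_bytes(1, n_layers, n_kv_heads, head_dim, bytes_per_elem)
    return int(free_bytes // per_token)


# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
GIB = 1024 ** 3
cache_1k = kv_cache_bytes(1024, n_layers=32, n_kv_heads=8, head_dim=128)
ctx_f16 = max_context(2 * GIB, n_layers=32, n_kv_heads=8, head_dim=128)
ctx_q8 = max_context(2 * GIB, n_layers=32, n_kv_heads=8, head_dim=128,
                     bytes_per_elem=1.0)
```

The point of the sketch is the trade-off: whatever VRAM the weights do not use is all that remains for the cache, so a larger weight quant leaves less room for context, and halving the bytes per cache element roughly doubles the context that fits.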

[-]

ea_man@reddit

OMG do I have to run benchmarks again to re-optimize settings?

:D

[-]