MTP is nice and all, but what about PP speeds?
Posted by milpster@reddit | LocalLLaMA | 28 comments
I don't know about the rest of you, but with my setup, as soon as I enable MTP, PP performance and GPU usage drop significantly for some reason. For me it's not so much a memory issue as a performance one.
My setup is: 2x Radeon VII 16GB on ROCm, 1x RTX 3080 8GB Max-Q on Vulkan. Running Qwen 3.6 27B with KV at Q8. The Radeon VIIs are on 4x PCIe risers, so maybe it's a bus contention issue?
That said, I also tried going full Vulkan, but that makes it worse by a long shot.
Can anyone here explain why that is the case?
Shoddy_Bed3240@reddit
It’s about a 3x degradation for me, from 6000 to 2000 t/s. Unfortunately it has never been fixed since MTP was introduced.
Ok_Warning2146@reddit
Then don't use MTP. MTP isn't worth it for most use cases anyway.
andras_kiss@reddit
PP speed lowers as you get older. Nothing you can do about it.
TheRealMasonMac@reddit
PP speed is pretty fast with OnlyFans. Maybe it's a system issue?
jcdoe@reddit
When you get to my age, sometimes PP just comes to a halt.
That’s when the doc checks my PPU (prostate processing unit).
milpster@reddit (OP)
Older as in the session age? In my setup without MTP I don't have that. PP sometimes goes up and sometimes down a little, but there is no clear trend towards more slowness as CTX grows.
kaaninel@reddit
Yes pp sometimes goes up and sometimes down a little
Borkato@reddit
God I LOVE pps. Everyone dm me your pps, big or small, so I can inference all over it!
andras_kiss@reddit
it was sarcasm, sorry
AppealSame4367@reddit
You poor little lamb
Fickle-Box1433@reddit
Classic. OP asks a technical question and the top comment is already off the rails. Please, continue amusing me.
TheTerrasque@reddit
so you're saying you want more rail'ing?
andras_kiss@reddit
I'm sorry 😃.
jacek2023@reddit
It's very important to minimize prompt processing (the number of tokens to process). Make sure you use the latest llama.cpp and that you "preserve thinking"; this way my prompt processing is fast.
overand@reddit
preserve-thinking won't explicitly change PP tokens/sec, I assume, right? (Just that it decreases the amount that needs processing.)
cezarducatti@reddit
Friend, I'm a beginner. What would be your recommendation for Qwen 3.6 27B on an RTX 3090? I'm using MTP.
hurdurdur7@reddit
Prompt processing is slower with MTP than without it on both Vulkan and ROCm for me, but with the latest tune-ups token generation has gotten so much faster with MTP that I wouldn't run without it. Dual GPU setup, x8/x8 bifurcation from the CPU lanes.
Oddly enough, in my case Vulkan is faster than ROCm for prompt processing in -sm layer mode and a lot faster than ROCm in -sm tensor mode (in my setup -20% with -b 2048 and -ub 512 ... perhaps a bigger ub would improve it).
But when it comes to token generation, ROCm with -sm tensor is the fastest, and Vulkan/ROCm in layer mode are pretty much neck and neck.
So no clear winner here for me, but as I prefer fast prompt processing in many scenarios I currently sail with Vulkan. For some reason I also manage to fit more context cache on Vulkan.
asfbrz96@reddit
PP degrades over time
Sofakingwetoddead@reddit
Did you check to see if you're spilling over into system RAM after enabling MTP?
milpster@reddit (OP)
How would that happen? Normally I get a failed-alloc crash if any of the GPUs are full. Unified memory shouldn't be enabled.
libregrape@reddit
You didn't forget the `-ngld 99`, did you?
Sofakingwetoddead@reddit
Yeah, idk. Adding MTP consumes more VRAM. If you enabled it and performance suddenly dropped, then it could be some spillover. Something to look at. If you're on Windows you can pull up the system monitor and see...
exact_constraint@reddit
What version of llama.cpp are you running? There was a pretty profound PP regression early on, but that’s been largely addressed now, along w/ a few recently merged PRs that help knock down the VRAM usage w/ MTP. You are running a bit of an esoteric setup, however.
pepedombo@reddit
Measure the PP loss by pushing 20-30k ctx and watching PP speed. Before MTP and the MTP bugs I had 1500 tok/s PP without MTP; with MTP I'm getting ~900-1000 tok/s PP when larger ctx is processed, which is acceptable. I used to think it might be the riser cables, as only one GPU runs x16 and the rest are x1, but overall it's not bad.
Sometimes total time might be unfavorable for MTP with large context, but the gain from TG comes from shorter strokes, so it depends on what you do and whether the cache works.
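To make the total-time trade-off concrete, here is a rough sketch that treats a request as prompt time plus generation time and asks how many generated tokens it takes for MTP to win overall. The PP figures (1500 and roughly 950 tok/s) are the ones quoted above; the TG speeds of 20 and 35 tok/s are made-up placeholders, and the model ignores prompt caching entirely, so treat it as back-of-the-envelope only.

```python
def total_seconds(prompt_tokens, gen_tokens, pp_speed, tg_speed):
    """Rough wall-clock estimate: prompt processing time plus generation time."""
    return prompt_tokens / pp_speed + gen_tokens / tg_speed


def break_even_gen_tokens(prompt_tokens, pp_off, pp_on, tg_off, tg_on):
    """Generated tokens needed before the MTP run catches up with the non-MTP run.

    pp_off / tg_off are speeds without MTP, pp_on / tg_on with MTP (tokens/sec).
    Returns infinity if MTP does not speed up generation at all.
    """
    extra_prompt_time = prompt_tokens / pp_on - prompt_tokens / pp_off
    saved_per_token = 1.0 / tg_off - 1.0 / tg_on
    if saved_per_token <= 0:
        return float("inf")
    return max(0.0, extra_prompt_time / saved_per_token)


if __name__ == "__main__":
    # PP speeds from the comment above; TG speeds are hypothetical placeholders.
    n = break_even_gen_tokens(30_000, pp_off=1500, pp_on=950, tg_off=20, tg_on=35)
    print(f"MTP wins on total time after about {n:.0f} generated tokens")
```

With a cold 30k-token prompt and those assumed TG speeds, short replies favor running without MTP and long replies favor MTP. If most of the prompt is served from cache, the PP penalty shrinks and the break-even point moves down accordingly.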
isengardo@reddit
Increase --ubatch-size.
https://i.imgur.com/yLwgOXY.png
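If you want to check whether a bigger ubatch actually helps on your hardware, one option is to sweep it with llama-bench. The sketch below only builds the command lines; `model.gguf` is a placeholder path, the chosen sizes are arbitrary, and it does not pass any MTP option because nothing in this thread confirms how (or whether) llama-bench exposes one. It skips ubatch values larger than the batch size on the assumption that the micro-batch cannot usefully exceed the logical batch.

```python
import shlex


def ubatch_sweep_commands(model_path, prompt_tokens=4096, batch=2048,
                          ubatches=(256, 512, 1024, 2048),
                          binary="llama-bench"):
    """Build one llama-bench command per ubatch size (prompt processing only)."""
    commands = []
    for ub in ubatches:
        if ub > batch:
            # Assumption: a micro-batch larger than the batch is not meaningful.
            continue
        commands.append([
            binary,
            "-m", model_path,
            "-p", str(prompt_tokens),  # prompt tokens to process
            "-n", "0",                 # skip the token generation test
            "-b", str(batch),
            "-ub", str(ub),
        ])
    return commands


if __name__ == "__main__":
    # "model.gguf" is a placeholder; substitute your own model file.
    for cmd in ubatch_sweep_commands("model.gguf"):
        print(shlex.join(cmd))
```

Run the printed commands one at a time and compare the reported PP t/s. Larger ubatch values generally need more VRAM for compute buffers, so watch for allocation failures on the smaller card.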
WhatererBlah555@reddit
I also see a performance degradation in prompt processing with MTP enabled; still haven't investigated if the TG gains are enough to offset that or if in the end it is still better to not use MTP at all... if you did some testing I'll be happy to hear about that 😄
Schlick7@reddit
Pretty sure that with MTP the model behaves as if it were a little bigger, but when responding it can use the results from that tiny MTP to skip getting results from the main model, something like that. So yes, it will hurt PP performance. On the Vega chips, where PP is already awful, I think most people leave it off. For agentic stuff it might always be better to leave it off on all setups, but I'm not sure.
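A simplified way to see where the TG gain comes from: if each drafted token is accepted independently with some probability, the expected number of tokens produced per main-model pass follows the usual speculative-decoding estimate. This is a toy model, not how llama.cpp computes anything; the independence assumption and the example numbers are illustrative, and it says nothing about the PP cost discussed above.

```python
def expected_tokens_per_pass(accept_rate, draft_len):
    """Expected tokens emitted per main-model pass under a toy acceptance model.

    accept_rate: probability that any one drafted token is accepted (assumed
                 independent and identical for every position).
    draft_len:   number of tokens drafted ahead per pass.
    The pass always yields at least one token, plus one more for each drafted
    token accepted before the first rejection: 1 + a + a^2 + ... + a^k.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    return sum(accept_rate ** i for i in range(draft_len + 1))


if __name__ == "__main__":
    # Hypothetical numbers: 80% acceptance, 2 drafted tokens per pass.
    print(expected_tokens_per_pass(0.8, 2))
```

Under this model a low acceptance rate leaves you close to one token per pass while still paying for the extra weights, which is consistent with people finding MTP not worth it on some workloads.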
Charming-Author4877@reddit
PP speed is limited by GEMM compute; the only known methods to speed it up destroy quality.
Pflash and summarization methods can be used to make context smaller.