Anyone else running one of the pre-release branches of MTP support to maintain the higher speeds?
Posted by Creative-Type9411@reddit | LocalLLaMA | 10 comments
I can't help myself, it's ~20% faster for me. I took the highest-speed branch (for me), added the vision fix, and am just riding it out for now.
I tried the release today, and during some light coding llama.cpp crashed and the model restarted. I personally didn't experience any crashes on the pre-release versions, so I jumped back to them.
Just curious what everyone else is doing, and whether there are any major downsides to the early builds that anyone is aware of.
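For reference, this is roughly the build flow I mean, sketched in Python around git/cmake. The PR number and fix commit below are placeholders, not the exact ones I used (actual links are further down in the comments):

```python
# Sketch of the "pin a PR branch + cherry-pick a fix" build flow.
# PULL_ID and FIX_COMMIT are placeholders -- substitute the actual
# MTP PR and vision-fix commit you want.
import subprocess

REPO = "https://github.com/ggml-org/llama.cpp"
PULL_ID = "00000"       # placeholder: the MTP PR you're tracking
FIX_COMMIT = "abc1234"  # placeholder: the vision-fix commit to cherry-pick

def run(*args, cwd=None):
    print("+", " ".join(args))
    subprocess.run(args, cwd=cwd, check=True)

run("git", "clone", REPO, "llama.cpp")
# GitHub exposes every PR's head as a fetchable ref:
run("git", "fetch", "origin", f"pull/{PULL_ID}/head:mtp-branch", cwd="llama.cpp")
run("git", "checkout", "mtp-branch", cwd="llama.cpp")
run("git", "cherry-pick", FIX_COMMIT, cwd="llama.cpp")
# Standard llama.cpp CMake build:
run("cmake", "-B", "build", "-DCMAKE_BUILD_TYPE=Release", cwd="llama.cpp")
run("cmake", "--build", "build", "--config", "Release", "-j", cwd="llama.cpp")
```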
DistanceAlert5706@reddit
I didn't test it before; tried the release today, and after ~10 minutes llama.cpp crashed. Went back to no MTP.
Edenar@reddit
I'm using the PR from 1.5 weeks ago; my Strix Halo ran 3.6 35B Q8_K_XL with MTP the whole week: no llama.cpp crashes. I guess I'll stick with it for now!
asyncsec@reddit
Any chance you can link the exact PR you're using?
Edenar@reddit
https://github.com/ggml-org/llama.cpp/pull/22673
I rebuilt my container image on May 5th. I didn't keep logs, so I don't know what the exact PR state was. So I uploaded the full Podman image I built then, plus the Dockerfile: https://github.com/CaGimenez/strix-halo-MTP (won't keep that up more than a few days)
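In hindsight, stamping the build with the exact commit would have avoided that. A minimal sketch of what I mean, in Python; the paths are hypothetical, not what's in my actual Dockerfile:

```python
# Sketch: record the exact llama.cpp commit at build time so a later
# rebuild can be pinned to the same PR state. Paths are hypothetical.
import json
import subprocess
from datetime import datetime, timezone

def git_output(*args, cwd="llama.cpp"):
    return subprocess.check_output(["git", *args], cwd=cwd, text=True).strip()

build_info = {
    "commit": git_output("rev-parse", "HEAD"),
    "branch": git_output("rev-parse", "--abbrev-ref", "HEAD"),
    "built_at": datetime.now(timezone.utc).isoformat(),
}

# Bake this file into the image; `git checkout <commit>` restores the state.
with open("llama.cpp/BUILD_INFO.json", "w") as f:
    json.dump(build_info, f, indent=2)

print(build_info)
```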
Creative-Type9411@reddit (OP)
I put the one I have in this post https://www.reddit.com/r/LocalLLaMA/s/eWV3llY4nB
Acceptable_Push_2099@reddit
Does it cause a prefill speedup too, or no?
Creative-Type9411@reddit (OP)
I was squeezing 10 t/s out of this model with CPU and RAM only. I found the MTP post about three hours after I dropped my T4 in, and I was still tuning it, but I think I've more than doubled my prompt processing with this setup compared to non-MTP.
It's super finicky: if I -fit just a little bit too much or too little, it drops about 10 tokens per second. I would say that setting affects performance more than anything else other than -t. You have to find what's right for your setup, but once you get it dialed in to within 512 MB, you will notice.
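If you want to take some of the guesswork out of the dialing-in, a sweep helps. A rough sketch, assuming a llama-bench binary under ./build/bin; the model path and thread values are placeholders for your own setup, and the same loop works for whatever flag you're tuning:

```python
# Rough sketch of a -t sweep with llama-bench. MODEL and THREADS are
# placeholders; swap in the flag you actually want to sweep.
import json
import subprocess

MODEL = "model.gguf"          # placeholder: your GGUF path
THREADS = [4, 6, 8, 12, 16]   # placeholder: values worth trying on your CPU

for t in THREADS:
    out = subprocess.check_output([
        "./build/bin/llama-bench",
        "-m", MODEL,
        "-t", str(t),
        "-p", "512",   # prompt-processing test length
        "-n", "128",   # token-generation test length
        "-o", "json",
    ], text=True)
    for row in json.loads(out):
        # field names follow llama-bench's JSON output; avg_ts is tokens/sec
        print(f"-t {t}: n_prompt={row.get('n_prompt')} "
              f"n_gen={row.get('n_gen')} avg_ts={row.get('avg_ts')}")
```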
Bulky-Priority6824@reddit
Not sure what's going on yet, and I'll look into it soon enough, but in initial testing I had to drop ctx by nearly 45% to fit the MTP version of similar models; I took a loss in pp of up to 50% and gained about the same in tk/s.
AdamDhahabi@reddit
I also had the impression that Aman Gupta's work up until 10 days ago (the fork is now renamed to mtp-clean-old) was faster in token generation. I assume the later commits were for better prompt processing. Not sure.
phein4242@reddit
I compiled & tuned the one linked by unsloth, and that's doing a solid 55 tp/sec (20 tp/sec w/o MTP).