Anyone else running one of the pre-release branches of MTP support to maintain the higher speeds?
Posted by Creative-Type9411@reddit | LocalLLaMA | 10 comments
I can't help myself, it's ~20% faster for me. I took the highest-speed branch (for me), added the vision fix, and am just riding it out for now.
I tried the release today, and during some light coding llama.cpp crashed and the model restarted. I personally didn't experience any crashes on the pre-release versions, so I jumped back to them.
Just curious what everyone else is doing, and whether there are any major downsides to the early builds that anyone is aware of.
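For reference, this is roughly the build flow I mean, sketched in Python around git/cmake. The PR number and fix commit below are placeholders, not the exact ones I used (actual links are further down in the comments):

```python
# Sketch of the "pin a PR branch + cherry-pick a fix" build flow.
# PULL_ID and FIX_COMMIT are placeholders -- substitute the actual
# MTP PR and vision-fix commit you want.
import subprocess

REPO = "https://github.com/ggml-org/llama.cpp"
PULL_ID = "00000"       # placeholder: the MTP PR you're tracking
FIX_COMMIT = "abc1234"  # placeholder: the vision-fix commit to cherry-pick

def run(*args, cwd=None):
    print("+", " ".join(args))
    subprocess.run(args, cwd=cwd, check=True)

run("git", "clone", REPO, "llama.cpp")
# GitHub exposes every PR's head as a fetchable ref:
run("git", "fetch", "origin", f"pull/{PULL_ID}/head:mtp-branch", cwd="llama.cpp")
run("git", "checkout", "mtp-branch", cwd="llama.cpp")
run("git", "cherry-pick", FIX_COMMIT, cwd="llama.cpp")
# Standard llama.cpp CMake build:
run("cmake", "-B", "build", "-DCMAKE_BUILD_TYPE=Release", cwd="llama.cpp")
run("cmake", "--build", "build", "--config", "Release", "-j", cwd="llama.cpp")
```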
DistanceAlert5706@reddit
I didn't test it before; tried the release today, and after ~10 minutes llama.cpp crashed. Went back to no MTP.
Edenar@reddit
I'm using the PR from 1.5 weeks ago; my Strix Halo ran 3.6 35B Q8_K_XL with MTP the whole week: no llama.cpp crashes. I guess I'll stick with it for now!
asyncsec@reddit
Any chance you can link the exact PR you're using?
Edenar@reddit
https://github.com/ggml-org/llama.cpp/pull/22673
I rebuilt my container image on May 5th. I didn't keep logs, so I don't know what the exact PR state was. So I uploaded the full Podman image I built then, plus the Dockerfile: https://github.com/CaGimenez/strix-halo-MTP (won't keep that up more than a few days)
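In hindsight, stamping the build with the exact commit would have avoided that. A minimal sketch of what I mean, in Python; the paths are hypothetical, not what's in my actual Dockerfile:

```python
# Sketch: record the exact llama.cpp commit at build time so a later
# rebuild can be pinned to the same PR state. Paths are hypothetical.
import json
import subprocess
from datetime import datetime, timezone

def git_output(*args, cwd="llama.cpp"):
    return subprocess.check_output(["git", *args], cwd=cwd, text=True).strip()

build_info = {
    "commit": git_output("rev-parse", "HEAD"),
    "branch": git_output("rev-parse", "--abbrev-ref", "HEAD"),
    "built_at": datetime.now(timezone.utc).isoformat(),
}

# Bake this file into the image; `git checkout <commit>` restores the state.
with open("llama.cpp/BUILD_INFO.json", "w") as f:
    json.dump(build_info, f, indent=2)

print(build_info)
```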
Creative-Type9411@reddit (OP)
I put the one I have in this post https://www.reddit.com/r/LocalLLaMA/s/eWV3llY4nB
Acceptable_Push_2099@reddit
Does it cause a prefill speedup too, or no?
Creative-Type9411@reddit (OP)
I was squeezing 10 t/s out of this model with CPU and RAM only. I found the MTP post about three hours after I dropped my T4 in, and I was still tuning it, but I think I've more than doubled my prompt processing with this setup compared to non-MTP.
It's super finicky: if I -fit just a little bit too much or too little, it drops about 10 tokens per second. I would say that setting affects performance more than anything else other than -t. You have to find what's right for your setup, but once you get it dialed in to within 512 MB, you will notice.
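If you want to take some of the guesswork out of the dialing-in, a sweep helps. A rough sketch, assuming a llama-bench binary under ./build/bin; the model path and thread values are placeholders for your own setup, and the same loop works for whatever flag you're tuning:

```python
# Rough sketch of a -t sweep with llama-bench. MODEL and THREADS are
# placeholders; swap in the flag you actually want to sweep.
import json
import subprocess

MODEL = "model.gguf"          # placeholder: your GGUF path
THREADS = [4, 6, 8, 12, 16]   # placeholder: values worth trying on your CPU

for t in THREADS:
    out = subprocess.check_output([
        "./build/bin/llama-bench",
        "-m", MODEL,
        "-t", str(t),
        "-p", "512",   # prompt-processing test length
        "-n", "128",   # token-generation test length
        "-o", "json",
    ], text=True)
    for row in json.loads(out):
        # field names follow llama-bench's JSON output; avg_ts is tokens/sec
        print(f"-t {t}: n_prompt={row.get('n_prompt')} "
              f"n_gen={row.get('n_gen')} avg_ts={row.get('avg_ts')}")
```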
Bulky-Priority6824@reddit
Not sure what's going on yet, and I'll look into it soon enough, but in initial testing I had to drop ctx by nearly 45% to fit the MTP version of similar models; I took a loss in pp of up to 50% and gained about the same in tk/s.
AdamDhahabi@reddit
I also had the impression that Aman Gupta's work up until 10 days ago (the fork is now renamed to mtp-clean-old) was faster in token generation. I assume the later commits were for better prompt processing. Not sure.
phein4242@reddit
I compiled & tuned the one linked by unsloth, and that's doing a solid 55 tp/sec (20 tp/sec w/o MTP).