IK_LLAMA now supports Qwen3.5 MTP :O
Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 36 comments
https://github.com/ikawrakow/ik_llama.cpp/pull/1698
Compile, compile, compile!
Clean_Initial_9618@reddit
How does MTP help? Is it better than turboquant?
fragment_me@reddit (OP)
They are not comparable.
Clean_Initial_9618@reddit
Which is better?
fragment_me@reddit (OP)
They are not comparable, but I think MTP is more relevant than turboquant. Turboquant does NOT seem to be better than Q4_0.
Mountain_Patience231@reddit
llama.cpp is now a failing project, to be honest
JayPSec@reddit
It's not failing; they're spread across multiple fronts, though.
ik_llama benefits from this as well: wait till a good feature is merged to mainline and just import it. I wish that mainline did the same. I'd love to use ubergarm's quants with mainline's backend-agnostic tensor parallelism. Honestly, I'm grateful we have both, but it seems to me they'd both benefit more from cooperation.
Such_Advantage_6949@reddit
Using 3.6 27B at 6bpw with dflash on exllama v3, I get above 100 tok/s on average on an RTX 6000 Pro.
Glittering-Call8746@reddit
Does Exllama v3 work on the 5090?
Such_Advantage_6949@reddit
Yes
Glittering-Call8746@reddit
Do you have a repo documenting such builds?
Such_Advantage_6949@reddit
You can install exllama v3 using the instructions from the exllama repo. It's not GGUF, it's a different format, and it's meant to run exclusively in GPU VRAM, so there's no support for offloading, but the speed is fast.
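For reference, a minimal sketch of the from-source install route; the repo URL and steps here are assumptions based on the usual pip workflow, so follow the exllama repo's README if it differs:

```bash
# Build and install ExLlamaV3 from source (URL and steps assumed, see the README);
# a CUDA-enabled PyTorch install is expected to already be present
git clone https://github.com/turboderp-org/exllamav3
cd exllamav3
pip install .
```

From there, TabbyAPI is the usual way to serve EXL3 models behind an OpenAI-compatible API.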
Glittering-Call8746@reddit
https://youtu.be/Y4r-d2p_Evk?si=UaTba-uFn2wKE2AD - does this work?
Such_Advantage_6949@reddit
It should. Tabby is what most people are using. You can join the exllama Discord if you have further questions.
This_Maintenance_834@reddit
vLLM with the correct MTP setting will do 122 TPS (144 peak) with qwen3.6-27b-autoround on an RTX PRO 6000 as well.
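For anyone who wants to reproduce that, a rough sketch of enabling MTP speculative decoding in vLLM, assuming the JSON --speculative-config interface; the model name and the "method" string are assumptions, so check vLLM's speculative decoding docs for the exact values your version expects:

```bash
# Serve with MTP speculative decoding; the model path and "method" value are assumptions
vllm serve Qwen/Qwen3.6-27B \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
```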
FullstackSensei@reddit
Calling u/yoracale
Can we pretty please get some unsloth MTP GGUF quants of 3.6 27B?
computehungry@reddit
If you have 64GB+ RAM, you can just make it yourself. Unsloth publishes the imatrix. You can even copy their recipe to get identical weights (which might take you like 10 minutes of labor, or you could ask an LLM to generate the command).
FullstackSensei@reddit
I have enough VRAM to run 9 instances at Q8 in parallel and I would like to run their Q8_K_XL with MTP
computehungry@reddit
Yeah, I'm saying you can make the Q8_K_XL yourself if you can't wait for them (I don't think they did ik-compatible stuff before). They publish everything you need to do this; their magic is in how they generate those files, but they give the result to you.
Use this, which is already in your llama.cpp install: https://github.com/ggml-org/llama.cpp/tree/master/tools/quantize
Should take you 10 minutes of reading and setup, 10 minutes of copy-pasting, and 10 minutes of converting, and you'll learn how quantizing works in llama.cpp too. Otherwise, you can use vLLM if that's really your setup, though I don't understand why you'd want multiple instances wasting VRAM.
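As a rough sketch of that workflow with mainline llama.cpp's tools (file names are placeholders, and a built-in quant type stands in for Q8_K_XL, since Unsloth's per-tensor mix isn't a stock type):

```bash
# 1. Convert the HF checkpoint to a full-precision GGUF (paths/names are placeholders)
python convert_hf_to_gguf.py /path/to/Qwen3.6-27B --outtype f16 --outfile qwen3.6-27b-f16.gguf

# 2. Quantize, pointing at the published imatrix file
./llama-quantize --imatrix imatrix.dat qwen3.6-27b-f16.gguf qwen3.6-27b-Q4_K_M.gguf Q4_K_M
```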
FullstackSensei@reddit
I've quantized models myself before, so I know how to do that and don't need a lecture about learning.
Unsloth's dynamic quant 2.0 uses a proprietary calibration dataset to derive the iMatrix, which they haven't made public. So, no, I can't reproduce it myself. Otherwise, I have more than enough VRAM and cores to do it myself in significantly less than 10 minutes.
As for why I'd want multiple instances: 14 of my 18 GPUs can't run vLLM (P40 and Mi50), or at least not vanilla vLLM. I run 2 and sometimes 3 instances of the same model in parallel to work on separate tasks. Batching on those GPUs tends to yield very minimal gains because they don't have the compute to keep up.
computehungry@reddit
You can check their Qwen 3.6 27B GGUF repo on Hugging Face; they have the imatrix and it is public. I didn't verify theirs actually works, but I confirmed that another imatrix file for Qwen 3.6 27B, which did work, had the same file size.
KickLassChewGum@reddit
This... is not a great model. I've had it throw together a quick & simple HFTransformers-based inference script and it completely bungled it, hallucinated a bunch of non-existent config flags, wrote 250 lines of dead code, and added a comment that it was "tested & working."
Gemma 4 31B wrote 40 lines and nailed it.
butlan@reddit
With the 3090 + 3060 setup, I’m getting around 25 tokens/s for the Q8 model in the link, and I was already getting about 21 tokens/s with llama.cpp, so it didn’t really make much difference for me.
This_Maintenance_834@reddit
I don't know if there is something wrong with the llama.cpp or ik_llama.cpp implementation, but on vLLM the MTP prediction can more than double the token rate, with a much better prediction success rate at MTP=3.
Something does not feel right about this implementation. The token rate should not drop at MTP=2.
AdamDhahabi@reddit
30% tg speedup for 2x 5070 Ti, very nice.
Will need a Q6 MTP quant because I'm limited to 56K context with Q8.
my_name_isnt_clever@reddit
Why do we still have two competing forks of the same project actively used? I get that people have disagreements, but this split of a project that is already struggling to keep up is just inefficient and silly.
StupidScaredSquirrel@reddit
The owners of both branches are highly competent but refuse to work together. What's your suggestion? Do you have a solution that doesn't result in there being 3 main branches?
my_name_isnt_clever@reddit
Yeah: the community picking one, because drama like this is stupid.
I've seen it plenty of times; the one that comes to mind is NeoForge splitting off from Forge in Minecraft modding to get away from a dev who's difficult to work with. The whole Forge community moved over as well, because nobody wants another mod loader community split.
ArtfulGenie69@reddit
It's not exactly drama; it's more like llama.cpp is a very large tanker ship that needs time to turn, whereas ik is smaller and can do some of the newer, fancier things quicker. I have had issues with ik and continuous batching; things like that are usually fixed in llama.cpp because they take the slow, methodical route and have way more eyes on the project. ik can just try stuff, and that's why you want both.
my_name_isnt_clever@reddit
Good metaphor, and you're not wrong, but this split is absolutely due to drama, unfortunately. Still, that's a good way to look at it.
Marksta@reddit
Because there was an ever-so-slight disagreement and neither side cares in the slightest to accept any fault whatsoever and say sorry. It's a "Look what you made me do" situation, in that what started the argument no longer matters, because they can use the present to retroactively justify their past mistakes.
The long version, with some maintainers arguing and Georgi shutting down the most recent attempt at reconciliation, can be found in this comment thread, if you're interested.
StupidScaredSquirrel@reddit
Stupid question but does llama.cpp support this already?
fragment_me@reddit (OP)
I didn't see that this was copied from llama.cpp but I could be wrong. The person who made the PR only submitted it to ik_llama.cpp. With that being said, I don't see why they couldn't just copy it over with some tweaks.
stddealer@reddit
Whenever one branch is even suspected of having code copied over from the other branches, it creates a lot of drama. This is a really silly situation.
rerri@reddit
Nope
maxpayne07@reddit
That's normal flash attention; this new stuff is better for long context / RAG, 3x faster, in theory...
fragment_me@reddit (OP)
Some quick benchmarks:
prompt: write a 200 word story

Layer split:
no MTP: 18-20 tok/s
MTP 1: 30 tok/s
MTP 2: 16-19 tok/s

Graph split:
no MTP: 32-33 tok/s (noticeably higher GPU utilization than with MTP 1, 20-30% more, with higher power draw)
MTP 1: 34-35 tok/s
MTP 2: 21 tok/s