Why does GLM on llama.cpp have no MTP?
Posted by Expensive-Paint-9490@reddit | LocalLLaMA | 9 comments
I have searched through the repo discussions and PRs but can't find any references to this. GLM models ship with embedded layers for multi-token prediction (MTP) and speculative decoding. These layers can be used with vLLM, if you have hundreds of GB of VRAM, of course.
Does anybody know why llama.cpp chose not to support this feature?
9 Comments
coder543@reddit
coder543@reddit
Karyo_Ten@reddit
-dysangel-@reddit
DistanceSolar1449@reddit
reditzer@reddit
Time_Reaper@reddit
jacek2023@reddit
cantgetthistowork@reddit