KT-Kernel achieves up to >4.5x prefill and 30% faster decode compared to llama.cpp on the same hardware. Why?
Posted by LegacyRemaster@reddit | LocalLLaMA | 8 comments
https://preview.redd.it/9nmgbg6w6d9g1.png?width=957&format=png&auto=webp&s=9bdcf6353fe068da6eb694ed7fadfe45d86d6de4
From: [https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/kt-kernel/MiniMax-M2.1-Tutorial.md](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/kt-kernel/MiniMax-M2.1-Tutorial.md)
I was surprised by the size of the prefill performance gap. I had already noticed that Qwen Next 80 runs clearly faster on SGLang than on llama.cpp (and I know how much effort the team put into making Next run on llama.cpp), but I didn't expect a difference this large. Do you think this performance gap could be closed?
8 Comments
Chromix_@reddit
Funny_Plate6081@reddit
LegacyRemaster@reddit (OP)
easyrider99@reddit
Chromix_@reddit
a_beautiful_rhind@reddit
easyrider99@reddit
YouAreTheCornhole@reddit