KT-Kernel achieves more than 4.5x faster prefill and 30% faster decode compared to llama.cpp on the same hardware. Why?

Posted by LegacyRemaster@reddit | LocalLLaMA | View on Reddit | 8 comments

https://preview.redd.it/9nmgbg6w6d9g1.png?width=957&format=png&auto=webp&s=9bdcf6353fe068da6eb694ed7fadfe45d86d6de4

From: [https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/kt-kernel/MiniMax-M2.1-Tutorial.md](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/kt-kernel/MiniMax-M2.1-Tutorial.md)

I was surprised by the difference in prefill performance. I have noticed myself that when running Qwen Next 80 on llama.cpp versus SGLang, the latter is clearly faster (and I know how much effort the team put into making Next run on llama.cpp). But I didn't expect such a big difference. Do you think this performance gap could be closed?


8 Comments


Chromix_@reddit

The prompt processing speed rises from 400 TPS at 2k context to 4k TPS at 32k context. Now that's some increase. Massive parallelization?

>...could not achieve optimal prefill and decode with a single command

The command line and version used for llama.cpp weren't shared, so one can only speculate about how things were run. Benchmarks should be reproducible.
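As a rough sanity check on those two figures, here is a minimal sketch of the implied wall-clock prefill times. It assumes "2k" and "32k" mean 2048 and 32768 tokens and that the quoted TPS values are averages over the whole prompt; neither is stated in the tutorial, and the function name is only illustrative.

```python
def prefill_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Time to process a prompt at a given average prefill throughput."""
    return prompt_tokens / tokens_per_second


# Figures quoted above; the exact token counts are an assumption.
short_s = prefill_seconds(2048, 400)     # 2k context at 400 TPS
long_s = prefill_seconds(32768, 4000)    # 32k context at 4k TPS

token_ratio = 32768 / 2048               # how much longer the prompt is
time_ratio = long_s / short_s            # how much longer prefill takes

print(f"2k prefill:  {short_s:.2f} s")
print(f"32k prefill: {long_s:.2f} s")
print(f"{token_ratio:.0f}x the tokens in {time_ratio:.1f}x the time")
```

Under those assumptions, a prompt 16 times longer takes only about 1.6 times as long to process. That would fit a fixed overhead or under-utilized hardware at short prompts, though it can't be confirmed without the exact settings.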


Funny_Plate6081@reddit

Yeah, this smells like they're comparing optimized parallel processing against some basic llama.cpp setup. Without seeing the exact command flags and batch sizes they used, it's kinda sus.

The 4.5x prefill jump screams batching optimization that llama.cpp probably wasn't configured for. Would love to see them run this again with properly tuned llama.cpp params.


LegacyRemaster@reddit (OP)

So you think it's a biased result?


easyrider99@reddit

Can confirm it's the real deal. Ktransformers/kt-kernel have put together an amazing package.


Chromix_@reddit

llama.cpp can certainly be optimized more for performance, as for example the ik\_llama fork shows for MoE models.

I find it strange that in config 1 (with GPU?) there's some prompt processing speed, but less than a token per second for inference. Then in config 2 there's barely any prompt processing speed (no GPU?), but a somewhat decent inference speed.

Thus, it'd be interesting to know the exact version and settings used for benchmarking. I would've expected those to be shared, especially after reading that they couldn't achieve an optimal prompt processing and inference speed with a single command / single settings.
