I'm running an agentic system with kobold.cpp as my backend. Am I losing performance?
Posted by AlphaSyntauri@reddit | LocalLLaMA | View on Reddit | 5 comments
Currently, I'm running a Hermes agent with an OpenAI v1 compatible endpoint provided by Kobold. My setup is a a 24GB 3090Ti + 512GB DDR4 running Qwen3.6-35B-A3B.
I plan to move to a larger MoE model once I'm satisfied with how everything is working, but I'm just wondering if I'm sacrificing performance by not using llama.cpp standalone and relying on a program that's more focused on ease of use.
To my knowledge it's just a simple wrapper, but I'm curious if anyone has any experience swapping between Kobold and other local endpoints. Thanks!
BC_MARO@reddit
Kobold is basically a llama.cpp fork, so perf is usually within a few percent unless you're on an old build or missing newer kernels/quants. If you're curious, run a same-prompt tok/s benchmark against current llama.cpp and you'll know in 5 minutes.
a_beautiful_rhind@reddit
ik_llama might be faster. doubt you're missing much in kobold vs mainline.
FullOf_Bad_Ideas@reddit
You are probably not losing anything meaningful. Just make sure to use the latest version of kobold.
Dany0@reddit
llama.cpp offers flexibility. you don't lose too much with kobold cpp
vLLM is where speed is at, especially with multigpu setups like yours
Setting it up is a bit more work, but you can get a clanker to do it for you
Herr_Drosselmeyer@reddit
Sort of. Kobold does run its own fork of llama.cpp, so there could be differences. They may delay or omit certain features of llama.cpp in order to make sure they don't break anything. That could then lead to performance differences.
Personally, I found that using Oobabooga's TextGen gave me better performance, but you kind of have to try the different setups yourself, because things change fast.