I'm running an agentic system with kobold.cpp as my backend. Am I losing performance?

Posted by AlphaSyntauri@reddit | LocalLLaMA | View on Reddit | 5 comments

Currently, I'm running a Hermes agent with an OpenAI v1 compatible endpoint provided by Kobold. My setup is a a 24GB 3090Ti + 512GB DDR4 running Qwen3.6-35B-A3B.

I plan to move to a larger MoE model once I'm satisfied with how everything is working, but I'm just wondering if I'm sacrificing performance by not using llama.cpp standalone and relying on a program that's more focused on ease of use.

To my knowledge it's just a simple wrapper, but I'm curious if anyone has any experience swapping between Kobold and other local endpoints. Thanks!

[-]

BC_MARO@reddit

Kobold is basically a llama.cpp fork, so perf is usually within a few percent unless you're on an old build or missing newer kernels/quants. If you're curious, run a same-prompt tok/s benchmark against current llama.cpp and you'll know in 5 minutes.

a_beautiful_rhind@reddit

ik_llama might be faster. doubt you're missing much in kobold vs mainline.

FullOf_Bad_Ideas@reddit

You are probably not losing anything meaningful. Just make sure to use the latest version of kobold.

Dany0@reddit

llama.cpp offers flexibility. you don't lose too much with kobold cpp

vLLM is where speed is at, especially with multigpu setups like yours

Setting it up is a bit more work, but you can get a clanker to do it for you

Herr_Drosselmeyer@reddit

To my knowledge it's just a simple wrapper,

Sort of. Kobold does run its own fork of llama.cpp, so there could be differences. They may delay or omit certain features of llama.cpp in order to make sure they don't break anything. That could then lead to performance differences.

Personally, I found that using Oobabooga's TextGen gave me better performance, but you kind of have to try the different setups yourself, because things change fast.