huge improvement after moving from ollama to llama.cpp
Posted by leonardosalvatore@reddit | LocalLLaMA | 69 comments
Those are tiny robots fighting each other to survive.
Between matches, only one class of robots is driven by Qwen3-Coder-generated code, and it does improve match after match...
https://www.youtube.com/watch?v=FMspkoXseRw
It's fun to set different parameters and watch it.
Code:
https://github.com/leonardosalvatore/llm-robot-wars
_derpiii_@reddit
So friggin cool. Henceforth, I vote all benchmarks to be visualized as tiny robot fights.
leonardosalvatore@reddit (OP)
then I think I'll improve the graphics a bit :-D
ML-Future@reddit
The more you get into llama.cpp, the more you find parameters to make it even better.
_derpiii_@reddit
I'm on Ollama because I did not realize llama.cpp was a different thing.
Are there any downsides? Why do people choose Ollama over llama.cpp?
ArtfulGenie69@reddit
The part that is way better is that you can use the normal jinja template instead of the shit go templates. Fixes so many problems.
No-Refrigerator-1672@reddit
... and then you discover vllm and sglang.
Real_Ebb_7417@reddit
Well, vLLM is also good but not necessarily better. It’s better at some things and worse at others.
No-Refrigerator-1672@reddit
Vllm is faster for any sequence longer than 8k tokens, if you can fit the model completely in VRAM. I'm testing it like every few months, it's always the case.
FinBenton@reddit
At least people used to say that vLLM was pretty slow for a single user but fast in parallel multi-user settings; maybe that has changed.
No-Refrigerator-1672@reddit
This is born out of running llama-bench with default parameters and never thinking about whether the defaults even match real use cases. I've been playing around with vLLM since summer 2025, and it has always behaved like I'm saying.
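For context, llama-bench's defaults are a short prompt and a short generation, which rarely reflects real chat traffic. A quick sketch of a more realistic sweep (the model path is a placeholder):

```shell
# llama-bench defaults to -p 512 -n 128; real requests usually carry far
# more prompt (system prompt, tool definitions, conversation history).
# Sweep several prompt depths to see where throughput lands at realistic sizes.
llama-bench -m ./your-model.gguf -p 512,4096,8192,16384 -n 128
```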
Real_Ebb_7417@reddit
Well, I didn’t experiment with vLLM much myself, because I first tried it on my previous GPU with 16 GB VRAM and it wasn’t able to handle CPU offload properly for Qwen3.5 27b, so I just dropped it (I still want to play with it soon, though).
From what I’ve seen in the community, people usually say it handles multiple slots really well, while llama.cpp is terrible at that (and I agree, I just always run llama.cpp with -np 1 xd), but people also say that with a single slot, llama.cpp usually gives better speed. I haven’t tested that myself yet, though.
No-Refrigerator-1672@reddit
That's not entirely true. llama.cpp is sometimes faster than vLLM for sequence lengths of 4k and below. The nuance is that 4k and below is never the real case: you also have a system prompt and tool definitions, and if your request is any longer than "where is the Eiffel tower", you blow past the 4k barrier almost instantly. And at 8k, llama.cpp is already on par with or below vLLM.
VoiceApprehensive893@reddit
depths of hell
Cute_Obligation2944@reddit
--cpu-moe FTW
Mkengine@reddit
Is that still necessary when using --fit?
suicidaleggroll@reddit
On a single GPU, I get better results with n-cpu-moe than any of the auto-fitting options
Mkengine@reddit
How do you determine the optimum n?
suicidaleggroll@reddit
Open nvtop in one window, and load the model with n-cpu-moe=20 or something large enough to not OOM in the other, with the desired context size. Look at the VRAM usage once the model is fully loaded. Now kill that and run it again with n-cpu-moe=19. The difference in VRAM usage is how much is required per layer. Then you can do some basic math to figure out the right number.
For example, if n-cpu-moe=20 gives you 10.3 GiB/16 GiB used, and n-cpu-moe=19 gives you 11.3 GiB/16 GiB used, that means it's 1 GiB per layer. With 19 layers offloaded you have another 4.7 GiB available, and with 1 GiB per layer, that means you could offload another 4 layers, so n-cpu-moe=15 would be the max.
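The per-layer arithmetic above can be sketched as a tiny helper (the function name and the numbers are just illustrative, taken from the example in this comment):

```python
def max_moe_layers_on_cpu(vram_total_gib, used_at_n, used_at_n_minus_1, n):
    """Given measured VRAM usage at --n-cpu-moe=n and n-1, estimate the
    lowest n-cpu-moe value that still fits (fewer CPU layers = more on GPU)."""
    per_layer = used_at_n_minus_1 - used_at_n       # VRAM cost of one more GPU layer
    headroom = vram_total_gib - used_at_n_minus_1   # free VRAM at n-1
    extra_layers = int(headroom // per_layer)       # whole layers that still fit
    return (n - 1) - extra_layers

# 20 layers on CPU -> 10.3 GiB used, 19 -> 11.3 GiB used, on a 16 GiB card
print(max_moe_layers_on_cpu(16.0, 10.3, 11.3, 20))  # -> 15
```

In practice, leave a little margin below the computed value so context growth doesn't OOM you mid-generation.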
Cute_Obligation2944@reddit
--cpu-moe is specific to MoE models, whereas --fit iteratively tests allocations and can be applied to any model.
Mkengine@reddit
Okay, let me rephrase: since --fit is enabled by default, how does the performance of any MoE differ with just --fit vs. --fit + --cpu-moe?
Cute_Obligation2944@reddit
--cpu-moe keeps the MoE expert (feed-forward) weights on the CPU so the attention layers and KV cache fit in VRAM, which usually improves overall speed for MoE models.
jojorne@reddit
I crash with that (OOM)
suicidaleggroll@reddit
Then you set it too low
iamapizza@reddit
It's the perfect --fit
BrightRestaurant5401@reddit
ahh, the ollama command that nobody should use
vladlearns@reddit
then you start thinking maybe a little more --context would help
not too much though - you still want it to feel --balanced
leonardosalvatore@reddit (OP)
yep, just started with it.
Building up know-how.
But soon I'll commit to bringing it to work, onto real problems, soon....
defmans7@reddit
I just learned how to use llama-fit-params.exe yesterday. Almost doubled my token gen speed for some models.
Also got some interesting results experimenting with turbo quant repos.
defmans7@reddit
Check out llama swap, swaps out models kinda like using ollama.
Objective-Stranger99@reddit
Llama.cpp has a built-in router mode.
andy2na@reddit
Can you choose between different parameter sets of the same model with the built-in router, like llama-swap? I.e., use Qwen3.5 thinking and instruct without having to reload the model. That's my main use for llama-swap with qwen3.5 and gemma4, but if it's built into llama.cpp, I'd look into going that route.
my llama-swap config example for qwen3.5-9b with selections between thinking, thinking-coding, and instruct.
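For anyone curious, a minimal sketch of what such a config might look like (model path, aliases, and sampling values are placeholders, and ${PORT} is filled in by llama-swap; check the llama-swap configuration docs for the exact schema):

```yaml
# Hypothetical llama-swap config: multiple "models" over the same GGUF,
# differing only in the flags passed to llama-server.
models:
  "qwen3.5-9b-thinking":
    cmd: >
      llama-server --port ${PORT}
      -m /models/qwen3.5-9b.gguf
      --temp 0.6 --top-p 0.95
  "qwen3.5-9b-instruct":
    cmd: >
      llama-server --port ${PORT}
      -m /models/qwen3.5-9b.gguf
      --temp 0.7 --top-p 0.8
```

Clients then pick a variant by model name against one fixed endpoint, and llama-swap starts or swaps the matching llama-server instance.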
Objective-Stranger99@reddit
No, to my knowledge it cannot do that, but I've never tried it and don't have a use case for it, so I suggest you try it out yourself.
andy2na@reddit
I researched it and llama.cpp cant do that, so llama-swap is still the best route for this
Objective-Stranger99@reddit
Because of you, I am now looking into llama-swap.
andy2na@reddit
It's extremely useful, especially with all the newer models that want different parameters for different tasks on the same model
Objective-Stranger99@reddit
Any other advantages it has? I would use it if it had enough pros to justify switching from a "native" solution.
andy2na@reddit
It's basically still native llama.cpp, yeah, that's the biggest benefit, but llama-swap adds a UI to load and unload different models as you please. You can also run models concurrently (e.g. qwen3-embedding alongside qwen3.5). It also keeps a log of each request and gives you easy-to-read prompt and generation speeds, plus detailed information about each request.
I build llama.cpp myself and combine it with llama-swap, so it's always on the latest version via a script that pulls, builds, and then loads the updated Docker container.
Objective-Stranger99@reddit
Seems nice, but not really fit for my use case. I barely have enough VRAM to run one model, let alone multiple. Also, Open WebUI logs all speed and token generation statistics, and llama-server won't generate much overhead.
andy2na@reddit
Activity page
robberviet@reddit
The day llama.cpp gets load/unload, I'll ditch llama-swap immediately.
Objective-Stranger99@reddit
It actually does! One way is to set the maximum number of models. When you load too many models, the oldest one gets unloaded. If you set it to 1, only the model you selected is loaded.
Open WebUI has an action plugin for 1-click load/unload. You can create a Python script yourself if you want for your frontend.
defmans7@reddit
Router mode looks okay, but you have to maintain a separate .ini for each model if you want the same features as swap.
Actually not great for my use case. But thanks for the suggestion anyway, I'll keep an eye on the feature for future improvements.
Using aliases is actually pretty helpful when switching models all the time.
wayne_oddstops@reddit
--models-preset accepts one .ini file. I use one file for router mode with embedders, rerankers, LLMs, etc.

[*]
global setting 1
global setting 2

[custom_name]
model path
setting 1
override of global setting

[path to model]
setting 1
setting 2
defmans7@reddit
Thanks man, I guess I read the docs wrong
defmans7@reddit
I did not know that... That's great to know, might make my setup a little less complex. Thanks for the heads up!
xeeff@reddit
I still use llama-swap. I remember looking into it and deciding it's still better
deepspace86@reddit
Yeah, being able to load up other types of models at the same time is still nice
ishbuggy@reddit
Same here. It was just easier to manage aliases for different configs of the same model. So I keep one model in VRAM all the time and use swap to manage calls to "different" models from other apps which just uses different parameters for different tasks. Swap was still much easier to manage this with than llama-server alone. Also, it let me intercept application system prompts and change them as needed without too much fuss.
Far-Low-4705@reddit
Swap still has some advantages that llama.cpp doesn’t
charmander_cha@reddit
Is there a way to configure this in opencode?
defmans7@reddit
Of course, you can set custom endpoints in your opencode config. I often use local LLMs in opencode.
charmander_cha@reddit
But how does it work when it's time to switch? Does llama.cpp have a tutorial for that? I ask because with llama-swap I use one fixed endpoint and just switch between models; I don't create an endpoint for each one.
defmans7@reddit
You will define some models in llama-swap with the config, but you will also define custom models in your opencode config.
https://github.com/mostlygeek/llama-swap/blob/main/docs/configuration.md
You use a custom provider in opencode: https://opencode.ai/docs/providers/
Sorry, it is complicated to explain... And I can only understand English and some Swedish..
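To make the setup above concrete, the opencode side might look roughly like this in opencode.json (the provider name, port, and model IDs are placeholders; see the opencode providers docs for the exact schema):

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama-swap": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-swap (local)",
      "options": { "baseURL": "http://localhost:8080/v1" },
      "models": {
        "qwen3.5-9b-thinking": { "name": "Qwen3.5 9B (thinking)" },
        "qwen3.5-9b-instruct": { "name": "Qwen3.5 9B (instruct)" }
      }
    }
  }
}
```

The endpoint stays fixed; the model ID in each request tells llama-swap which config entry to spin up.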
charmander_cha@reddit
That I know; I already use llama-swap.
I want to know how to do this with llama.cpp alone. If it's not possible, llama.cpp's router makes no difference to me.
defmans7@reddit
Sorry I didn't understand your question with auto translation..
You're looking for this: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#using-multiple-models
It can do something similar to llama-swap. I haven't used it, but it is possible. I might try it out.
robberviet@reddit
Another day, another user realized how awful Ollama is.
bcdr1037@reddit
That's a fun little project!
IrisColt@reddit
This is the way.
InitialFly6460@reddit
Yes, I can confirm that Ollama, vLLM, and LM Studio significantly degrade the quality of model outputs. It's like LangGraph, which can be replaced by simple Python scripts.
stopbanni@reddit
vLLM is an entirely different backend, and LM Studio is just a frontend for llama.cpp
InitialFly6460@reddit
And sorry if you can get the same experience... maybe you are working for Meta with a huge cluster... it's not my case...
stopbanni@reddit
No, I am gpu-poor with rx 6600 on vulkan, using llama.cpp.
InitialFly6460@reddit
Yes, I should have detailed it a little bit, so I edited to be more precise about vLLM.
pokemonplayer2001@reddit
r/ihadastrokereadingthis
Real_Ebb_7417@reddit
Bro is mad xd
It’s a plainly wrong comparison. A better one in a similar tone would be "WordPress can replace React".
NNN_Throwaway2@reddit
You can't "confirm" squat bud.
Rich_Artist_8327@reddit
then after you grow a little bit older, you're ready to jump from llama.cpp to real pro inference engines like vLLM.
shaakz@reddit
Real pro inference engines, what an asshat thing to say. Always use the right tool for the job, and if he isn't running batching and wants to try frontier open-source models at release in a homelab, then there is literally no reason to go vLLM (production, absolutely, but homelab? No)
InitialFly6460@reddit
yes olla