huge improvement after moving from ollama to llama.cpp
Posted by leonardosalvatore@reddit | LocalLLaMA | 69 comments
Those are tiny robots fighting each other to survive.
Between matches, only one class of robots is driven by Qwen3-Coder-generated code, and it does improve match after match...
https://www.youtube.com/watch?v=FMspkoXseRw
It's fun to set different parameters and watch it.
Code:
https://github.com/leonardosalvatore/llm-robot-wars
_derpiii_@reddit
So friggin cool. Henceforth, I vote all benchmarks to be visualized as tiny robot fights.
leonardosalvatore@reddit (OP)
then I think I'll improve the graphics a bit :-D
ML-Future@reddit
The more you get into llama.cpp, the more you find parameters to make it even better.
_derpiii_@reddit
I'm on Ollama because I did not realize llama.cpp was a different thing.
Are there any downsides? Why do people choose Ollama over llama.cpp?
ArtfulGenie69@reddit
The part that is way better is that you can use the normal jinja template instead of the shit go templates. Fixes so many problems.
No-Refrigerator-1672@reddit
... and then you discover vllm and sglang.
Real_Ebb_7417@reddit
Well, vLLM is also good but not necessarily better. It’s better at some things and worse at others.
No-Refrigerator-1672@reddit
Vllm is faster for any sequence longer than 8k tokens, if you can fit the model completely in VRAM. I'm testing it like every few months, it's always the case.
FinBenton@reddit
At least people used to say that vLLM was pretty slow for a single user but fast in parallel multi-user settings; maybe that has changed.
No-Refrigerator-1672@reddit
This is born out of running llama-bench with default parameters and never thinking about whether the defaults even match real use cases. I've been playing around with vLLM since summer 2025, and it has always behaved like I'm saying.
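For context, llama-bench's defaults are a short prompt and a short generation, which rarely reflects real chat traffic. A quick sketch of a more realistic sweep (the model path is a placeholder):

```shell
# llama-bench defaults to -p 512 -n 128; real requests usually carry far
# more prompt (system prompt, tool definitions, conversation history).
# Sweep several prompt depths to see where throughput lands at realistic sizes.
llama-bench -m ./your-model.gguf -p 512,4096,8192,16384 -n 128
```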
Real_Ebb_7417@reddit
Well, I didn’t experiment with vLLM much myself, because I first tried it on my previous GPU with 16 GB VRAM and it wasn’t able to handle CPU offload properly for Qwen3.5 27b, so I just dropped it (I still want to play with it soon, though).
From what I’ve seen in the community, people usually say it handles multiple slots really well, while llama.cpp is terrible at that (and I agree, I just always run llama.cpp with -np 1 xd), but people also say that with a single slot, llama.cpp usually gives better speed. I haven’t tested that myself yet, though.
No-Refrigerator-1672@reddit
That's not entirely true. llama.cpp is sometimes faster than vLLM for sequence lengths of 4k and below. The nuance is that 4k and below is never the real case: you also have a system prompt and tool definitions, and if your request is any longer than "where is the Eiffel tower", you blow past the 4k barrier almost instantly. And at 8k, llama.cpp is already on par with or below vLLM.
VoiceApprehensive893@reddit
depths of hell
Cute_Obligation2944@reddit
--cpu-moe FTW
Mkengine@reddit
Is that still necessary when using --fit?
suicidaleggroll@reddit
On a single GPU, I get better results with n-cpu-moe than any of the auto-fitting options
Mkengine@reddit
How do you determine the optimum n?
suicidaleggroll@reddit
Open nvtop in one window, and load the model with n-cpu-moe=20 or something large enough to not OOM in the other, with the desired context size. Look at the VRAM usage once the model is fully loaded. Now kill that and run it again with n-cpu-moe=19. The difference in VRAM usage is how much is required per layer. Then you can do some basic math to figure out the right number.
For example, if n-cpu-moe=20 gives you 10.3 GiB/16 GiB used, and n-cpu-moe=19 gives you 11.3 GiB/16 GiB used, that means it's 1 GiB per layer. With 19 layers offloaded you have another 4.7 GiB available, and with 1 GiB per layer, that means you could offload another 4 layers, so n-cpu-moe=15 would be the max.
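The per-layer arithmetic above can be sketched as a tiny helper (the function name and the numbers are just illustrative, taken from the example in this comment):

```python
def max_moe_layers_on_cpu(vram_total_gib, used_at_n, used_at_n_minus_1, n):
    """Given measured VRAM usage at --n-cpu-moe=n and n-1, estimate the
    lowest n-cpu-moe value that still fits (fewer CPU layers = more on GPU)."""
    per_layer = used_at_n_minus_1 - used_at_n       # VRAM cost of one more GPU layer
    headroom = vram_total_gib - used_at_n_minus_1   # free VRAM at n-1
    extra_layers = int(headroom // per_layer)       # whole layers that still fit
    return (n - 1) - extra_layers

# 20 layers on CPU -> 10.3 GiB used, 19 -> 11.3 GiB used, on a 16 GiB card
print(max_moe_layers_on_cpu(16.0, 10.3, 11.3, 20))  # -> 15
```

In practice, leave a little margin below the computed value so context growth doesn't OOM you mid-generation.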
Cute_Obligation2944@reddit
--cpu-moe is specific to MoE models, whereas --fit iteratively tests allocations and can be applied to any model.
Mkengine@reddit
Okay, let me rephrase: since --fit is enabled by default, how does the performance of any MoE differ with just --fit vs. --fit + --cpu-moe?
Cute_Obligation2944@reddit
--cpu-moe keeps the MoE expert (feed-forward) weights on the CPU so the attention layers and KV cache fit in VRAM, which usually improves overall speed for MoE models.
jojorne@reddit
I crash with that (OOM)
suicidaleggroll@reddit
Then you set it too low
iamapizza@reddit
It's the perfect --fit
BrightRestaurant5401@reddit
ahh, the ollama command that nobody should use
vladlearns@reddit
then you start thinking maybe a little more --context would help
not too much though - you still want it to feel --balanced
leonardosalvatore@reddit (OP)
yep, just started with it.
Building up know-how.
But soon I'll commit to bringing it to work, onto real problems, soon....
defmans7@reddit
I just learned how to use llama-fit-params.exe yesterday. Almost doubled my token gen speed for some models.
Also got some interesting results experimenting with turbo quant repos.
defmans7@reddit
Check out llama swap, swaps out models kinda like using ollama.
Objective-Stranger99@reddit
Llama.cpp has a built-in router mode.
andy2na@reddit
Can you choose between different parameter sets of the same model with the built-in router, like llama-swap? I.e., use Qwen3.5 thinking and instruct without having to reload the model. That's my main use for llama-swap with qwen3.5 and gemma4, but if it's built into llama.cpp, I'd look into going that route.
my llama-swap config example for qwen3.5-9b with selections between thinking, thinking-coding, and instruct.
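For anyone curious, a minimal sketch of what such a config might look like (model path, aliases, and sampling values are placeholders, and ${PORT} is filled in by llama-swap; check the llama-swap configuration docs for the exact schema):

```yaml
# Hypothetical llama-swap config: multiple "models" over the same GGUF,
# differing only in the flags passed to llama-server.
models:
  "qwen3.5-9b-thinking":
    cmd: >
      llama-server --port ${PORT}
      -m /models/qwen3.5-9b.gguf
      --temp 0.6 --top-p 0.95
  "qwen3.5-9b-instruct":
    cmd: >
      llama-server --port ${PORT}
      -m /models/qwen3.5-9b.gguf
      --temp 0.7 --top-p 0.8
```

Clients then pick a variant by model name against one fixed endpoint, and llama-swap starts or swaps the matching llama-server instance.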
Objective-Stranger99@reddit
No, to my knowledge it cannot do that, but I've never tried it and don't have a use case for it, so I suggest you try it out yourself.
andy2na@reddit
I researched it and llama.cpp cant do that, so llama-swap is still the best route for this
Objective-Stranger99@reddit
Because of you, I am now looking into llama-swap.
andy2na@reddit
It's extremely useful, especially with all the newer models that want different parameters for different tasks on the same model
Objective-Stranger99@reddit
Any other advantages it has? I would use it if it had enough pros to justify switching from a "native" solution.
andy2na@reddit
It's basically still native llama.cpp, yeah, that's the biggest benefit, but llama-swap adds a UI to load and unload different models as you please. You can also run models concurrently (e.g. qwen3-embedding alongside qwen3.5). It also keeps a log of each request and gives you easy-to-read prompt and generation speeds, plus detailed information about each request.
I build llama.cpp myself and combine it with llama-swap, so it's always on the latest version via a script that pulls, builds, and then loads the updated Docker container.
Objective-Stranger99@reddit
Seems nice, but not really fit for my use case. I barely have enough VRAM to run one model, let alone multiple. Also, Open WebUI logs all speed and token generation statistics, and llama-server won't generate much overhead.
andy2na@reddit
Activity page
robberviet@reddit
The day llama.cpp gets load/unload, I'll ditch llama-swap immediately.
Objective-Stranger99@reddit
It actually does! One way is to set the maximum number of models. When you load too many models, the oldest one gets unloaded. If you set it to 1, only the model you selected is loaded.
Open WebUI has an action plugin for 1-click load/unload. You can create a Python script yourself if you want for your frontend.
defmans7@reddit
Router mode looks okay, but you have to maintain a separate .ini for each model if you want the same features as swap.
Actually not great for my use case. But thanks for the suggestion anyway, I'll keep an eye on the feature for future improvements.
Using aliases is actually pretty helpful when switching models all the time.
wayne_oddstops@reddit
--models-preset accepts one .ini file. I use one file for router mode with embedders, rerankers, LLMs, etc.

[*]
global setting 1
global setting 2

[custom_name]
model path
setting 1
override of global setting

[path to model]
setting 1
setting 2
defmans7@reddit
Thanks man, I guess I read the docs wrong
defmans7@reddit
I did not know that... That's great to know, might make my setup a little less complex. Thanks for the heads up!
xeeff@reddit
I still use llama-swap. I remember looking into it and deciding it's still better
deepspace86@reddit
Yeah, being able to load up other types of models at the same time is still nice
ishbuggy@reddit
Same here. It was just easier to manage aliases for different configs of the same model. So I keep one model in VRAM all the time and use swap to manage calls to "different" models from other apps which just uses different parameters for different tasks. Swap was still much easier to manage this with than llama-server alone. Also, it let me intercept application system prompts and change them as needed without too much fuss.
Far-Low-4705@reddit
Swap still has some advantages that llama.cpp doesn’t
charmander_cha@reddit
Is there a way to configure this in opencode?
defmans7@reddit
Of course, you can set custom endpoints in your opencode config. I often use local LLMs in opencode.
charmander_cha@reddit
But how does it work when it's time to switch? Does llama.cpp have a tutorial for that? I ask because with llama-swap I use one fixed endpoint and just switch between models; I don't create an endpoint for each one.
defmans7@reddit
You will define some models in llama-swap with the config, but you will also define custom models in your opencode config.
https://github.com/mostlygeek/llama-swap/blob/main/docs/configuration.md
You use a custom provider in opencode: https://opencode.ai/docs/providers/
Sorry, it is complicated to explain... And I can only understand English and some Swedish..
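To make the setup above concrete, the opencode side might look roughly like this in opencode.json (the provider name, port, and model IDs are placeholders; see the opencode providers docs for the exact schema):

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama-swap": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-swap (local)",
      "options": { "baseURL": "http://localhost:8080/v1" },
      "models": {
        "qwen3.5-9b-thinking": { "name": "Qwen3.5 9B (thinking)" },
        "qwen3.5-9b-instruct": { "name": "Qwen3.5 9B (instruct)" }
      }
    }
  }
}
```

The endpoint stays fixed; the model ID in each request tells llama-swap which config entry to spin up.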
charmander_cha@reddit
That I know; I already use llama-swap.
I want to know how to do this with llama.cpp alone. If it's not possible, llama.cpp's router makes no difference to me.
defmans7@reddit
Sorry I didn't understand your question with auto translation..
You're looking for this: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#using-multiple-models
It can do something similar to llama-swap. I haven't used it, but it is possible. I might try it out.
robberviet@reddit
Another day, another user realized how awful Ollama is.
bcdr1037@reddit
That's a fun little project!
IrisColt@reddit
This is the way.
InitialFly6460@reddit
Yes, I can confirm that Ollama, vLLM, and LM Studio significantly degrade the quality of model outputs. It's like LangGraph, which can be replaced by simple Python scripts.
stopbanni@reddit
vLLM is an entirely different backend, and LM Studio is just a frontend for llama.cpp
InitialFly6460@reddit
And sorry if you can get the same experience... maybe you are working for Meta with a huge cluster... it's not my case...
stopbanni@reddit
No, I am gpu-poor with rx 6600 on vulkan, using llama.cpp.
InitialFly6460@reddit
Yes, I should have detailed it a little bit, so I edited to be more precise about vLLM.
pokemonplayer2001@reddit
r/ihadastrokereadingthis
Real_Ebb_7417@reddit
Bro is mad xd
It’s a plainly wrong comparison. A better one in a similar tone would be "WordPress can replace React".
NNN_Throwaway2@reddit
You can't "confirm" squat bud.
Rich_Artist_8327@reddit
then after you grow a little bit older, you're ready to jump from llama.cpp to real pro inference engines like vLLM.
shaakz@reddit
Real pro inference engines, what an asshat thing to say. Always use the right tool for the job, and if he isn't running batching and wants to try frontier open-source models at release in a homelab, then there is literally no reason to go vLLM (production, absolutely, but homelab? No)
InitialFly6460@reddit
yes olla