Switching from Ollama to llama-swap + llama.cpp on NixOS: why I finally made the jump after adding a second RTX 3090
Posted by basnijholt@reddit | LocalLLaMA | 3 comments
Hi r/LocalLLaMA!
You guys convinced me I needed more 3090s!
I tried llama-swap a few months ago when gpt-oss-20b was broken in Ollama. Got it working, but went back to Ollama out of laziness—ollama pull is just too convenient.
Last month I added a second RTX 3090 (48GB VRAM total) and started running into Ollama's limitations. When you want to balance layers between GPUs and offload specific parts to system RAM, the "magic" abstractions get in the way. I wanted to say "put 40 layers on GPU 0, 40 on GPU 1, offload 15 MoE experts to CPU RAM." With llama.cpp that's just command-line flags. With llama-swap I can define it per-model in a config file.
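To make that concrete, here's a sketch of what such a per-model entry could look like in a llama-swap config (the model name, file path, and flag values are placeholders, not my actual setup; `--tensor-split`, `-ngl`, and `--n-cpu-moe` are llama.cpp server flags, and `${PORT}` is llama-swap's port macro):

```yaml
# Hypothetical llama-swap config.yaml entry; path and values are illustrative.
models:
  "gpt-oss-120b":
    cmd: |
      llama-server --port ${PORT}
        -m /models/gpt-oss-120b-Q4_K_M.gguf
        -ngl 99              # offload as many layers as fit onto the GPUs
        --tensor-split 1,1   # balance those layers evenly across the two 3090s
        --n-cpu-moe 15       # keep the MoE expert weights of 15 layers in system RAM
```

llama-swap then spawns and tears down the right llama-server process on demand when a request names that model.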
One thing that surprised me: I was getting 8 tokens/sec on gpt-oss:120b initially. It turned out the default llama.cpp build wasn't enabling BLAS or native CPU optimizations. After enabling blasSupport and -DGGML_NATIVE=ON, throughput jumped to 50 tokens/sec. Compile-time flags matter a lot for CPU-offloaded layers.
The trade-off is you lose ollama pull. You have to find the GGUF on HuggingFace, pick your quantization, and write a few lines of YAML. But honestly that forces you to understand what you're running instead of blindly accepting defaults.
I wrote up my full setup (running on NixOS with a declarative config) here: https://www.nijho.lt/post/llama-nixos/
AccordingRespect3599@reddit
Get 25 tkps on qwen next 80b. You don't need another gguf.
basnijholt@reddit (OP)
Which quant do you use and what settings? 😄
AccordingRespect3599@reddit
Q4 + recommended settings from unsloth.