Spent weekend tuning LLM server to hone my nerdism so you don't have to.
Posted by ChopSticksPlease@reddit | LocalLLaMA | View on Reddit | 6 comments
The Art:
https://preview.redd.it/1rdwk3yykq8g1.jpg?width=2494&format=pjpg&auto=webp&s=562c0dcecf89a3227a2627572e902afca5384bfb
Tl;dr; I've spent some time setting up local AI server with various models for chat and agentic coding in VS code + Cline. The goal was to replace Ollama with llama.cpp and squeeze as much performance as I can from the hardware (Dual RTX 3090 + CPU). The llama-swap configuration with llama.cpp command and options and some extra information is here in the repo: [https://github.com/cepa/llama-nerd](https://github.com/cepa/llama-nerd)
You can consider this a sample or a reference, it should work if you have 48+ GB of VRAM but you can scale it up or down by adjusting quant and context size in most models.
I guess that config may be helpful for some of you who want to ditch ollama for good.
The Artist:
https://preview.redd.it/kdikr0zgmq8g1.jpg?width=1080&format=pjpg&auto=webp&s=6c500bd772de3ea9be6e8f1f47d542fcf45d2611
The llama-swap config:
# llama-swap-config.yaml
# Hardware:
# Dell T7910
# GPU: 2x NVIDIA RTX 3090 (Total 48GB VRAM)
# CPU: 2x Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz (2 Sockets x 40 Cores)
# RAM: 256GB DDR4
# Virtual Machine:
# OS: Ubuntu + Nvidia CUDA Drivers
# vCPU: 40 Cores
# RAM: 64GB
# GPU: 2x NVIDIA RTX 3090 (48GB VRAM) (PCIe Passthrough)
# Disk: 1TB NVMe (PCIe Passthrough)
# NUMA: To DISABLE NUMA, the VM is pinned to physical CPU0 with 64GB RAM and both GPUs.
models:
# ---------------------------------------------------------------------------
# Coding models
# ---------------------------------------------------------------------------
# https://huggingface.co/unsloth/Seed-OSS-36B-Instruct-GGUF
# https://huggingface.co/magiccodingman/Seed-OSS-36B-Instruct-unsloth-MagicQuant-Hybrid-GGUF
# Q6_K_XL with quantized 96k context size to fit in the 48GB VRAM for speed
Seed-OSS-36B-Instruct-UD-Q5_K_XL:
cmd: >
llama-server --port ${PORT}
--model /models/Seed-OSS-36B-Instruct-UD-Q5_K_XL.gguf
--n-gpu-layers 999
--ctx-size 131072
--temp 1.1
--top-p 0.95
--cache-type-k q8_0
--cache-type-v q8_0
--flash-attn on
aliases:
- seed-oss
# https://docs.unsloth.ai/models/qwen3-coder-how-to-run-locally
Qwen3-Coder-30B-A3B-Instruct-Q8_0:
cmd: >
llama-server --port ${PORT}
--model /models/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf
--n-gpu-layers 999
--ctx-size 131072
--temp 0.2
--min-p 0.0
--top-p 0.90
--top-k 20
--repeat-penalty 1.05
--cache-type-k q8_0
--cache-type-v q8_0
--flash-attn on
aliases:
- qwen3-coder
Qwen2.5-Coder-32B-Instruct-Q8_0:
cmd: >
llama-server --port ${PORT}
--model /models/Qwen2.5-Coder-32B-Instruct-Q8_0.gguf
--n-gpu-layers 999
--ctx-size 131072
--temp 0.2
--min-p 0.0
--top-p 0.90
--top-k 20
--repeat-penalty 1.05
--cache-type-k q8_0
--cache-type-v q8_0
--flash-attn on
aliases:
- qwen2.5-coder
# https://docs.unsloth.ai/models/devstral-2
Devstral-Small-2-24B-Instruct-2512-Q8_0:
cmd: >
llama-server --port ${PORT}
--model /models/Devstral-Small-2-24B-Instruct-2512-Q8_0.gguf
--mmproj /models/mmproj-Devstral-Small-2-24B-Instruct-2512-F16.gguf
--n-gpu-layers 999
--ctx-size 131072
--jinja
--temp 0.15
--cache-type-k q8_0
--cache-type-v q8_0
--flash-attn on
aliases:
- devstral-small-2
# Devstral is a dense model, 123b works 2tps or less.
#Devstral-2-123B-Instruct-2512-IQ4_XS:
# cmd: >
# llama-server --port ${PORT}
# --model /models/Devstral-2-123B-Instruct-2512-IQ4_XS-00001-of-00002.gguf
# --n-gpu-layers 58
# --ctx-size 32768
# --jinja
# --temp 0.15
# --cache-type-k q4_0
# --cache-type-v q4_0
# https://docs.unsloth.ai/models/nemotron-3
Nemotron-3-Nano-30B-A3B-Q8_0:
cmd: >
llama-server --port ${PORT}
--model /models/Nemotron-3-Nano-30B-A3B-Q8_0.gguf
--n-gpu-layers 999
--ctx-size 131072
--jinja
--temp 0.6
--top-p 0.95
--cache-type-k q8_0
--cache-type-v q8_0
--flash-attn on
aliases:
- nemotron-3-nano
# ---------------------------------------------------------------------------
# SOTA Models
# ---------------------------------------------------------------------------
# https://docs.unsloth.ai/models/gpt-oss-how-to-run-and-fine-tune
# dont use cache quant, seems to impact performance
# performance: 25..30tps
gpt-oss-120b:
cmd: >
llama-server --port ${PORT}
--model /models/gpt-oss-120b-MXFP4_MOE.gguf
--n-gpu-layers 999
--ctx-size 65536
--flash-attn on
-ot ".ffn_(up)_exps.=CPU"
--threads -1
--temp 1.0
--min-p 0.0
--top-p 1.0
--top-k 0.0
--chat-template-kwargs "{\"reasoning_effort\": \"high\"}"
# https://docs.unsloth.ai/models/glm-4.6-how-to-run-locally
# For q8_0 cache, the max is 64k context size
# For iq4_nl cache, the max is 128k context size
GLM-4.5-Air-IQ4_XS:
cmd: >
llama-server --port ${PORT}
--model /models/GLM-4.5-Air-IQ4_XS-00001-of-00002.gguf
--jinja
--n-gpu-layers 999
-ot ".ffn_(up)_exps.=CPU"
--ctx-size 65536
--temp 1.0
--min-p 0.0
--top-p 0.95
--top-k 40
--cache-type-k q8_0
--cache-type-v q8_0
--flash-attn on
# https://docs.unsloth.ai/models/glm-4.6-how-to-run-locally
# With mmproj and iq4_nl, the max is 32k context size but slow
# With mmproj and q4_0, the max is 16k context size but is affected by the uploaded image size
GLM-4.6V-IQ4_XS:
cmd: >
llama-server --port ${PORT}
--model /models/GLM-4.6V-IQ4_XS-00001-of-00002.gguf
--mmproj /models/mmproj-GLM-4.6V-F16.gguf
--jinja
--n-gpu-layers 999
-ot ".ffn_(up)_exps.=CPU"
--ctx-size 32768
--temp 1.0
--min-p 0.0
--top-p 0.95
--top-k 40
--cache-type-k iq4_nl
--cache-type-v iq4_nl
--flash-attn on
# https://docs.unsloth.ai/models/qwen3-next
Qwen3-Next-80B-A3B-Thinking-Q4_K_M:
cmd: >
llama-server --port ${PORT}
--model /models/Qwen3-Next-80B-A3B-Thinking-Q4_K_M.gguf
--n-gpu-layers 999
--n-cpu-moe 2
--ctx-size 65536
--cache-type-k q8_0
--cache-type-v q8_0
--temp 0.6
--min-p 0.0
--top-p 0.80
--top-k 20
--flash-attn on
# https://docs.unsloth.ai/models/qwen3-next
Qwen3-Next-80B-A3B-Instruct-Q4_K_M:
cmd: |
llama-server --port ${PORT}
--model /models/Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf
--n-gpu-layers 999
--n-cpu-moe 2
--ctx-size 65536
--temp 0.7
--min-p 0.0
--top-p 0.8
--top-k 20
--cache-type-k q8_0
--cache-type-v q8_0
--flash-attn on
# https://docs.unsloth.ai/models/qwen3-vl-how-to-run-and-fine-tune
Qwen3-VL-30B-A3B-Instruct-Q8_0:
cmd: >
llama-server --port ${PORT}
--model /models/Qwen3-VL-30B-A3B-Instruct-Q8_0.gguf
--mmproj /models/mmproj-Qwen3-VL-30B-A3B-Instruct-f16.gguf
--n-gpu-layers 999
--ctx-size 81920
--top-p 0.8
--top-k 20
--temp 0.7
--min-p 0.0
--presence-penalty 1.5
--cache-type-k q8_0
--cache-type-v q8_0
--flash-attn on
# https://docs.unsloth.ai/models/qwen3-vl-how-to-run-and-fine-tune
Qwen3-VL-30B-A3B-Thinking-Q8_0:
cmd: >
llama-server --port ${PORT}
--model /models/Qwen3-VL-30B-A3B-Thinking-Q8_0.gguf
--mmproj /models/mmproj-Qwen3-VL-30B-A3B-Thinking-f16.gguf
--n-gpu-layers 999
--ctx-size 81920
--top-p 0.95
--top-k 20
--temp 1.0
--min-p 0.0
--presence-penalty 0.0
--cache-type-k q8_0
--cache-type-v q8_0
--flash-attn on
# ---------------------------------------------------------------------------
# Legacy Models
# ---------------------------------------------------------------------------
QwQ-32B-Q8_0:
cmd: >
llama-server --port ${PORT}
--model /models/QwQ-32B-Q8_0.gguf
--n-gpu-layers 999
--ctx-size 65536
--cache-type-k q8_0
--cache-type-v q8_0
--flash-attn on
Qwen2.5-72B-Instruct-Q4_K_M:
cmd: >
llama-server --port ${PORT}
--model /models/Qwen2.5-72B-Instruct-Q4_K_M.gguf
--n-gpu-layers 81
--ctx-size 16384
--cache-type-k q4_0
--cache-type-v q4_0
--flash-attn on
DeepSeek-R1-Distill-Llama-70B-Q4_K_M:
cmd: >
llama-server --port ${PORT}
--model /models/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf
--n-gpu-layers 999
--ctx-size 32768
--cache-type-k q8_0
--cache-type-v q8_0
--flash-attn on
Llama-3.3-70B-Instruct-Q4_K_M:
cmd: >
llama-server --port ${PORT}
--model /models/Llama-3.3-70B-Instruct-Q4_K_M.gguf
--n-gpu-layers 999
--ctx-size 32768
--cache-type-k q8_0
--cache-type-v q8_0
--flash-attn on
# https://docs.unsloth.ai/models/gemma-3-how-to-run-and-fine-tune
gemma-3-27b-it-Q8_0:
cmd: >
llama-server --port ${PORT}
--model /models/gemma-3-27b-it-Q8_0.gguf
--n-gpu-layers 999
--ctx-size 131072
--temp 1.0
--repeat-penalty 1.0
--min-p 0.01
--top-k 64
--top-p 0.95
--cache-type-k q8_0
--cache-type-v q8_0
--flash-attn on
Hope you like it :)
6 Comments
meganoob1337@reddit
Consistent-Being9844@reddit
meganoob1337@reddit
DAlmighty@reddit
meganoob1337@reddit
DAlmighty@reddit