Spent the weekend tuning my LLM server to hone my nerdism so you don't have to.

Posted by ChopSticksPlease@reddit | LocalLLaMA

The Art: https://preview.redd.it/1rdwk3yykq8g1.jpg?width=2494&format=pjpg&auto=webp&s=562c0dcecf89a3227a2627572e902afca5384bfb

TL;DR: I've spent some time setting up a local AI server with various models for chat and agentic coding in VS Code + Cline. The goal was to replace Ollama with llama.cpp and squeeze as much performance as I can out of the hardware (dual RTX 3090 + CPU).

The llama-swap configuration, with the llama.cpp commands and options plus some extra information, is in the repo: [https://github.com/cepa/llama-nerd](https://github.com/cepa/llama-nerd)

You can treat this as a sample or a reference. It should work if you have 48+ GB of VRAM, but you can scale it up or down by adjusting the quant and context size for most models. I guess the config may be helpful for those of you who want to ditch Ollama for good.

The Artist: https://preview.redd.it/kdikr0zgmq8g1.jpg?width=1080&format=pjpg&auto=webp&s=6c500bd772de3ea9be6e8f1f47d542fcf45d2611

The llama-swap config:

    # llama-swap-config.yaml

    # Hardware:
    #   Dell T7910
    #   GPU: 2x NVIDIA RTX 3090 (Total 48GB VRAM)
    #   CPU: 2x Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz (2 Sockets x 40 Cores)
    #   RAM: 256GB DDR4
    # Virtual Machine:
    #   OS: Ubuntu + Nvidia CUDA Drivers
    #   vCPU: 40 Cores
    #   RAM: 64GB
    #   GPU: 2x NVIDIA RTX 3090 (48GB VRAM) (PCIe Passthrough)
    #   Disk: 1TB NVMe (PCIe Passthrough)
    # NUMA: To DISABLE NUMA, the VM is pinned to physical CPU0 with 64GB RAM and both GPUs.

    models:

      # ---------------------------------------------------------------------------
      # Coding models
      # ---------------------------------------------------------------------------

      # https://huggingface.co/unsloth/Seed-OSS-36B-Instruct-GGUF
      # https://huggingface.co/magiccodingman/Seed-OSS-36B-Instruct-unsloth-MagicQuant-Hybrid-GGUF
      # Q6_K_XL with quantized 96k context size to fit in the 48GB VRAM for speed
      Seed-OSS-36B-Instruct-UD-Q5_K_XL:
        cmd: >
          llama-server --port ${PORT}
          --model /models/Seed-OSS-36B-Instruct-UD-Q5_K_XL.gguf
          --n-gpu-layers 999
          --ctx-size 131072
          --temp 1.1
          --top-p 0.95
          --cache-type-k q8_0
          --cache-type-v q8_0
          --flash-attn on
        aliases:
          - seed-oss

      # https://docs.unsloth.ai/models/qwen3-coder-how-to-run-locally
      Qwen3-Coder-30B-A3B-Instruct-Q8_0:
        cmd: >
          llama-server --port ${PORT}
          --model /models/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf
          --n-gpu-layers 999
          --ctx-size 131072
          --temp 0.2
          --min-p 0.0
          --top-p 0.90
          --top-k 20
          --repeat-penalty 1.05
          --cache-type-k q8_0
          --cache-type-v q8_0
          --flash-attn on
        aliases:
          - qwen3-coder

      Qwen2.5-Coder-32B-Instruct-Q8_0:
        cmd: >
          llama-server --port ${PORT}
          --model /models/Qwen2.5-Coder-32B-Instruct-Q8_0.gguf
          --n-gpu-layers 999
          --ctx-size 131072
          --temp 0.2
          --min-p 0.0
          --top-p 0.90
          --top-k 20
          --repeat-penalty 1.05
          --cache-type-k q8_0
          --cache-type-v q8_0
          --flash-attn on
        aliases:
          - qwen2.5-coder

      # https://docs.unsloth.ai/models/devstral-2
      Devstral-Small-2-24B-Instruct-2512-Q8_0:
        cmd: >
          llama-server --port ${PORT}
          --model /models/Devstral-Small-2-24B-Instruct-2512-Q8_0.gguf
          --mmproj /models/mmproj-Devstral-Small-2-24B-Instruct-2512-F16.gguf
          --n-gpu-layers 999
          --ctx-size 131072
          --jinja
          --temp 0.15
          --cache-type-k q8_0
          --cache-type-v q8_0
          --flash-attn on
        aliases:
          - devstral-small-2

      # Devstral is a dense model; the 123B runs at 2 tps or less.
      #Devstral-2-123B-Instruct-2512-IQ4_XS:
      #  cmd: >
      #    llama-server --port ${PORT}
      #    --model /models/Devstral-2-123B-Instruct-2512-IQ4_XS-00001-of-00002.gguf
      #    --n-gpu-layers 58
      #    --ctx-size 32768
      #    --jinja
      #    --temp 0.15
      #    --cache-type-k q4_0
      #    --cache-type-v q4_0

      # https://docs.unsloth.ai/models/nemotron-3
      Nemotron-3-Nano-30B-A3B-Q8_0:
        cmd: >
          llama-server --port ${PORT}
          --model /models/Nemotron-3-Nano-30B-A3B-Q8_0.gguf
          --n-gpu-layers 999
          --ctx-size 131072
          --jinja
          --temp 0.6
          --top-p 0.95
          --cache-type-k q8_0
          --cache-type-v q8_0
          --flash-attn on
        aliases:
          - nemotron-3-nano

      # ---------------------------------------------------------------------------
      # SOTA Models
      # ---------------------------------------------------------------------------

      # https://docs.unsloth.ai/models/gpt-oss-how-to-run-and-fine-tune
      # don't use cache quant, seems to impact performance
      # performance: 25..30 tps
      gpt-oss-120b:
        cmd: >
          llama-server --port ${PORT}
          --model /models/gpt-oss-120b-MXFP4_MOE.gguf
          --n-gpu-layers 999
          --ctx-size 65536
          --flash-attn on
          -ot ".ffn_(up)_exps.=CPU"
          --threads -1
          --temp 1.0
          --min-p 0.0
          --top-p 1.0
          --top-k 0.0
          --chat-template-kwargs "{\"reasoning_effort\": \"high\"}"

      # https://docs.unsloth.ai/models/glm-4.6-how-to-run-locally
      # For q8_0 cache, the max is 64k context size
      # For iq4_nl cache, the max is 128k context size
      GLM-4.5-Air-IQ4_XS:
        cmd: >
          llama-server --port ${PORT}
          --model /models/GLM-4.5-Air-IQ4_XS-00001-of-00002.gguf
          --jinja
          --n-gpu-layers 999
          -ot ".ffn_(up)_exps.=CPU"
          --ctx-size 65536
          --temp 1.0
          --min-p 0.0
          --top-p 0.95
          --top-k 40
          --cache-type-k q8_0
          --cache-type-v q8_0
          --flash-attn on

      # https://docs.unsloth.ai/models/glm-4.6-how-to-run-locally
      # With mmproj and iq4_nl, the max is 32k context size but slow
      # With mmproj and q4_0, the max is 16k context size but is affected by the uploaded image size
      GLM-4.6V-IQ4_XS:
        cmd: >
          llama-server --port ${PORT}
          --model /models/GLM-4.6V-IQ4_XS-00001-of-00002.gguf
          --mmproj /models/mmproj-GLM-4.6V-F16.gguf
          --jinja
          --n-gpu-layers 999
          -ot ".ffn_(up)_exps.=CPU"
          --ctx-size 32768
          --temp 1.0
          --min-p 0.0
          --top-p 0.95
          --top-k 40
          --cache-type-k iq4_nl
          --cache-type-v iq4_nl
          --flash-attn on

      # https://docs.unsloth.ai/models/qwen3-next
      Qwen3-Next-80B-A3B-Thinking-Q4_K_M:
        cmd: >
          llama-server --port ${PORT}
          --model /models/Qwen3-Next-80B-A3B-Thinking-Q4_K_M.gguf
          --n-gpu-layers 999
          --n-cpu-moe 2
          --ctx-size 65536
          --cache-type-k q8_0
          --cache-type-v q8_0
          --temp 0.6
          --min-p 0.0
          --top-p 0.80
          --top-k 20
          --flash-attn on

      # https://docs.unsloth.ai/models/qwen3-next
      Qwen3-Next-80B-A3B-Instruct-Q4_K_M:
        cmd: |
          llama-server --port ${PORT}
          --model /models/Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf
          --n-gpu-layers 999
          --n-cpu-moe 2
          --ctx-size 65536
          --temp 0.7
          --min-p 0.0
          --top-p 0.8
          --top-k 20
          --cache-type-k q8_0
          --cache-type-v q8_0
          --flash-attn on

      # https://docs.unsloth.ai/models/qwen3-vl-how-to-run-and-fine-tune
      Qwen3-VL-30B-A3B-Instruct-Q8_0:
        cmd: >
          llama-server --port ${PORT}
          --model /models/Qwen3-VL-30B-A3B-Instruct-Q8_0.gguf
          --mmproj /models/mmproj-Qwen3-VL-30B-A3B-Instruct-f16.gguf
          --n-gpu-layers 999
          --ctx-size 81920
          --top-p 0.8
          --top-k 20
          --temp 0.7
          --min-p 0.0
          --presence-penalty 1.5
          --cache-type-k q8_0
          --cache-type-v q8_0
          --flash-attn on

      # https://docs.unsloth.ai/models/qwen3-vl-how-to-run-and-fine-tune
      Qwen3-VL-30B-A3B-Thinking-Q8_0:
        cmd: >
          llama-server --port ${PORT}
          --model /models/Qwen3-VL-30B-A3B-Thinking-Q8_0.gguf
          --mmproj /models/mmproj-Qwen3-VL-30B-A3B-Thinking-f16.gguf
          --n-gpu-layers 999
          --ctx-size 81920
          --top-p 0.95
          --top-k 20
          --temp 1.0
          --min-p 0.0
          --presence-penalty 0.0
          --cache-type-k q8_0
          --cache-type-v q8_0
          --flash-attn on

      # ---------------------------------------------------------------------------
      # Legacy Models
      # ---------------------------------------------------------------------------

      QwQ-32B-Q8_0:
        cmd: >
          llama-server --port ${PORT}
          --model /models/QwQ-32B-Q8_0.gguf
          --n-gpu-layers 999
          --ctx-size 65536
          --cache-type-k q8_0
          --cache-type-v q8_0
          --flash-attn on

      Qwen2.5-72B-Instruct-Q4_K_M:
        cmd: >
          llama-server --port ${PORT}
          --model /models/Qwen2.5-72B-Instruct-Q4_K_M.gguf
          --n-gpu-layers 81
          --ctx-size 16384
          --cache-type-k q4_0
          --cache-type-v q4_0
          --flash-attn on

      DeepSeek-R1-Distill-Llama-70B-Q4_K_M:
        cmd: >
          llama-server --port ${PORT}
          --model /models/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf
          --n-gpu-layers 999
          --ctx-size 32768
          --cache-type-k q8_0
          --cache-type-v q8_0
          --flash-attn on

      Llama-3.3-70B-Instruct-Q4_K_M:
        cmd: >
          llama-server --port ${PORT}
          --model /models/Llama-3.3-70B-Instruct-Q4_K_M.gguf
          --n-gpu-layers 999
          --ctx-size 32768
          --cache-type-k q8_0
          --cache-type-v q8_0
          --flash-attn on

      # https://docs.unsloth.ai/models/gemma-3-how-to-run-and-fine-tune
      gemma-3-27b-it-Q8_0:
        cmd: >
          llama-server --port ${PORT}
          --model /models/gemma-3-27b-it-Q8_0.gguf
          --n-gpu-layers 999
          --ctx-size 131072
          --temp 1.0
          --repeat-penalty 1.0
          --min-p 0.01
          --top-k 64
          --top-p 0.95
          --cache-type-k q8_0
          --cache-type-v q8_0
          --flash-attn on

Hope you like it :)
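For anyone scaling the config up or down: a rough way to see how `--ctx-size` and `--cache-type-k` / `--cache-type-v` trade off is to estimate the KV cache size by hand. The sketch below is only a back-of-the-envelope helper, not something from the repo. The function name `kv_cache_gib` and the model shape (64 layers, 8 KV heads, head dimension 128) are made up for illustration and do not describe any model in the config. It assumes plain full attention in every layer, so it will likely overestimate for hybrid or sliding-window models, and it ignores model weights and compute buffers.

```python
# Bytes stored per cached element for a few llama.cpp cache types.
# The quantized types use blocks of 32 elements:
#   q8_0            -> 34 bytes per block (2-byte scale + 32 x 8-bit values)
#   q4_0, iq4_nl    -> 18 bytes per block (2-byte scale + 32 x 4-bit values)
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
    "iq4_nl": 18 / 32,
}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_size, cache_type="f16"):
    """Approximate K+V cache size in GiB, assuming full attention in every layer."""
    # Factor 2 covers the K tensor and the V tensor.
    elements = 2 * n_layers * n_kv_heads * head_dim * ctx_size
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to make the arithmetic easy to follow.
    shape = dict(n_layers=64, n_kv_heads=8, head_dim=128, ctx_size=131072)
    for cache_type in ("f16", "q8_0", "q4_0"):
        print(f"{cache_type}: {kv_cache_gib(cache_type=cache_type, **shape):.1f} GiB")
    # f16: 32.0 GiB
    # q8_0: 17.0 GiB
    # q4_0: 9.0 GiB
```

Under those assumptions, going from f16 to q8_0 roughly halves the cache, and halving `--ctx-size` halves it again, which is why those two knobs are the first things to adjust when a model does not fit in VRAM.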