Spent weekend tuning LLM server to hone my nerdism so you don't have to.

Posted by ChopSticksPlease@reddit | LocalLLaMA | View on Reddit | 6 comments

The Art: https://preview.redd.it/1rdwk3yykq8g1.jpg?width=2494&format=pjpg&auto=webp&s=562c0dcecf89a3227a2627572e902afca5384bfb Tl;dr; I've spent some time setting up local AI server with various models for chat and agentic coding in VS code + Cline. The goal was to replace Ollama with llama.cpp and squeeze as much performance as I can from the hardware (Dual RTX 3090 + CPU). The llama-swap configuration with llama.cpp command and options and some extra information is here in the repo: [https://github.com/cepa/llama-nerd](https://github.com/cepa/llama-nerd) You can consider this a sample or a reference, it should work if you have 48+ GB of VRAM but you can scale it up or down by adjusting quant and context size in most models. I guess that config may be helpful for some of you who want to ditch ollama for good. The Artist: https://preview.redd.it/kdikr0zgmq8g1.jpg?width=1080&format=pjpg&auto=webp&s=6c500bd772de3ea9be6e8f1f47d542fcf45d2611 The llama-swap config: # llama-swap-config.yaml # Hardware: # Dell T7910 # GPU: 2x NVIDIA RTX 3090 (Total 48GB VRAM) # CPU: 2x Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz (2 Sockets x 40 Cores) # RAM: 256GB DDR4 # Virtual Machine: # OS: Ubuntu + Nvidia CUDA Drivers # vCPU: 40 Cores # RAM: 64GB # GPU: 2x NVIDIA RTX 3090 (48GB VRAM) (PCIe Passthrough) # Disk: 1TB NVMe (PCIe Passthrough) # NUMA: To DISABLE NUMA, the VM is pinned to physical CPU0 with 64GB RAM and both GPUs. models: # --------------------------------------------------------------------------- # Coding models # --------------------------------------------------------------------------- # https://huggingface.co/unsloth/Seed-OSS-36B-Instruct-GGUF # https://huggingface.co/magiccodingman/Seed-OSS-36B-Instruct-unsloth-MagicQuant-Hybrid-GGUF # Q6_K_XL with quantized 96k context size to fit in the 48GB VRAM for speed Seed-OSS-36B-Instruct-UD-Q5_K_XL: cmd: > llama-server --port ${PORT} --model /models/Seed-OSS-36B-Instruct-UD-Q5_K_XL.gguf --n-gpu-layers 999 --ctx-size 131072 --temp 1.1 --top-p 0.95 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on aliases: - seed-oss # https://docs.unsloth.ai/models/qwen3-coder-how-to-run-locally Qwen3-Coder-30B-A3B-Instruct-Q8_0: cmd: > llama-server --port ${PORT} --model /models/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --n-gpu-layers 999 --ctx-size 131072 --temp 0.2 --min-p 0.0 --top-p 0.90 --top-k 20 --repeat-penalty 1.05 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on aliases: - qwen3-coder Qwen2.5-Coder-32B-Instruct-Q8_0: cmd: > llama-server --port ${PORT} --model /models/Qwen2.5-Coder-32B-Instruct-Q8_0.gguf --n-gpu-layers 999 --ctx-size 131072 --temp 0.2 --min-p 0.0 --top-p 0.90 --top-k 20 --repeat-penalty 1.05 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on aliases: - qwen2.5-coder # https://docs.unsloth.ai/models/devstral-2 Devstral-Small-2-24B-Instruct-2512-Q8_0: cmd: > llama-server --port ${PORT} --model /models/Devstral-Small-2-24B-Instruct-2512-Q8_0.gguf --mmproj /models/mmproj-Devstral-Small-2-24B-Instruct-2512-F16.gguf --n-gpu-layers 999 --ctx-size 131072 --jinja --temp 0.15 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on aliases: - devstral-small-2 # Devstral is a dense model, 123b works 2tps or less. #Devstral-2-123B-Instruct-2512-IQ4_XS: # cmd: > # llama-server --port ${PORT} # --model /models/Devstral-2-123B-Instruct-2512-IQ4_XS-00001-of-00002.gguf # --n-gpu-layers 58 # --ctx-size 32768 # --jinja # --temp 0.15 # --cache-type-k q4_0 # --cache-type-v q4_0 # https://docs.unsloth.ai/models/nemotron-3 Nemotron-3-Nano-30B-A3B-Q8_0: cmd: > llama-server --port ${PORT} --model /models/Nemotron-3-Nano-30B-A3B-Q8_0.gguf --n-gpu-layers 999 --ctx-size 131072 --jinja --temp 0.6 --top-p 0.95 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on aliases: - nemotron-3-nano # --------------------------------------------------------------------------- # SOTA Models # --------------------------------------------------------------------------- # https://docs.unsloth.ai/models/gpt-oss-how-to-run-and-fine-tune # dont use cache quant, seems to impact performance # performance: 25..30tps gpt-oss-120b: cmd: > llama-server --port ${PORT} --model /models/gpt-oss-120b-MXFP4_MOE.gguf --n-gpu-layers 999 --ctx-size 65536 --flash-attn on -ot ".ffn_(up)_exps.=CPU" --threads -1 --temp 1.0 --min-p 0.0 --top-p 1.0 --top-k 0.0 --chat-template-kwargs "{\"reasoning_effort\": \"high\"}" # https://docs.unsloth.ai/models/glm-4.6-how-to-run-locally # For q8_0 cache, the max is 64k context size # For iq4_nl cache, the max is 128k context size GLM-4.5-Air-IQ4_XS: cmd: > llama-server --port ${PORT} --model /models/GLM-4.5-Air-IQ4_XS-00001-of-00002.gguf --jinja --n-gpu-layers 999 -ot ".ffn_(up)_exps.=CPU" --ctx-size 65536 --temp 1.0 --min-p 0.0 --top-p 0.95 --top-k 40 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on # https://docs.unsloth.ai/models/glm-4.6-how-to-run-locally # With mmproj and iq4_nl, the max is 32k context size but slow # With mmproj and q4_0, the max is 16k context size but is affected by the uploaded image size GLM-4.6V-IQ4_XS: cmd: > llama-server --port ${PORT} --model /models/GLM-4.6V-IQ4_XS-00001-of-00002.gguf --mmproj /models/mmproj-GLM-4.6V-F16.gguf --jinja --n-gpu-layers 999 -ot ".ffn_(up)_exps.=CPU" --ctx-size 32768 --temp 1.0 --min-p 0.0 --top-p 0.95 --top-k 40 --cache-type-k iq4_nl --cache-type-v iq4_nl --flash-attn on # https://docs.unsloth.ai/models/qwen3-next Qwen3-Next-80B-A3B-Thinking-Q4_K_M: cmd: > llama-server --port ${PORT} --model /models/Qwen3-Next-80B-A3B-Thinking-Q4_K_M.gguf --n-gpu-layers 999 --n-cpu-moe 2 --ctx-size 65536 --cache-type-k q8_0 --cache-type-v q8_0 --temp 0.6 --min-p 0.0 --top-p 0.80 --top-k 20 --flash-attn on # https://docs.unsloth.ai/models/qwen3-next Qwen3-Next-80B-A3B-Instruct-Q4_K_M: cmd: | llama-server --port ${PORT} --model /models/Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf --n-gpu-layers 999 --n-cpu-moe 2 --ctx-size 65536 --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on # https://docs.unsloth.ai/models/qwen3-vl-how-to-run-and-fine-tune Qwen3-VL-30B-A3B-Instruct-Q8_0: cmd: > llama-server --port ${PORT} --model /models/Qwen3-VL-30B-A3B-Instruct-Q8_0.gguf --mmproj /models/mmproj-Qwen3-VL-30B-A3B-Instruct-f16.gguf --n-gpu-layers 999 --ctx-size 81920 --top-p 0.8 --top-k 20 --temp 0.7 --min-p 0.0 --presence-penalty 1.5 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on # https://docs.unsloth.ai/models/qwen3-vl-how-to-run-and-fine-tune Qwen3-VL-30B-A3B-Thinking-Q8_0: cmd: > llama-server --port ${PORT} --model /models/Qwen3-VL-30B-A3B-Thinking-Q8_0.gguf --mmproj /models/mmproj-Qwen3-VL-30B-A3B-Thinking-f16.gguf --n-gpu-layers 999 --ctx-size 81920 --top-p 0.95 --top-k 20 --temp 1.0 --min-p 0.0 --presence-penalty 0.0 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on # --------------------------------------------------------------------------- # Legacy Models # --------------------------------------------------------------------------- QwQ-32B-Q8_0: cmd: > llama-server --port ${PORT} --model /models/QwQ-32B-Q8_0.gguf --n-gpu-layers 999 --ctx-size 65536 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on Qwen2.5-72B-Instruct-Q4_K_M: cmd: > llama-server --port ${PORT} --model /models/Qwen2.5-72B-Instruct-Q4_K_M.gguf --n-gpu-layers 81 --ctx-size 16384 --cache-type-k q4_0 --cache-type-v q4_0 --flash-attn on DeepSeek-R1-Distill-Llama-70B-Q4_K_M: cmd: > llama-server --port ${PORT} --model /models/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf --n-gpu-layers 999 --ctx-size 32768 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on Llama-3.3-70B-Instruct-Q4_K_M: cmd: > llama-server --port ${PORT} --model /models/Llama-3.3-70B-Instruct-Q4_K_M.gguf --n-gpu-layers 999 --ctx-size 32768 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on # https://docs.unsloth.ai/models/gemma-3-how-to-run-and-fine-tune gemma-3-27b-it-Q8_0: cmd: > llama-server --port ${PORT} --model /models/gemma-3-27b-it-Q8_0.gguf --n-gpu-layers 999 --ctx-size 131072 --temp 1.0 --repeat-penalty 1.0 --min-p 0.01 --top-k 64 --top-p 0.95 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on Hope you like it :)

6 Comments

[-]

meganoob1337@reddit

Just food for thought, I'm running a dual 3090 as well with llama-swap, but you can also use vllm inside llama-swap , if you want i can give you an example config :)

Consistent-Being9844@reddit

That would be awesome actually! I've been curious about vllm but haven't made the jump yet since llama.cpp has been working well enough. How's the performance comparison in your experience with the dual 3090s?

Look at the other answer I posted in this thread, I added an example config. I think in general the performance is not THAT different for one concurrent request, but I sometimes run some scripts that run concurrent requests and there it does shine. The only annoying thing is that you don't get per request metrics like t/s or pp/s directly in the request, but in the logs you get an overall t/s pp/s and currently running requests. It might be it has a better latency as well but I never really tested the difference.

DAlmighty@reddit

I’d be interested in a vLLM config. I’ve been far too lazy to do it myself hahaha

qwen-coder-30b: # Start command for Qwen 30B Coder cmd: | docker run --rm \ --gpus all \ --network llama-swap_llama-swap \ --name vllm-${PORT} \ --shm-size 15gb \ -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \ -e NVIDIA_VISIBLE_DEVICES=all \ -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \ -e HUGGING_FACE_HUB_TOKEN=<<huggingface_token_if_needed>>\ -e VLLM_SLEEP_WHEN_IDLE=1 \ -v /home/meganoob1337/projects/ollama/models:/root/models \ -v /home/meganoob1337/.cache/huggingface/hub:/root/.cache/huggingface/hub/ \ -v /home/meganoob1337/projects/ollama/vllm_cache2:/root/.cache/vllm \ vllm/vllm-openai:v0.11.2 \ --model cpatonn/Qwen3-Coder-30B-A3B-Instruct-AWQ-4bit \ --trust-remote-code \ --uvicorn-log-level info \ --gpu-memory-utilization 0.75 \ --tensor-parallel-size 2 \ --enable-auto-tool-choice \ --tool-call-parser qwen3_coder \ --max-model-len 128000 \ --dtype auto \ --port ${PORT} \ --host 0.0.0.0 useModelName: cpatonn/Qwen3-Coder-30B-A3B-Instruct-AWQ-4bit cmdStop: | docker stop vllm-${PORT} || true checkEndpoint: /v1/models proxy: http://vllm-${PORT}:${PORT} type: "proxy" The models mount is not neccecary i think the huggingface hub is to persist the downloaded models (feel free to take another folder i took my hosts huggingface cache and the vllm cache is for persisting vllm cache for faster subsequent startups (i think that it caches some cuda graphs or something :D ) huggingface token only needed for gated models

Many many thanks!

Reply to Post

6 Comments