Need help with llama.cpp Qwen3.6 configuration on a single 3090 w/ 48GB RAM
Posted by valmist@reddit | LocalLLaMA | View on Reddit | 6 comments
Hey there,
I have been testing models locally, but this is the first model that got me interested in understanding llama.cpp in more detail. I get noticeable stuttering when I run the model because it fills the VRAM completely, and I am sure I need a better understanding of which flags to tune for better performance. I am using the llama-server router capabilities with a config.ini file and running the model with the coding config variant, so here is my llama.cpp config:
; Qwen3.6 35B A3B - general tasks (thinking)
[qwen3.6-35b-a3b-general]
model = /home/valmist/Storage/LLMs/qwen3.6-35b-a3b-ud-q4-k-xl/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
mmproj = /home/valmist/Storage/LLMs/qwen3.6-35b-a3b-ud-q4-k-xl/mmproj-F16.gguf
; --fit system handles ngl automatically, no manual n-cpu-moe needed
fit = true
fit-target = 3072
fit-ctx = 131072
; thinking config
reasoning = on
chat-template-kwargs = {"preserve_thinking":true}
flash-attn = true
temp = 1.0
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 1.5
repeat-penalty = 1.0
; performance config
no-mmap = true
parallel = 1
cache-type-k = q8_0
cache-type-v = q8_0
batch-size = 2048
ubatch-size = 1024
; Qwen3.6 35B A3B - precise coding (thinking)
[qwen3.6-35b-a3b-coding]
model = /home/valmist/Storage/LLMs/qwen3.6-35b-a3b-ud-q4-k-xl/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
mmproj = /home/valmist/Storage/LLMs/qwen3.6-35b-a3b-ud-q4-k-xl/mmproj-F16.gguf
; --fit system handles ngl automatically, no manual n-cpu-moe needed
fit = true
fit-target = 3072
fit-ctx = 131072
; thinking config
reasoning = on
chat-template-kwargs = {"preserve_thinking":true}
flash-attn = true
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 0.0
repeat-penalty = 1.0
; performance config
no-mmap = true
parallel = 1
cache-type-k = q8_0
cache-type-v = q8_0
batch-size = 2048
ubatch-size = 1024
And here are my system specs:
OS: CachyOS x86_64
Host: B850 EAGLE WIFI6E (Default string-CF-ADO)
Kernel: Linux 7.0.0-1-cachyos
Display (MSI3DD3): 3840x2160 @ 1.45x in 32", 240 Hz [External]
DE: KDE Plasma 6.6.4
CPU: AMD Ryzen 7 9800X3D (16) @ 5.27 GHz
GPU: NVIDIA GeForce RTX 3090 [Discrete]
Memory: 13.39 GiB / 46.65 GiB (29%)
Disk (/): 546.10 GiB / 929.51 GiB (59%) - btrfs
Disk (/mnt/storage): 667.44 GiB / 1.79 TiB (36%) - ext4
Here is the nvidia-smi output while a model is loaded (I know CUDA 13.2 is not recommended; I want to solve the server part first):
❯ nvidia-smi
Tue Apr 21 23:26:02 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03 Driver Version: 595.58.03 CUDA Version: 13.2 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:01:00.0 On | N/A |
| 30% 46C P3 84W / 350W | 23931MiB / 24576MiB | 12% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
My question is: what am I doing wrong? I am certainly being too generous with some values (--fit-target is probably one of them), but I need to understand which flags impact performance the most and why. If someone can point me in the right direction, I can continue configuring and testing this myself.
Thanks in advance, let me know if you need more information.
andy2na@reddit
Use llama-swap + llama.cpp so you don't waste VRAM keeping both models up at the same time, that's a huge waste. llama-swap lets you switch between parameters without reloading the model.
Here's my llama-swap config; you can fit the whole 256k context with the IQ4_NL quant.
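A minimal sketch of what a llama-swap config.yaml along those lines can look like (the model name, paths, context size, and ttl below are placeholders, not the commenter's actual values):

models:
  "qwen3.6-35b-coder":
    # command llama-swap launches on demand; ${PORT} is substituted by llama-swap
    cmd: |
      /path/to/llama-server
        --model /path/to/Qwen3.6-35B-A3B-IQ4_NL.gguf
        --ctx-size 262144
        --cache-type-k q8_0
        --cache-type-v q8_0
        --port ${PORT}
    # unload after 5 minutes of inactivity so the VRAM is freed
    ttl: 300

llama-swap starts the listed cmd the first time a request names that model and swaps it out when a different model (or parameter set) is requested, so only one instance holds VRAM at a time.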
valmist@reddit (OP)
This helped a lot, no more stuttering, the model is much faster (getting 80-90 t/s), and the max context is a great bonus! Thanks for this, I will read and research these settings to understand why and how this all works.
No-Statement-0001@reddit
llama-swap supports an env option in the model configuration that is an array of environment variables to set. Here is my config for running Qwen 3.6 35B Q8 over dual 3090s. Notice how the CUDA_VISIBLE_DEVICES environment variable is set.
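Something along these lines (the model name and paths below are placeholders, not the actual dual-3090 config):

models:
  "qwen3.6-35b-q8":
    # make both GPUs visible to this llama-server instance
    env:
      - "CUDA_VISIBLE_DEVICES=0,1"
    cmd: |
      /path/to/llama-server
        --model /path/to/Qwen3.6-35B-A3B-Q8_0.gguf
        --port ${PORT}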
Clean_Initial_9618@reddit
How can I set up llama-swap?
andy2na@reddit
install via docker or similar: https://github.com/mostlygeek/llama-swap
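A rough sketch of the docker route (the image tag, host port, and container paths here are assumptions, so check the repo's README for the exact ones):

docker run -it --rm \
  -p 9292:8080 \
  -v /path/to/models:/models \
  -v /path/to/config.yaml:/app/config.yaml \
  ghcr.io/mostlygeek/llama-swap:cuda

llama-swap then exposes an OpenAI-compatible endpoint on the mapped port (http://localhost:9292/v1 in this sketch) and loads whichever model the request names.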
SummarizedAnu@reddit
Just use -m model.gguf -ctv q8_0 -ctk q8_0 -c 65534 and increase the context until it's close enough. The fit ones do it automatically.
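For reference, a manual (non-fit) single-GPU launch along those lines might look like this; the -ngl and --n-cpu-moe values are illustrative starting points for a 24 GB card, not tuned numbers:

/path/to/llama-server \
  -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  -c 65536 \
  -ngl 99 \
  --n-cpu-moe 8 \
  -ctk q8_0 -ctv q8_0 \
  --no-mmap \
  --port 8080

The idea is to keep all layers on the GPU (-ngl 99) and push the expert weights of the first few MoE layers to the CPU (--n-cpu-moe) until the model plus KV cache fits in VRAM with a little headroom.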