Need help with llama.cpp Qwen3.6 configuration on a single 3090 w/ 48GB RAM

Posted by valmist@reddit | LocalLLaMA

Hey there,

I have been testing models locally, but this is the first model that got me interested in understanding llama.cpp in more detail. I get noticeable stuttering when running it because it fills VRAM completely, and I'm sure there are flags I should understand better to get better performance. The output further down was captured while running the coding config variant. I am using llama-server's router capabilities with a config.ini file, so here is my llama.cpp config:

; Qwen3.6 35B A3B - general tasks (thinking)
[qwen3.6-35b-a3b-general]
model = /home/valmist/Storage/LLMs/qwen3.6-35b-a3b-ud-q4-k-xl/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
mmproj = /home/valmist/Storage/LLMs/qwen3.6-35b-a3b-ud-q4-k-xl/mmproj-F16.gguf
; --fit system handles ngl automatically, no manual n-cpu-moe needed
fit = true
fit-target = 3072
fit-ctx = 131072
; thinking config
reasoning = on
chat-template-kwargs = {"preserve_thinking":true}
flash-attn = true
temp = 1.0
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 1.5
repeat-penalty = 1.0
; performance config
no-mmap = true
parallel = 1
cache-type-k = q8_0
cache-type-v = q8_0
batch-size = 2048
ubatch-size = 1024


; Qwen3.6 35B A3B - precise coding (thinking)
[qwen3.6-35b-a3b-coding]
model = /home/valmist/Storage/LLMs/qwen3.6-35b-a3b-ud-q4-k-xl/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
mmproj = /home/valmist/Storage/LLMs/qwen3.6-35b-a3b-ud-q4-k-xl/mmproj-F16.gguf
; --fit system handles ngl automatically, no manual n-cpu-moe needed
fit = true
fit-target = 3072
fit-ctx = 131072
; thinking config
reasoning = on
chat-template-kwargs = {"preserve_thinking":true}
flash-attn = true
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 0.0
repeat-penalty = 1.0
; performance config
no-mmap = true
parallel = 1
cache-type-k = q8_0
cache-type-v = q8_0
batch-size = 2048
ubatch-size = 1024
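
For what it's worth, I figure I could also launch a single config straight from the CLI instead of going through the router when I want to isolate one flag at a time. Something like this should be roughly equivalent (just a sketch: I am assuming the standard llama-server option names, pinning -c and -ngl by hand because I don't know how the fit keys map to CLI flags, and the flash-attn spelling seems to differ between builds):

# single-config launch for A/B testing individual flags outside the router;
# -c and -ngl are guesses, lower -ngl (or use --n-cpu-moe) if the model does not fit
llama-server \
  -m /home/valmist/Storage/LLMs/qwen3.6-35b-a3b-ud-q4-k-xl/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  --mmproj /home/valmist/Storage/LLMs/qwen3.6-35b-a3b-ud-q4-k-xl/mmproj-F16.gguf \
  -c 32768 -ngl 99 \
  -ctk q8_0 -ctv q8_0 --flash-attn on \
  -b 2048 -ub 1024 --no-mmap --parallel 1 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0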

And here are my system specs:

OS: CachyOS x86_64
Host: B850 EAGLE WIFI6E (Default string-CF-ADO)
Kernel: Linux 7.0.0-1-cachyos
Display (MSI3DD3): 3840x2160 @ 1.45x in 32", 240 Hz [External]
DE: KDE Plasma 6.6.4
CPU: AMD Ryzen 7 9800X3D (16) @ 5.27 GHz
GPU: NVIDIA GeForce RTX 3090 [Discrete]
Memory: 13.39 GiB / 46.65 GiB (29%)
Disk (/): 546.10 GiB / 929.51 GiB (59%) - btrfs
Disk (/mnt/storage): 667.44 GiB / 1.79 TiB (36%) - ext4

Here is the nvidia-smi output with a model loaded (I know CUDA 13.2 is not recommended; I want to get the server side sorted out first):

❯ nvidia-smi

Tue Apr 21 23:26:02 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03              Driver Version: 595.58.03      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:01:00.0  On |                  N/A |
| 30%   46C    P3             84W /  350W |   23931MiB /  24576MiB |     12%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
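
My next step is to log memory and utilization once per second while a generation is running, to see whether the stutter coincides with memory.used sitting right at the 24576 MiB cap. A plain nvidia-smi query should be enough for that:

# one CSV line per second; if memory.used stays pinned near 24576 MiB while
# tokens stall, the stutter is likely VRAM pressure rather than sampling settings
nvidia-smi --query-gpu=timestamp,memory.used,memory.total,utilization.gpu \
           --format=csv -l 1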

So my question is: what am I doing wrong? I am definitely being too generous with some values (--fit-target is probably one of them), but I need to understand which flags impact performance the most and why. If someone can point me in the right direction, I will happily continue configuring and testing this myself.
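
In case it helps frame an answer, my plan for that testing is to A/B individual flags with llama-bench rather than the server, roughly like this (again just a sketch, assuming the usual llama-bench options, with -p/-n sized arbitrarily):

# comma-separated values are expanded into separate runs, one result row per combination
llama-bench \
  -m /home/valmist/Storage/LLMs/qwen3.6-35b-a3b-ud-q4-k-xl/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 99 -fa 1 \
  -p 2048 -n 256 \
  -ctk f16,q8_0 -ctv f16,q8_0 \
  -ub 512,1024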

Thanks in advance, let me know if you need more information.