Need help with llama.cpp Qwen3.6 configuration on a single 3090 w/ 48GB RAM
Posted by valmist@reddit | LocalLLaMA | View on Reddit | 6 comments
Hey there,
I have been testing models locally, but this is the first model that got me interested in understanding llama.cpp in more detail. I get noticeable stuttering when I run the model because it fills the VRAM completely, and I am sure I need a better understanding of which flags to tune for better performance. I am using the llama-server router capabilities with a config.ini file and running the model with the coding config variant, so here is my llama.cpp config:
; Qwen3.6 35B A3B - general tasks (thinking)
[qwen3.6-35b-a3b-general]
model = /home/valmist/Storage/LLMs/qwen3.6-35b-a3b-ud-q4-k-xl/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
mmproj = /home/valmist/Storage/LLMs/qwen3.6-35b-a3b-ud-q4-k-xl/mmproj-F16.gguf
; --fit system handles ngl automatically, no manual n-cpu-moe needed
fit = true
fit-target = 3072
fit-ctx = 131072
; thinking config
reasoning = on
chat-template-kwargs = {"preserve_thinking":true}
flash-attn = true
temp = 1.0
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 1.5
repeat-penalty = 1.0
; performance config
no-mmap = true
parallel = 1
cache-type-k = q8_0
cache-type-v = q8_0
batch-size = 2048
ubatch-size = 1024
; Qwen3.6 35B A3B - precise coding (thinking)
[qwen3.6-35b-a3b-coding]
model = /home/valmist/Storage/LLMs/qwen3.6-35b-a3b-ud-q4-k-xl/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
mmproj = /home/valmist/Storage/LLMs/qwen3.6-35b-a3b-ud-q4-k-xl/mmproj-F16.gguf
; --fit system handles ngl automatically, no manual n-cpu-moe needed
fit = true
fit-target = 3072
fit-ctx = 131072
; thinking config
reasoning = on
chat-template-kwargs = {"preserve_thinking":true}
flash-attn = true
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 0.0
repeat-penalty = 1.0
; performance config
no-mmap = true
parallel = 1
cache-type-k = q8_0
cache-type-v = q8_0
batch-size = 2048
ubatch-size = 1024
And here are my system specs:
OS: CachyOS x86_64
Host: B850 EAGLE WIFI6E (Default string-CF-ADO)
Kernel: Linux 7.0.0-1-cachyos
Display (MSI3DD3): 3840x2160 @ 1.45x in 32", 240 Hz [External]
DE: KDE Plasma 6.6.4
CPU: AMD Ryzen 7 9800X3D (16) @ 5.27 GHz
GPU: NVIDIA GeForce RTX 3090 [Discrete]
Memory: 13.39 GiB / 46.65 GiB (29%)
Disk (/): 546.10 GiB / 929.51 GiB (59%) - btrfs
Disk (/mnt/storage): 667.44 GiB / 1.79 TiB (36%) - ext4
Here is the nvidia-smi output while a model is loaded (I know CUDA 13.2 is not recommended; I want to solve the server part first):
❯ nvidia-smi
Tue Apr 21 23:26:02 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03 Driver Version: 595.58.03 CUDA Version: 13.2 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:01:00.0 On | N/A |
| 30% 46C P3 84W / 350W | 23931MiB / 24576MiB | 12% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
My question is: what am I doing wrong? I am certainly being too generous with some values (--fit-target is probably one of them), but I need to understand which flags impact performance the most and why. If someone can point me in the right direction, I can continue configuring and testing this myself.
Thanks in advance, let me know if you need more information.
andy2na@reddit
Use llama-swap + llama.cpp so you don't waste VRAM keeping both models up at the same time, that's a huge waste. llama-swap lets you switch between parameters without reloading the model.
Here's my llama-swap config; you can fit the whole 256k context with the IQ4_NL quant.
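A minimal sketch of what a llama-swap config.yaml along those lines can look like (the model name, paths, context size, and ttl below are placeholders, not the commenter's actual values):

models:
  "qwen3.6-35b-coder":
    # command llama-swap launches on demand; ${PORT} is substituted by llama-swap
    cmd: |
      /path/to/llama-server
        --model /path/to/Qwen3.6-35B-A3B-IQ4_NL.gguf
        --ctx-size 262144
        --cache-type-k q8_0
        --cache-type-v q8_0
        --port ${PORT}
    # unload after 5 minutes of inactivity so the VRAM is freed
    ttl: 300

llama-swap starts the listed cmd the first time a request names that model and swaps it out when a different model (or parameter set) is requested, so only one instance holds VRAM at a time.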
valmist@reddit (OP)
This helped a lot, no more stuttering, the model is much faster (getting 80-90 t/s), and the max context is a great bonus! Thanks for this, I will read and research these settings to understand why and how this all works.
No-Statement-0001@reddit
llama-swap supports an env option in the model configuration that is an array of environment variables to set. Here is my config for running Qwen 3.6 35B Q8 over dual 3090s. Notice how the CUDA_VISIBLE_DEVICES environment variable is set.
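Something along these lines (the model name and paths below are placeholders, not the actual dual-3090 config):

models:
  "qwen3.6-35b-q8":
    # make both GPUs visible to this llama-server instance
    env:
      - "CUDA_VISIBLE_DEVICES=0,1"
    cmd: |
      /path/to/llama-server
        --model /path/to/Qwen3.6-35B-A3B-Q8_0.gguf
        --port ${PORT}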
Clean_Initial_9618@reddit
How can I set up llama-swap?
andy2na@reddit
install via docker or similar: https://github.com/mostlygeek/llama-swap
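A rough sketch of the docker route (the image tag, host port, and container paths here are assumptions, so check the repo's README for the exact ones):

docker run -it --rm \
  -p 9292:8080 \
  -v /path/to/models:/models \
  -v /path/to/config.yaml:/app/config.yaml \
  ghcr.io/mostlygeek/llama-swap:cuda

llama-swap then exposes an OpenAI-compatible endpoint on the mapped port (http://localhost:9292/v1 in this sketch) and loads whichever model the request names.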
SummarizedAnu@reddit
Just use -m model.gguf -ctv q8_0 -ctk q8_0 -c 65534 and increase the context until it's close enough. The fit ones do it automatically.
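For reference, a manual (non-fit) single-GPU launch along those lines might look like this; the -ngl and --n-cpu-moe values are illustrative starting points for a 24 GB card, not tuned numbers:

/path/to/llama-server \
  -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  -c 65536 \
  -ngl 99 \
  --n-cpu-moe 8 \
  -ctk q8_0 -ctv q8_0 \
  --no-mmap \
  --port 8080

The idea is to keep all layers on the GPU (-ngl 99) and push the expert weights of the first few MoE layers to the CPU (--n-cpu-moe) until the model plus KV cache fits in VRAM with a little headroom.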