My setup for running Qwen3.6-35B-A3B-UD-Q4_K_M on a single RX 7900 XT (20 GB VRAM)

Posted by hlacik@reddit | LocalLLaMA

I am running it on Ubuntu 24.04 (in Docker). I build the image using the official llama.cpp ROCm Dockerfile (https://github.com/ggml-org/llama.cpp/blob/master/.devops/rocm.Dockerfile), only changing the ROCm version to 7.2.2.
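If you would rather not edit the Dockerfile itself, the ROCm version can likely be overridden at build time instead; a minimal sketch, assuming the file exposes a ROCM_VERSION build arg (check the ARG names at the top of rocm.Dockerfile):

docker build \
  --build-arg ROCM_VERSION=7.2.2 \
  --target server \
  -t llama-cpp-server:rocm-7.2.2 \
  -f llama.cpp/.devops/rocm.Dockerfile \
  llama.cpp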

This is my llama-server config (via docker-compose):

services:
  llama-cpp:
    container_name: llama-cpp
    build:
      context: ./llama.cpp
      dockerfile: .devops/rocm.Dockerfile
      target: server
    image: llama-cpp-server:rocm-7.2.2
    ports:
      - 8080:8080
    devices:
      - /dev/dri
      - /dev/kfd
    ipc: host
    volumes:
      - ./.models:/models
    command: >
      --model /models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
      --temp 0.6
      --top-p 0.95
      --top-k 20
      --min-p 0.00
      --presence-penalty 0.0
      --repeat-penalty 1.0


      --ctx-size 131072
      --parallel 2


      --fit-target 4096
      --no-mmap


      --flash-attn on


      --cache-type-k q4_0
      --cache-type-v q4_0


      --batch-size 1024
      --ubatch-size 256
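
Once the container is up (docker compose up -d --build), the server can be sanity-checked over its OpenAI-compatible API. A minimal sketch (llama-server exposes /health and /v1/chat/completions):

# liveness check
curl http://localhost:8080/health

# one-off chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello"}],"max_tokens":32}'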

I am getting nice numbers:
generation: ~31–33 tok/s
prompt eval: ~245 tok/s

I am also using it with opencode.ai, where --parallel 2 lets subagents run in two slots, each with its own 64k context window (the 131072-token --ctx-size is split across the 2 slots).
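To see both slots in action outside of opencode.ai, you can fire two requests concurrently; a rough sketch (with --parallel 2 and --ctx-size 131072, each slot gets 131072 / 2 = 65536 tokens):

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"first request"}],"max_tokens":32}' &
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"second request"}],"max_tokens":32}' &
wait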

My GPU is also used to render the desktop (KDE), so I decided to use --fit-target 4096 (to always keep 4 GB of VRAM free) instead of specifying how many layers to offload to GPU/CPU.
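For comparison, the manual route would be to pin a layer count and watch the VRAM headroom yourself; a sketch (the layer count below is an illustrative guess for this model/quant, not a tuned value):

# check free VRAM while the desktop and server are running
rocm-smi --showmeminfo vram

# manual alternative to --fit-target, added to the compose command instead:
# --n-gpu-layers 40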

Is there someone with a similar setup who can elaborate?

PS: HW is an RX 7900 XT and 64 GB DDR4 RAM, on Ubuntu 24.04 (Docker).
CPU is a Ryzen 7 5700X.