My setup for running Qwen3.6-35B-A3B-UD-Q4_K_M on a single RX 7900 XT (20GB VRAM)
Posted by hlacik@reddit | LocalLLaMA | 21 comments
I am running it on Ubuntu 24.04 (in Docker). I build it using the official llama.cpp ROCm Dockerfile (https://github.com/ggml-org/llama.cpp/blob/master/.devops/rocm.Dockerfile), only changing the ROCm version to 7.2.2.
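For reference, a one-off build without compose would look roughly like the sketch below; it assumes the Dockerfile exposes a ROCM_VERSION build arg (check the file, otherwise edit the base image tag in it directly):

# Hypothetical build-arg name; verify against the actual Dockerfile.
docker build \
  --build-arg ROCM_VERSION=7.2.2 \
  --target server \
  -t llama-cpp-server:rocm-7.2.2 \
  -f ./llama.cpp/.devops/rocm.Dockerfile \
  ./llama.cpp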
This is my llama-server config (via docker-compose):
services:
  llama-cpp:
    container_name: llama-cpp
    build:
      context: ./llama.cpp
      dockerfile: .devops/rocm.Dockerfile
      target: server
    image: llama-cpp-server:rocm-7.2.2
    ports:
      - 8080:8080
    devices:
      - /dev/dri
      - /dev/kfd
    ipc: host
    volumes:
      - ./.models:/models
    command: >
      --model /models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
      --temp 0.6
      --top-p 0.95
      --top-k 20
      --min-p 0.00
      --presence-penalty 0.0
      --repeat-penalty 1.0
      --ctx-size 131072
      --parallel 2
      --fit-target 4096
      --no-mmap
      --flash-attn on
      --cache-type-k q4_0
      --cache-type-v q4_0
      --batch-size 1024
      --ubatch-size 256
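To bring it up and smoke-test it, assuming llama-server's stock HTTP endpoints (/health and the OpenAI-compatible /v1/chat/completions):

# Build and start the service, then poke the server.
docker compose up -d --build llama-cpp
curl http://localhost:8080/health
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hi"}],"max_tokens":32}'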
I am getting nice numbers:
generation: ~31–33 tok/s
prompt eval: ~245 tok/s
I am also using it with opencode.ai, where --parallel 2 lets subagents each use a 64k context window (the 131072 ctx-size is split between the two slots).
My GPU is also used to render the desktop (KDE), so I decided to use --fit-target 4096 (to always keep 4 GB of VRAM free) instead of specifying how many layers to offload to GPU/CPU.
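For comparison, the manual alternative would be something like this sketch (the layer count is a made-up starting point to tune against actual VRAM headroom):

# Manual offload instead of --fit-target: pin the GPU layer count
# yourself and lower it until ~4 GB of VRAM stays free for KDE.
llama-server \
  --model /models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  --ctx-size 131072 \
  --flash-attn on \
  --n-gpu-layers 40   # hypothetical value; tune against rocm-smi output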
Is there someone with a similar setup who can share their experience?
PS: HW is an RX 7900 XT on Ubuntu 24.04 (Docker), 64GB DDR4 RAM, and a Ryzen 5700X CPU.
Monad_Maya@reddit
You're "bleeding" into the CPU, as the 27B dense model can achieve 31/32 tps on a 7900 XT (in my own testing).
You will notice issues due to this heavy quantization of the KV cache; stick to Q8 or higher.
I suggest you use the IQ4_XS quant - https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF. Check your VRAM utilization and, if possible, move to a Q8 KV cache.
Or you can use the 27B dense IQ4_XS for the slightly more difficult tasks.
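In the OP's compose command block, the Q8 suggestion would amount to something like this (q8_0 being llama.cpp's 8-bit KV cache type):

# Q8 KV cache instead of q4_0 (trades some VRAM for cache fidelity)
--cache-type-k q8_0
--cache-type-v q8_0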
leonbollerup@reddit
The 27B is slow af compared to the 35B :-/ .. even on my 5090m I get around 50 tok/sec.
Monad_Maya@reddit
I find the output quality to be discernibly better with the 27B, agreed on the speed.
I don't mind 30 tps generation but the prompt processing speed is quite slow.
TerminalNoop@reddit
Well, with a 7900 XTX and 32GB RAM I'm asking myself what is better:
The 27B as a 4bit quant or the 35B as an 8bit quant?
lloyd08@reddit
Try and see. I think everyone confidently telling everyone else what model to use is some “works on my computer” type of logic. I’ve found a ton of uses for running 9B. You don’t need all the knowledge in the universe to answer “is this a hotdog”
TerminalNoop@reddit
Nobody would be asking whether this or that model is better if it were all about "is this a hot dog". For that you can just ask ChatGPT for free.
lloyd08@reddit
My point is, nobody knows how you use it, so any advice is based on how someone else uses it.
Monad_Maya@reddit
Good question actually.
The speed in this configuration would be roughly the same.
I suggest that you opt for 4bit 35B (not 8bit) for speed since it'll offload to VRAM just fine.
Test drive it for a bit, see if you can notice some errors and grade your overall experience.
The 27B dense feels a bit smarter but can occasionally be terse and to the point. Mostly OK with me. The prompt processing (PP) speed is not good for large codebases.
If you're looking for non-coding use, then Gemma4 all the way.
TerminalNoop@reddit
Interesting, I heard that MoE models degrade a lot more at lower quants compared to dense models. I get about ~28 tok/s at 130k (mostly empty) context with the 35B MoE at 8bit.
The dense one kinda doesn't fit at longer context unless I take a really small 4bit quant.
dero_name@reddit
Seems low. How fast is your RAM?
If I were you, I would purchase a cheap-ish secondary GPU just to render the desktop and use the full VRAM capacity of the XT with something like UD-Q3_K_S (15.4 GB), easily reaching 100+ tps decode.
hlacik@reddit (OP)
It's DDR4-3200 with a Ryzen 5700X.
dero_name@reddit
Not suggesting you should buy a new rig.
Just a cheap GPU like an AMD Radeon RX 6400 to render your desktop, freeing the whole 20 GB of your XT's VRAM for the model. You could fit it entirely in 20 GB with decent context.
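As a rough sketch of what that frees up (the filename and context size here are illustrative; 99 is the usual "offload all layers" idiom in llama.cpp):

# Fully offloaded run once nothing else is using the XT's VRAM.
llama-server \
  --model /models/Qwen3.6-35B-A3B-UD-Q3_K_S.gguf \
  --n-gpu-layers 99 \
  --flash-attn on \
  --ctx-size 32768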
hlacik@reddit (OP)
Yeah, I am on an mITX board with only 1 GPU slot. I do understand your point though, and it completely makes sense.
lloyd08@reddit
I have a nearly identical setup, a 7900 XT w/ 3800X, except only 32GB RAM. I use Vulkan on headless Ubuntu and connect to it from my laptop, and get:
prompt: 650 +/- 400 tps*
eval: 55 -> 40 tps
This is with 8 layers overflowing; once I get above that, there is a noticeable dropoff.
*I get low PP speeds randomly, typically on small prompts, which makes me ignore the issue. It might be worth testing with various prompt sizes if you're benchmarking instead of assuming it's universal. I often only get 250 on my initial "hello world" style prompts when testing, but a serious prompt is usually 600-900.
Settings when testing it (screenshot omitted):
Normally I run more context, but since I have my desktop and Firefox open, I had to trim it down until it offloaded at most 8 layers. You may just want to tweak context until you have 8 or fewer layers offloaded. Given I'm running similar context at q8 k/v, it seems you're just killing your own speed and quality for no reason. Alternatively, I more frequently run the thicc boi 27B, but I run that one with -np 1, so it might not be comparable to your use case.
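One way to do that tuning (a sketch assuming ROCm's stock rocm-smi tool is available on the host) is to watch VRAM while adjusting --ctx-size:

# Watch VRAM headroom while llama-server loads; shrink --ctx-size
# (or raise the layer count) until usage sits just under 20 GB.
watch -n 1 rocm-smi --showmeminfo vram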
Glittering_Focus1538@reddit
This tracks. Offloading 10 layers to CPU, even on an APEX mini version of Qwen 3.6, I get 45 tok/s on my RX 9070.
JaredsBored@reddit
That's slow, especially the prompt processing speed. Why do you have the batch sizes set smaller than default? I'm guessing the batching and quantized KV are causing the slowdown.
hlacik@reddit (OP)
It was causing tearing in KDE, but I have switched to Vulkan and now I am keeping it at the defaults.
leonbollerup@reddit
I have a 5090m (Olares One) and get around 150 tok/sec with that model.. :)
arbv@reddit
Try -b 3072 -ub 1536 for better prompt processing speed.
Gueleric@reddit
In my experience on ROCm, --fit-target gives really bad performance. You should try setting --n-cpu-moe manually and see if it improves your performance.
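A minimal sketch of that approach, with a guessed starting value to tune (--n-cpu-moe keeps layers on the GPU but routes MoE expert weights to the CPU):

# Keep all layers "on GPU" but push N MoE expert blocks to the CPU.
llama-server \
  --model /models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  --ctx-size 131072 \
  --n-gpu-layers 99 \
  --n-cpu-moe 12   # hypothetical starting value; lower it until VRAM fills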
Atul_Kumar_97@reddit
It's low, but it's an AMD card and I don't have one. I have an RTX 4060 (8GB VRAM) + 32GB RAM at 160k context size; I get 50 tok/sec to 40 tok/sec, dropping to 38 tok/sec. It's good.