My setup for running Qwen3.6-35B-A3B-UD-Q4_K_M on a single RX 7900 XT (20GB VRAM)
Posted by hlacik@reddit | LocalLLaMA | 21 comments
I am running it on Ubuntu 24.04 (in Docker). I build it using the official llama.cpp ROCm Dockerfile (https://github.com/ggml-org/llama.cpp/blob/master/.devops/rocm.Dockerfile), only changing the ROCm version to 7.2.2.
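For reference, a one-off build without compose would look roughly like the sketch below; it assumes the Dockerfile exposes a ROCM_VERSION build arg (check the file, otherwise edit the base image tag in it directly):

# Hypothetical build-arg name; verify against the actual Dockerfile.
docker build \
  --build-arg ROCM_VERSION=7.2.2 \
  --target server \
  -t llama-cpp-server:rocm-7.2.2 \
  -f ./llama.cpp/.devops/rocm.Dockerfile \
  ./llama.cpp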
This is my llama-server config (via docker-compose):
services:
  llama-cpp:
    container_name: llama-cpp
    build:
      context: ./llama.cpp
      dockerfile: .devops/rocm.Dockerfile
      target: server
    image: llama-cpp-server:rocm-7.2.2
    ports:
      - 8080:8080
    devices:
      - /dev/dri
      - /dev/kfd
    ipc: host
    volumes:
      - ./.models:/models
    command: >
      --model /models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
      --temp 0.6
      --top-p 0.95
      --top-k 20
      --min-p 0.00
      --presence-penalty 0.0
      --repeat-penalty 1.0
      --ctx-size 131072
      --parallel 2
      --fit-target 4096
      --no-mmap
      --flash-attn on
      --cache-type-k q4_0
      --cache-type-v q4_0
      --batch-size 1024
      --ubatch-size 256
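To bring it up and smoke-test it, assuming llama-server's stock HTTP endpoints (/health and the OpenAI-compatible /v1/chat/completions):

# Build and start the service, then poke the server.
docker compose up -d --build llama-cpp
curl http://localhost:8080/health
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hi"}],"max_tokens":32}'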
I am getting nice numbers:
generation: ~31–33 tok/s
prompt eval: ~245 tok/s
I am also using it with opencode.ai, where --parallel 2 lets subagents each use a 64k context window (the 131072 ctx-size is split between the two slots).
My GPU is also used to render the desktop (KDE), so I decided to use --fit-target 4096 (to always keep 4 GB of VRAM free) instead of specifying how many layers to offload to GPU/CPU.
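For comparison, the manual alternative would be something like this sketch (the layer count is a made-up starting point to tune against actual VRAM headroom):

# Manual offload instead of --fit-target: pin the GPU layer count
# yourself and lower it until ~4 GB of VRAM stays free for KDE.
llama-server \
  --model /models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  --ctx-size 131072 \
  --flash-attn on \
  --n-gpu-layers 40   # hypothetical value; tune against rocm-smi output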
Is there someone with a similar setup who can share their experience?
PS: HW is an RX 7900 XT on Ubuntu 24.04 (Docker), 64GB DDR4 RAM, and a Ryzen 5700X CPU.
Monad_Maya@reddit
You're "bleeding" into the CPU, as the 27B dense model can achieve 31/32 tps on a 7900 XT (in my own testing).
You will notice issues due to this heavy quantization of the KV cache; stick to Q8 or higher.
I suggest you use the IQ4_XS quant - https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF. Check your VRAM utilization and, if possible, move to a Q8 KV cache.
Or you can use the 27B dense IQ4_XS for the slightly more difficult tasks.
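In the OP's compose command block, the Q8 suggestion would amount to something like this (q8_0 being llama.cpp's 8-bit KV cache type):

# Q8 KV cache instead of q4_0 (trades some VRAM for cache fidelity)
--cache-type-k q8_0
--cache-type-v q8_0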
leonbollerup@reddit
The 27B is slow af compared to the 35B :-/ .. even on my 5090m I get around 50 tok/sec.
Monad_Maya@reddit
I find the output quality to be discernibly better with the 27B, agreed on the speed.
I don't mind 30 tps generation but the prompt processing speed is quite slow.
TerminalNoop@reddit
Well, with a 7900 XTX and 32GB RAM I'm asking myself what is better:
The 27B as a 4bit quant or the 35B as an 8bit quant?
lloyd08@reddit
Try and see. I think everyone confidently telling everyone else what model to use is some “works on my computer” type of logic. I’ve found a ton of uses for running 9B. You don’t need all the knowledge in the universe to answer “is this a hotdog”
TerminalNoop@reddit
Nobody would be asking whether this or that model is better if it were all about "is this a hot dog". For that you can just ask ChatGPT for free.
lloyd08@reddit
My point is, nobody knows how you use it, so any advice is based on how someone else uses it.
Monad_Maya@reddit
Good question actually.
The speed in this configuration would be roughly the same.
I suggest that you opt for 4bit 35B (not 8bit) for speed since it'll offload to VRAM just fine.
Test drive it for a bit, see if you can notice some errors and grade your overall experience.
The 27B dense feels a bit smarter but can occasionally be terse and to the point. Mostly OK with me. The prompt processing (PP) speed is not good for large codebases.
If you're looking for non-coding use, then Gemma4 all the way.
TerminalNoop@reddit
Interesting, I heard that MoE models degrade a lot more at lower quants compared to dense models. I get about ~28 tok/s at 130k (mostly empty) context with the 35B MoE at 8bit.
The dense one kinda doesn't fit at longer context unless I take a really small 4bit quant.
dero_name@reddit
Seems low. How fast is your RAM?
If I were you, I would purchase a cheap-ish secondary GPU just to render the desktop and use the full VRAM capacity of the XT with something like UD-Q3_K_S (15.4 GB), easily reaching 100+ tps decode.
hlacik@reddit (OP)
It's DDR4-3200 with a Ryzen 5700X.
dero_name@reddit
Not suggesting you should buy a new rig.
Just a cheap GPU like an AMD Radeon RX 6400 to render your desktop, freeing the whole 20 GB of your XT's VRAM for the model. You could fit it entirely in 20 GB with decent context.
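As a rough sketch of what that frees up (the filename and context size here are illustrative; 99 is the usual "offload all layers" idiom in llama.cpp):

# Fully offloaded run once nothing else is using the XT's VRAM.
llama-server \
  --model /models/Qwen3.6-35B-A3B-UD-Q3_K_S.gguf \
  --n-gpu-layers 99 \
  --flash-attn on \
  --ctx-size 32768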
hlacik@reddit (OP)
Yeah, I am on an mITX board with only 1 GPU slot. I do understand your point though, and it completely makes sense.
lloyd08@reddit
I have a nearly identical setup, a 7900 XT w/ 3800X, except only 32GB RAM. I use Vulkan on headless Ubuntu and connect to it from my laptop, and get:
prompt: 650 +/- 400 tps*
eval: 55 -> 40 tps
This is with 8 layers overflowing; once I get above that, there is a noticeable dropoff.
*I get low PP speeds randomly, typically on small prompts, which makes me ignore the issue. It might be worth testing with various prompt sizes if you're benchmarking instead of assuming it's universal. I often only get 250 on my initial "hello world" style prompts when testing, but a serious prompt is usually 600-900.
Settings when testing it (screenshot omitted):
Normally I run more context, but since I have my desktop and Firefox open, I had to trim it down until it offloaded at most 8 layers. You may just want to tweak context until you have 8 or fewer layers offloaded. Given I'm running similar context at q8 k/v, it seems you're just killing your own speed and quality for no reason. Alternatively, I more frequently run the thicc boi 27B, but I run that one with -np 1, so it might not be comparable to your use case.
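One way to do that tuning (a sketch assuming ROCm's stock rocm-smi tool is available on the host) is to watch VRAM while adjusting --ctx-size:

# Watch VRAM headroom while llama-server loads; shrink --ctx-size
# (or raise the layer count) until usage sits just under 20 GB.
watch -n 1 rocm-smi --showmeminfo vram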
Glittering_Focus1538@reddit
This tracks. Offloading 10 layers to CPU, even on an APEX mini version of Qwen 3.6, I get 45 tok/s on my RX 9070.
JaredsBored@reddit
That's slow, especially the prompt processing speed. Why do you have the batch sizes set smaller than default? I'm guessing the batching and quantized KV are causing the slowdown.
hlacik@reddit (OP)
It was causing tearing in KDE, but I have switched to Vulkan and now I am keeping it at the defaults.
leonbollerup@reddit
I have a 5090m (Olares One) and get around 150 tok/sec with that model.. :)
arbv@reddit
Try -b 3072 -ub 1536 for better prompt processing speed.
Gueleric@reddit
In my experience on ROCm, --fit-target gives really bad performance. You should try setting --n-cpu-moe manually and see if it improves your performance.
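A minimal sketch of that approach, with a guessed starting value to tune (--n-cpu-moe keeps layers on the GPU but routes MoE expert weights to the CPU):

# Keep all layers "on GPU" but push N MoE expert blocks to the CPU.
llama-server \
  --model /models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  --ctx-size 131072 \
  --n-gpu-layers 99 \
  --n-cpu-moe 12   # hypothetical starting value; lower it until VRAM fills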
Atul_Kumar_97@reddit
It's low, but it's an AMD card and I don't have one. I have an RTX 4060 (8GB VRAM) + 32GB RAM at 160k context size; I get 50 tok/sec to 40 tok/sec, dropping to 38 tok/sec. It's good.