Low performance on 7900 XTX with Qwen 3.6 35B A3B
Posted by soyalemujica@reddit | LocalLLaMA | 11 comments
When I first set up my PC, I got 92 t/s with Qwen3.6 35B A3B, and now for some reason it never gets past 30 t/s no matter what settings I use, with either ROCm or Vulkan.
.\llama-server.exe --model ../models/Qwen3.6-35B-A3B-UD-Q5_K_M.gguf -ctv q8_0 -ctk q8_0 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0
GPU usage is 100%, power draw is around 250 W.
Using Qwen 27B Q4_K_M:
.\llama-server.exe --model ../models/Qwen3.5-27B.Q4_K_M.gguf -ctv q8_0 -ctk q8_0 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 -fa on -fit on -c 100000
and I can't get above 29 t/s, which sounds reasonable, I guess.
realmaxwei@reddit
It's definitely OOM because of the larger compute buffer (on AMD GPUs this one is significant). Try sweeping the prompt size from 16K in 16K increments to find the sweet spot. Shrinking the ubatch will also help (at the cost of some prefill t/s, though).
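One way to run that sweep is with llama.cpp's `llama-bench` tool. A sketch under the assumption that `llama-bench` sits next to `llama-server` and takes `-p` (prompt size) and `-n` (generation length); the loop below only prints the commands, so drop the leading `echo` to actually execute them:

```shell
# Print a llama-bench sweep over prompt sizes in 16K steps;
# remove the "echo" to actually run each benchmark.
for p in 16384 32768 49152 65536; do
  echo ./llama-bench -m ../models/Qwen3.6-35B-A3B-UD-Q5_K_M.gguf -p "$p" -n 128
done
```

The point at which t/s drops sharply is where the compute buffer (or KV cache) stops fitting in VRAM.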
soyalemujica@reddit (OP)
There has to be some other reason; I restarted my PC and now I get 100 t/s.
renczzz@reddit
Hmm, I get at least double that, around 70 tok/s, with fast prompt processing at 128,000 context on Ubuntu 24 with an RX 7900 XTX (24 GB VRAM), using LM Studio and Unsloth's Qwen3.6 35B A3B IQ4_XS GGUF fully offloaded to VRAM. It could be the difference between Windows and Ubuntu; somehow I always get lower performance on Windows. I use the default Unsloth docs to set the right parameters. The only thing is that it keeps thinking in loops here and there; haven't solved that yet. Performance is not too bad on this card. Make sure the whole model fits in VRAM: Q5_K_M looks too big for the 24 GB. You could use the smaller GGUF models.
Fidrick@reddit
A 35B model at Q5 is bigger than 24 GB, isn't it? You need Q4 with a small context to fit it on your 7900 XTX.
soyalemujica@reddit (OP)
A Q4 model still runs at the same t/s.
nickm_27@reddit
Check your context. The Q4_K_XL Unsloth model got larger, so it might be partially offloaded to the CPU.
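One way to verify is to watch the load-time log, which reports how many layers landed on the GPU. A sketch assuming the Windows setup from the original post (the exact log wording can vary between llama.cpp versions):

```shell
# Hypothetical check: filter the llama-server startup log for the
# "offloaded X/Y layers to GPU" line that llama.cpp prints at load time.
.\llama-server.exe --model ../models/Qwen3.6-35B-A3B-UD-Q5_K_M.gguf 2>&1 | findstr /i "offloaded"
```

If the reported count is less than the total layer count, part of the model is running from system RAM.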
soyalemujica@reddit (OP)
3.6 35B A3B is MoE, so it shouldn't matter.
BigYoSpeck@reddit
MoE isn't a cure-all for performance. With all the weights entirely in VRAM you're over 120 tok/s.
Configure a MoE model that is larger than your VRAM properly with -ncmoe and it will still perform quickly enough for most use cases, but depending on how many expert layers you offload to the CPU it can still be nowhere near peak performance.
You aren't configuring CPU offload, so your GPU is just using shared system memory over the PCIe bus, which is slow as hell.
If you really want the Q5 quant, you need to figure out how many layers to offload via -ncmoe, and you should be able to get more like 50 tok/s.
Or just use IQ4_NL and enjoy over 100 tok/s.
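A sketch of that -ncmoe setup; the count of 8 here is a guess to be tuned, not a recommendation (raise it until VRAM usage stays under 24 GB):

```shell
# Hypothetical: -ngl 99 keeps all layers on the GPU, while -ncmoe pushes the
# expert (MoE) tensors of the first N layers to the CPU. Tune N upward until
# the load fits in 24 GB of VRAM.
.\llama-server.exe --model ../models/Qwen3.6-35B-A3B-UD-Q5_K_M.gguf -ngl 99 -ncmoe 8 -fa on
```

Because only the expert tensors move to the CPU, the dense attention path stays on the GPU, which is why this degrades much more gracefully than plain layer offload.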
nickm_27@reddit
That's not correct. You will still see a slowdown if the model is partially offloaded versus 100% in VRAM. I've had the same issue while trying to max out the KV cache, and it's very obvious when something spills into system RAM.
No_Algae1753@reddit
Try it with offloading.
soyalemujica@reddit (OP)
How?