[Qwen3.6 35b a3b] Used the top config for my setup (8GB VRAM and 32GB RAM), and found that the Q4_K_XL model from Unsloth somehow runs slightly faster and uses fewer output tokens than Q4_K_M, despite higher memory usage

Posted by EggDroppedSoup@reddit | LocalLLaMA | View on Reddit | 12 comments

Config

| Metric | M Model | XL Model | Difference |
|---|---|---|---|
| Avg Tokens/sec | 28.92 | 29.78 | +0.86 (+3.0%) |
| Median Tokens/sec | 30.96 | 32.08 | +1.12 (+3.6%) |
| Avg Wall Seconds | 108.03s | 99.93s | -8.10s (-7.5%) |
| Avg Output Tokens | 3,031.8 | 2,895.8 | -136 (-4.5%) |
| Avg Input Tokens/sec | 50.20 | 55.96 | +5.76 (+11.5%) |
| Avg Decode Tokens/sec | 75.89 | 76.44 | +0.55 (+0.7%) |
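The Difference column follows directly from the M and XL columns; a quick sketch to reproduce it (values copied from the table above, `delta` is just a helper name):

```python
# M-model and XL-model values from the benchmark table
rows = {
    "Avg Tokens/sec": (28.92, 29.78),
    "Median Tokens/sec": (30.96, 32.08),
    "Avg Wall Seconds": (108.03, 99.93),
    "Avg Output Tokens": (3031.8, 2895.8),
    "Avg Input Tokens/sec": (50.20, 55.96),
    "Avg Decode Tokens/sec": (75.89, 76.44),
}

def delta(m, xl):
    """Absolute and percentage change going from the M model to the XL model."""
    return xl - m, (xl - m) / m * 100

for name, (m, xl) in rows.items():
    abs_d, pct = delta(m, xl)
    print(f"{name}: {abs_d:+.2f} ({pct:+.1f}%)")
```

Note the wall-clock and output-token rows go negative, which is the interesting part: XL finishes sooner partly because it simply emits fewer tokens per answer.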

Runs \~33% slower on the first run because my code has a bug that includes the initialization time, and as you know, for an MoE model the weights have to be loaded from storage into RAM. Each test is run 5 times to try to cancel this out, but I still included the first run because that's how I'd realistically use it (turning it on, using it once, shutting it down to run something else, etc.).
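The author's benchmark code isn't shown, but the first-run skew described above is easy to reproduce: if the timer starts before a lazily-loaded model is initialized, run 1 pays the full load cost. A minimal sketch (`load_model` and `run_once` are hypothetical stand-ins):

```python
import time

def bench(load_model, run_once, n_runs=5):
    """Time n_runs generations. The model is loaded lazily inside the timed
    region, so the first measurement also includes the one-time load from
    storage into RAM -- the 'bug' described in the post."""
    model = None
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()   # timer starts before the load...
        if model is None:
            model = load_model()      # ...so only run 1 pays init cost
        run_once(model)
        times.append(time.perf_counter() - start)
    return times
```

Averaging over 5 runs dilutes that first-run penalty rather than removing it, which matches the author's choice to keep it in as representative of start-use-shutdown workflows.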