Qwen3.6 35B-A3B very sensitive to quantization?
Posted by Sudden_Vegetable6844@reddit | LocalLLaMA | 5 comments
Wondering if it's a fluke of my testing (using LMStudio, runtime 2.14.0 based on llama.cpp release b8861) or if that model is very sensitive to quantization.
I have been testing various quants with the following prompt (thinking ON):
"I need to wash my car, the washing station is 50m away, should I walk or drive there ?"
And only Q8 comes out consistently with "drive" as the answer across multiple runs.
Lower quants at Q4 and even Q6, both from LMStudio and Unsloth, come out with "walk" at varying frequencies, failing very often at Q4.
FWIW the 27B is more resilient to that particular test and answers with "drive" consistently at Q4.
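Tallying "walk" vs "drive" across repeated runs can be automated. A minimal sketch of the counting side, assuming replies have already been collected (e.g. from LMStudio's local server); the sample replies below are made up for illustration, and the classifier strips the thinking block so reasoning traces that weigh both options don't skew the count:

```python
import re
from collections import Counter

def classify_answer(text: str) -> str:
    """Classify a model reply as 'drive', 'walk', or 'unclear'.

    Strips any <think>...</think> block first so the reasoning trace
    (where both options are usually weighed) doesn't skew the count.
    """
    visible = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    low = visible.lower()
    has_drive = "drive" in low
    has_walk = "walk" in low
    if has_drive and not has_walk:
        return "drive"
    if has_walk and not has_drive:
        return "walk"
    return "unclear"

def tally(replies):
    """Count classified answers across repeated runs of one quant."""
    return Counter(classify_answer(r) for r in replies)

# Hypothetical replies from repeated runs of a single quant:
runs = [
    "<think>The station is only 50m away...</think>Drive, the car has to be there.",
    "Walk, it's only 50 metres.",
    "<think>Walking seems natural, but the car needs washing.</think>Drive there.",
]
print(tally(runs))  # Counter({'drive': 2, 'walk': 1})
```

Replies where the visible answer mentions both words fall into "unclear", which is worth tracking separately rather than forcing a guess.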
No-Refrigerator-1672@reddit
I assume 3.6 behaves in the same way as 3.5. Here is a post by Unsloth detailing how much a model's internal state differs from the original under each quant type.
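For reference, this kind of quant-vs-original comparison is commonly expressed as the KL divergence between the full-precision and quantized models' next-token probability distributions. A toy sketch of the metric itself, with made-up probabilities standing in for real model outputs:

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL(P || Q) between two next-token probability distributions.

    The eps guard avoids log(0) when a token gets zero probability.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical next-token distributions over a 3-token slice of the vocab:
bf16 = [0.70, 0.20, 0.10]   # full-precision reference
q8   = [0.68, 0.21, 0.11]   # slight drift at Q8
q4   = [0.45, 0.35, 0.20]   # larger drift at Q4

print(kl_divergence(bf16, q8))  # small (~0.001)
print(kl_divergence(bf16, q4))  # larger (~0.13)
```

Averaged over a calibration corpus, numbers like these are what let you rank quant types; a small average divergence can still hide occasional answer flips on individual prompts, which matches the behaviour the OP is seeing.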
KURD_1_STAN@reddit
It seems it doesn't behave the same as 3.5.
Ok-Measurement-1575@reddit
Bf16 does appear to do things the q8 struggles with. Q4KL/XL is very strong but occasionally fumbles.
Watching the q8 call tools is like watching Spiderman swing between buildings at warp speed.
Grudgingly moved up to the bf16 and it is, unfortunately, better.
Impossible_Car_3745@reddit
In my experience, MoE models are sensitive to quantization in general. A 35B-A3B behaves like just a 3B model under quantization. I used to use MiniMax 2.5 AWQ, a 220B-A10B 4-bit quant, and it was just unusable.
Dr_Me_123@reddit
Quantization is lossy and specific to the calibration corpus. For Qwen3.6 35B, I found that q8_0 and bf16 outputs may differ on certain questions if they are not covered by the corpus. I think a better option is to make a custom quantization and test it on the task.