What's the smallest reasonable quant for coding?
Posted by Real_Ebb_7417@reddit | LocalLLaMA | 5 comments
So this is something that's hard for me to fully understand. I've been playing with many different coding models and quants recently, and in one-shot tests it often happens that a smaller quant of the same model does better than a bigger one (e.g. Q3 vs Q4). I know that in a one-shot test it's just a luck factor, but it shows that a smaller quant can also be "good enough".
So I'm thinking about the tradeoff between a better model at a lower quant and a worse model at a bigger quant. I know it usually depends on the specific use case, but let's generalize. As an example, I can run Qwen3.5 27b in Q6 (and this model is enough for almost anything), but yesterday I also briefly tested MiniMax M2.7 in Q3_XXS and it still gave me nice speed + it was actually doing pretty well. However, I also want to try some Q2 version, because Q3 doesn't leave me much room for kv cache. In this case I know Qwen is good enough and probably not worth switching away from, but that's not the point. I rather wonder - what is usually the smallest quant that keeps a model usable at coding? Q3 with MiniMax gave me pretty neat results, but what about Q2? Or even Q1? (I always considered Q1 unusable for almost anything, but maybe I'm wrong.)
I'm also aware that it depends on the model and quantization method, BUT as a general rule - what quant is usually the smallest reasonable option for coding? And what is the tradeoff? (E.g. MiniMax in Q3, as I said, is doing pretty well for me, but what am I actually losing compared to running e.g. Q4, which is usually considered the best go-to if you don't have the hardware but still want quality?)
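To make the memory side of this tradeoff concrete, here is a rough back-of-the-envelope sketch. The bits-per-weight figures are approximate community numbers for llama.cpp-style GGUF quants and vary by model and tensor mix, so treat the output as ballpark only:

```python
# Rough weight-size estimate at different GGUF quant levels.
# BPW values are approximate (real files mix tensor types), listed here
# only to illustrate how quant choice trades size for headroom.
BPW = {
    "Q2_K": 2.6,
    "IQ3_XXS": 3.1,
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

def weight_gib(params_billion: float, quant: str) -> float:
    """Approximate size of the quantized weights in GiB."""
    total_bits = params_billion * 1e9 * BPW[quant]
    return total_bits / 8 / 2**30

for q in ("Q6_K", "Q4_K_M", "IQ3_XXS", "Q2_K"):
    print(f"27B at {q}: ~{weight_gib(27, q):.1f} GiB")
```

Whatever VRAM is left over after the weights is what you have for kv cache and context, which is exactly why dropping from Q3 to Q2 buys context room.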
Fit-Produce420@reddit
It depends on the model.
Some quantize badly; others you can take really small, OR they were natively trained at a lower precision, OR at mixed precisions.
Lissanro@reddit
Q3-level quants of a bigger model may still be better than Q4 or larger of a smaller one. Some people say Q2 or even lower works, and it may for certain use cases, but for agentic coding with complex tasks it may cause issues, since it can lead to compounding errors. Since it is use-case specific, if you need to know for sure, the best way is to just give it a try.
I use agentic coding daily, and even though I want as much performance as I can get, IQ4_XS is the lowest quant I go with. For Minimax M2.7, I decided to go with Q5_K_S; it is practically the same as Q5_K_M in quality (I could not notice a statistically significant difference in my testing) but slightly faster, and at Q4 it starts to show noticeable degradation. Q3 quants are something in between: sometimes they seem to work well, sometimes they make mistakes or errors that rarely happen with a bigger quant, especially in less common languages or on harder tasks. I did not test Q2 or lower, though.
That said, there is no exact rule about what to use - the best way is to test multiple quants on your actual tasks and see what works reliably for you. It also depends on what you are comparing. For example, even Q3_XXS of Minimax is likely to be better in most cases (that do not need the vision capability) than Qwen3.5 27B, or even Qwen3.5 122B. But Q1 quants of Qwen3.5 397B or GLM-1 are going to be a disaster for coding. So if Q3_XXS of Minimax is what fits in your memory, and you confirmed it works better for your use cases, then there is nothing wrong with using it.
JohnMason6504@reddit
The floor for code generation is not where people think it is. Q4_K_M holds about 97 percent of fp16 HumanEval-Plus on Qwen 27b-class models; Q3_K_S drops to roughly 90 percent, and you start seeing pattern collapse on longer tool calls and multi-file edits. The delta is not uniform by task: single-function completion tolerates Q3 fine, but agentic coding that chains twenty edits stacks errors fast. One bad token in a tool call argument and the whole plan breaks. That is where K vs V asymmetry actually matters too - K is more sensitive to quantization than V for code contexts. For MiniMax M2 at Q3_XXS, the expert routing takes another hit; MoE saturates the quant budget differently than dense. If you can fit Q4_K_M on a 27b dense model, that beats Q3 on a bigger MoE for code in my measurements. The router is the fragile part.
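The "errors stack fast" point can be sketched numerically. Treating the retention percentages above as independent per-step success probabilities is a big simplification (real failures are correlated, and a good agent can recover), but it shows why long chains magnify small per-step losses:

```python
# Toy model of error compounding over an agentic chain: if each step
# succeeds independently with probability p, a whole n-step chain
# succeeds with probability p**n. The 0.97 / 0.90 figures echo the
# HumanEval retention numbers quoted above, used here only as stand-ins.

def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in the chain succeeds, assuming independence."""
    return per_step ** steps

for p in (0.97, 0.90):
    print(f"per-step {p:.2f} -> 20-step chain: {chain_success(p, 20):.1%}")
```

Under this toy model, a 7-point gap per step turns into roughly a 54% vs 12% gap over twenty chained edits, which matches the intuition that Q3 feels fine on one-shot tests but falls apart in agentic use.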
Real_Ebb_7417@reddit (OP)
So you're saying that the difference between quants will be more visible on longer, multi-step tasks, where one random mistake due to KLD can cascade into more and more errors over time?
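For anyone following along, KLD here refers to KL divergence: a measure of how far the quantized model's next-token distribution drifts from the full-precision one. A toy sketch, with entirely made-up distributions just to show the shape of the metric:

```python
import math

# KL(p || q): how much distribution q (quantized model) diverges from
# distribution p (full-precision model) over a shared token vocabulary.
# The probability values below are invented for illustration.

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL divergence in nats; assumes both lists sum to 1 and q has no zeros where p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fp16 = [0.70, 0.20, 0.08, 0.02]  # full-precision next-token probs
q4   = [0.66, 0.22, 0.09, 0.03]  # mild drift, plausible for a decent Q4
q2   = [0.45, 0.30, 0.15, 0.10]  # heavier drift at an aggressive quant

print(f"Q4-like drift: {kl_divergence(fp16, q4):.4f} nats")
print(f"Q2-like drift: {kl_divergence(fp16, q2):.4f} nats")
```

Most of the time the drifted distribution still picks the same top token, which is why small quants look fine on short one-shot tests; the occasional flip is what cascades on long multi-step tasks.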
jojorne@reddit
have you tried IQ4_NL?