Higher precision or higher parameter count
Posted by redblood252@reddit | LocalLLaMA | 27 comments
I’m wondering: if we take models of the same family (e.g. qwen3.5 MoEs) and compare GGUFs with different parameter counts and different quantizations but similar file sizes, which model would be better for tasks? If it varies, I’m mostly interested in coding and tool calling.
An example: Qwen3.5 122B UD-IQ2_XXS is 36.6 GB and Qwen3.5 35B Q8_0 is 36.9 GB.
Which would be better at coding/tool calling?
In the spirit of the same question, how worthwhile is it to run very large models like Kimi 2.6 at 1-bit precision vs smaller models at higher precision?
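For reference, the average bits per weight of each file can be back-calculated from file size and parameter count. A rough sketch (nominal parameter counts; metadata and the higher-precision tensors most quants keep are ignored):

```python
def bits_per_weight(file_gb: float, params_b: float) -> float:
    """Rough average bits per weight from GGUF file size.

    Ignores metadata and the tensors (e.g. embeddings) that most
    quant schemes keep at higher precision.
    """
    return file_gb * 8 / params_b  # GB * 8 bits per byte / billions of params

# The two files from the question, with nominal parameter counts:
print(f"122B UD-IQ2_XXS: {bits_per_weight(36.6, 122):.2f} bpw")  # ~2.40
print(f"35B  Q8_0:       {bits_per_weight(36.9, 35):.2f} bpw")   # ~8.43
```

So at the same file size, the question is essentially whether ~3.5x the parameters is worth ~3.5x fewer bits per weight.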
-dysangel-@reddit
Some model checkpoints just don't quantize well, but if you get a good quant, then in my experience the higher param count is going to crush the smaller model even if it's bf16. For example GLM 5.1 at IQ2_XXS always gives better results than anything else I have.
rpkarma@reddit
I wonder what quant will fit on a DGX Spark at ~20tk/s generation speed
-dysangel-@reddit
You'd need at least 2 sparks hooked up unless you are also willing to REAP the shit out of it. All the 5.1 REAPs I've tested so far have not worked well.
relmny@reddit
Yeah, my experience as well. Deepseek/Kimi/GLM (or qwen3.5-397b) at around Q2/Q3 or less beat any other model at Q6/Q8.
EffectiveCeilingFan@reddit
This is a pretty classic question. In general, I’ve found the best performance to be whatever you can run at Q4. Below Q4, you start to see major degradation. You still see degradation at Q4, especially in non-English languages, but the higher parameter count you get in exchange balances it out.
HopePupal@reddit
only way to be sure is to measure it. run your own evals and find out. larger models tend to be more resistant to quant damage, but only as a rule of thumb
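If you want a starting point, a minimal A/B harness against two local OpenAI-compatible servers (e.g. llama.cpp's llama-server, one instance per quant) could look like this; the ports, names, and prompts are placeholders for your own setup:

```python
import requests

# Hypothetical setup: each quant served on its own port by an
# OpenAI-compatible server such as llama.cpp's llama-server.
ENDPOINTS = {
    "122B IQ2_XXS": "http://localhost:8080/v1/chat/completions",
    "35B Q8_0":     "http://localhost:8081/v1/chat/completions",
}

# Swap in prompts drawn from your real coding / tool-calling work.
PROMPTS = [
    "Write a Python function that parses an ISO 8601 timestamp.",
    "You have tools read_file(path) and grep(pattern). Find all TODOs in src/.",
]

for name, url in ENDPOINTS.items():
    print(f"=== {name} ===")
    for prompt in PROMPTS:
        resp = requests.post(url, json={
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,  # greedy-ish sampling for a fairer comparison
        }, timeout=600)
        print(resp.json()["choices"][0]["message"]["content"][:300])
```

Score the outputs yourself (or with a judge model) and the "which quant" question usually answers itself for your workload.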
relmny@reddit
Yeah, nothing compares to your own tests.
That's why, to any question like "which model is better, which quant is better, is x better than y" and all the related ones, the only proper and valid answer is "you tell us". Everyone has their own needs/hardware/etc.
redblood252@reddit (OP)
You’re saying there isn’t something that makes one theoretically better, and that it depends on the model and the use case?
LeRobber@reddit
There are things that suggest strong quality differences at different quant levels for specific tasks, but you have to do the work to measure it.
You up for doing real quantitative analysis? If not, try qwen 3.6 35B and call it a day.
redblood252@reddit (OP)
I actually do use qwen3.6 35b, but I was always curious about this question, so I used the 3.5 family since it comes in different sizes.
LeRobber@reddit
You can run certain statistical tests to figure out how badly quantization hits a particular LLM. I haven't run them on qwen3.5, but there's a paper in my Reddit history I can dig up for you if you want to learn how to run them yourself.
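The usual measurement is per-token KL divergence between the full-precision model's output distribution and the quant's, averaged over a corpus. A toy sketch of the core computation, with stand-in logits instead of real model dumps:

```python
import numpy as np

def log_softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax."""
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def kl_divergence(logits_ref: np.ndarray, logits_quant: np.ndarray) -> float:
    """KL(P_ref || P_quant) at one token position, from raw logits."""
    log_p = log_softmax(logits_ref)
    log_q = log_softmax(logits_quant)
    return float(np.sum(np.exp(log_p) * (log_p - log_q)))

# Stand-in logits over a tiny 5-token vocab; in practice you'd dump
# these from the full-precision model and the quant over the same
# corpus and average the KL across all token positions.
ref = np.array([2.0, 1.0, 0.5, -1.0, -2.0])
quant = np.array([1.8, 1.2, 0.4, -0.8, -2.5])
print(f"KL = {kl_divergence(ref, quant):.4f} nats")
```

llama.cpp's perplexity tool can run this kind of comparison end to end against saved base-model logits, if memory serves.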
Charming_Support726@reddit
Down to Q4, differences barely matter these days.
The more parameters the better.
MoEs are problematic to compare because they have only a small number of active params. In specialized tasks the sqrt(Total × Active) estimation will fail, and models start to behave more like their "active size" instead of the calculated combined size.
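To make the rule of thumb concrete, a quick sketch; the 10B active-params figure is made up purely for illustration:

```python
import math

def effective_dense_size(total_b: float, active_b: float) -> float:
    """Rule-of-thumb 'dense-equivalent' size of a MoE: sqrt(Total * Active)."""
    return math.sqrt(total_b * active_b)

# Hypothetical: a 122B-total MoE with 10B active params per token
print(f"{effective_dense_size(122, 10):.1f}B dense-equivalent")  # ~34.9B
```

Which is roughly why a big MoE can end up fighting in the same class as a mid-size dense model, and why the estimate breaks down on specialized tasks where only the active params matter.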
Hot_Turnip_3309@reddit
It depends on the model architecture and the quality and quantity of the training data, but in the 24-31B range, a dense model at 4-bit is ideal over a MoE, at least for local models that fit in 24 GB of VRAM.
BelgianDramaLlama86@reddit
Due to the extremely low quant on the bigger model, I'm gonna lean toward the 35B here... there's no way it's not lobotomized. At Q3_K_XL, for example, you'd actually have a fight.
-dysangel-@reddit
the question is not whether or not the large model is lobotomized, it's which performs better - perfect small model, or lobotomized large model?
rpkarma@reddit
Depends on the small model, but in general perfect small model is best.
JLeonsarmiento@reddit
in theory:
Big model, small quant > small model, big quant.
The reason is simple: imagine they're both trained on the same datasets. Patterns from the data get stored either wide (lots of neurons) or deep (lots of decimals). When you quantize, you remove/round decimals, so the depth-dependent model, the small one, suffers more. Quantization doesn't remove parameters (REAP and pruning do, though you might break some weak connections), so the big model holds up better. (Toy demo below.)
BUT in reality:
Small model, big quant >>> big model, small quant.
Because you're always memory/compute constrained, it's key to access the same amount of "information" reliably with fewer resources, less divergence, and less perplexity in the process.
Model benchmarks are run at full precision, so you're close to WYSIWYG with small model + big quant too.
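A toy way to see the "rounding decimals" effect from the theory half: simulate naive round-to-nearest quantization on a random weight vector and watch the error grow as the bits shrink. (Real schemes like the K-quants and IQ quants use per-block scales and importance weighting, so they do far better than this.)

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Naive symmetric round-to-nearest quantization to `bits` bits."""
    levels = 2 ** (bits - 1) - 1        # e.g. 127 usable levels at 8-bit
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=100_000)   # a weight-like distribution

for bits in (8, 4, 2):
    err = np.abs(fake_quantize(w, bits) - w)
    print(f"{bits}-bit: mean abs error {err.mean():.6f}")
```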
computehungry@reddit
This is very underexplored. Quant publishers don't benchmark task performance and opt for PPL/KLD instead, so it's hard to know unless you test it yourself. The standard benchmarks you see on model cards take days to run fully (on consumer GPUs).
Then there are these problems:
Some models are horrible under quantization and some are not. You can't make a rule of thumb; you have to test everything.
Some models do well in benchmarks but are horrible in personal use cases. Benchmarks cover so much stuff that you'll never use.
Fit-Statistician8636@reddit
Everything other people said here is true. Just one more input: users of cloud models often rant that "ChatGPT got worse" or "Opus got worse" today... I've experienced it too. Sometimes it's just a bad prompt slipping in, but most often the degradation comes from quantization. Try not to go below Q6, possibly try Q5, but certainly don't go below Q4 for coding. I believe this doesn't apply to storytelling, where some errors might even be welcome :).
suicidaleggroll@reddit
More parameters is better down to about Q4. Below that, intelligence starts to tank fast. In this case, I'd take the 35B before the 122B at IQ2_XXS, but if you were comparing the 35B Q8 to a hypothetical 70B Q4, I'd take the 70B.
This is in general of course, some models are still usable down to Q3, and some small models punch above their weight, but I wouldn't trust any model at Q2.
Ell2509@reddit
It very much depends on the model and how well it can be quantized.
Some can handle Q4; others drop off fast below quants as high as Q8.
I'm sure you know this, but lots of people here are learning, so I thought I'd share a little more. Hope you don't mind.
robogame_dev@reddit
Maths-wise, each param is a node that meaning/information can be trained into.
More param count = more information: you have more individual semantic units in the system, a more granular knowledgebase, which usually means more detailed world knowledge, all else being equal.
Now take those params and quant them: you're essentially keeping the same granularity of worldview but reducing the connectivity between those params. You lose some of the lower-probability connections; the more you round the numbers to fit in fewer bits, the more rare token sequences you lose.
So, at the same GB size, roughly:
- more params = more world knowledge
- more bits per param = more reasoning paths
I’m on a ~40 GB VRAM budget; I don't go below Q4, and I use Q6 when quality matters.
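A back-of-the-envelope way to apply a budget like that: estimate file size from parameter count and bits per weight, and see what clears the bar. The bpw figures below are approximate effective values, and KV cache/context overhead is ignored:

```python
# Approximate effective bits per weight for some common GGUF quants.
BPW = {"Q8_0": 8.5, "Q6_K": 6.6, "Q4_K_M": 4.8, "IQ2_XXS": 2.4}

def size_gb(params_b: float, bpw: float) -> float:
    """Rough file size in GB for a model of `params_b` billion params."""
    return params_b * bpw / 8

budget = 40  # GB of VRAM
for params in (35, 70, 122):
    fits = [q for q, b in BPW.items() if size_gb(params, b) <= budget]
    print(f"{params}B fits at: {', '.join(fits) or 'nothing'}")
```

By this estimate the 122B only squeezes under 40 GB at ~2.4 bpw, which lines up with the OP's 36.6 GB IQ2_XXS file.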
ratocx@reddit
AFAIU, quality degrades fast below 4-bit, especially for long-context tasks. While the knowledge base is certainly larger with more parameters, I'd think a smaller model at higher precision is better for reliability in tool calling and code completion than a large model at very low precision.
grumd@reddit
This is actually a very good question. Q8 and Q6, for example, barely differ in quality. Q4 is still good but shows some degradation. From Q3 down, and especially at Q2, you see really significant losses.
I'd personally say that at the same size, an F16 small model will be much worse than a bigger model at Q4-Q6. Q4 in particular seems like a very good middle ground where you get most of the model's capabilities at a significantly smaller size.
But if you're comparing Q8 small vs Q2 large, I'd say the Q8 is probably better. It's hard to say, though; you'd need to run benchmarks.
HeavyConfection9236@reddit
I feel like extremely low-bit quants are, as someone put it, "for the desperate". If I think of using a model as washing your hands, would you rather wash them with a small, deep cup of water (small model, big quant) or a shallow plate holding some water (big model, small quant)?
Which is to say, I think you can get better results fitting a smaller model, with what little intelligence it has, to your use case without quantizing it, maybe with MCP or other tools, rather than hoping and praying that a tiny quant of a big model won't be erratic or dumb.
-dysangel-@reddit
so you feel and think - have you tried?
Mindless_Pain1860@reddit
From my experience, use higher precision: these are reasoning models and they're very sensitive to quantization error.