at what point does quantization stop being a tradeoff and start being actual quality loss
Posted by srodland01@reddit | LocalLLaMA | View on Reddit | 34 comments
Been running a few models locally at different quant levels and honestly the jump from Q5 to Q4 sometimes feels like nothing and other times it completely tanks coherence on longer outputs. is there a general rule for where the cliff is, or does it just depend entirely on the model architecture and what you're doing with it. Would love to hear what quant levels people here actually settle on for daily use versus what they use when quality really matters
substandard-tech@reddit
Isn’t a loss in quality the actual tradeoff? This question is nuts
Lesser-than@reddit
I notice an extreme drop in quality when the quant doesn't fit in my hardware /s .
EveningGold1171@reddit
The problem is that the domain over which you're quantizing is not actually the domain over which the model weights are distributed. There's effectively a non-linear transformation between the weights and their projections, i.e. the error incurred by quantization is not bounded. In practice that's somewhat mitigated by layer normalization, but some models have their weight distribution more spread out than others, which leads to some models quantizing fine down at Q4 while others feel lobotomized. Short of using some kind of attention-aware quantization or a similar process that actually compares outputs, the success or failure of a quantization comes down to luck: how well the training process has led toward the ideal of weights compressed into nearly orthogonal subspaces, where you would only really need one bit per dimension to give you a unit vector.
Gringe8@reddit
I see noticeable quality loss under q4km
MaxKruse96@reddit
Anything below the original is actual quality loss. Unless you magically figure out how to perfectly compress 16 bits into less than 16 bits, there always will be.
Pleasant-Shallot-707@reddit
Witness the spawning of the MP3 quality debates for this generation.
Borkato@reddit
I once bought $1000 speakers because people convinced me that it was soooooooooooo much better. Returned them two days later after I realized I literally didn’t give a fuck lol
lmamakos@reddit
If only you had oxygen-free speaker wires..... 😀
Nyghtbynger@reddit
That's so funny lol. I have poor music taste, but good palate. As such I favour good restaurants but my audio is aliexpress level
Nyghtbynger@reddit
The mp3 quality debate is defined by the output device quality, not the size of the brain of the person arguing. That's why it hasn't been solved as a problem.
MaxKruse96@reddit
if only people would try to give MP3s access to their life instead of LLMs at Q3 with openclaw 😭
No_easy_money@reddit
In audio, there is lossless compression with FLAC. I wonder if there could be an equivalent for LLMs?
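For what it's worth, a quick sanity check (synthetic weights standing in for a real tensor) of how far generic lossless compression gets on float16 weights. Trained weights have near-random mantissa bits, so there's little redundancy for an entropy coder to exploit, unlike PCM audio for FLAC:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for one weight tensor; real checkpoints behave similarly.
weights = rng.normal(0, 0.02, 500_000).astype(np.float16)
raw = weights.tobytes()
packed = zlib.compress(raw, level=9)
ratio = len(packed) / len(raw)
print(f"lossless ratio: {ratio:.2f}")  # far from FLAC-style ~0.5-0.6 savings
```

The exponent/sign bytes compress a little (the values cluster around zero), but the mantissa bits are effectively noise, so you get nothing like the 2x you'd need to compete with even Q8.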
segmond@reddit
There is no magic formula; it varies from model to model. But the smaller the model, the more likely the quality loss. How much the loss matters depends on what you are trying to do. For chat, writing, etc., it might not matter much. For precise work like maths, or low-level programming (think C/C++/Rust vs JavaScript), it matters a lot. For precise vision work such as OCR or counting objects, vs just describing the object, it matters too.
Say it after me kids.
Quality of Tokens beats Quantity of Tokens.
SexyAlienHotTubWater@reddit
You start to see perplexity meaningfully increase at around 5 bits. 4 bits is worse. 6 is about the limit before you start seeing quality loss.
fulgencio_batista@reddit
How does NVFP4 compare?
rm-rf-rm@reddit
has anyone compared perplexity/KL divergence metrics to benchmark scores? (and benchmark scores themselves are removed from real world performance - so I have no idea how these technical metrics relate to real world performance)
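For anyone who wants to compute the KL side of that comparison themselves, a minimal sketch of the metric. The logits here are synthetic stand-ins (Gaussian noise added to simulate quantization error), not outputs from a real model, so only the shape of the trend is meaningful:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mean_kl(p, q):
    """Mean per-token KL(p || q) in nats."""
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

rng = np.random.default_rng(0)
logits = rng.normal(0, 4, size=(256, 10_000))  # stand-in for full-precision logits
p = softmax(logits)

kls = []
for noise in (0.05, 0.2, 0.8):                 # stand-in for growing quant error
    q = softmax(logits + rng.normal(0, noise, logits.shape))
    kls.append(mean_kl(p, q))
    print(f"logit noise {noise}: mean KL = {kls[-1]:.4f}")
```

With real models you'd take `p` from the BF16 model and `q` from the quant on the same text (llama.cpp's perplexity tooling can emit this). The open question the comment raises still stands: nobody has convincingly mapped a given KL value onto benchmark or real-world deltas.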
Septerium@reddit
There is a YouTube channel called "x-create" (or something like that) where the guy tests new models at different quant levels. He basically uses one-shot prompts to create complex applications. This is not the way I like to use models for coding (I think we should always break projects into small tasks), but in this type of testing there is almost always a noticeable difference in quality from Q8 to Q6... and it gets even worse at Q4. I think complex one-shot prompts and long context are the situations where compressed models break hard.
Adventurous-Paper566@reddit
It's very model-dependent; sometimes Q4 gives very good results, sometimes Q5 is lobotomized.
When in doubt, never below Q6.
Farmadupe@reddit
I'm gonna disagree slightly with some of the other commenters and say there is no "best tradeoff quant". It depends on your hardware most of all.
So you'd pick whatever model/quant runs fast enough on your machine to be usable, and then see what tasks it's brainy enough to do.
For some stuff, you might have a very big model at a low quant (eg ai therapist or waifu or whatever). But if you're bulk classifying your emails then drop down to a tiny model and run at a high quant.
So imo it's completely horses for courses. You can't say "just use a 4bit quant"
Dabalam@reddit
People in these discussions often use the base model as the reference for quality, but that doesn't feel very informative about how useful quants actually are. Deviation from base is only relevant insofar as it impacts tasks.
This might be super relevant in maths and coding domains, where you would expect deviation from base to equate to worse usefulness, or in multistep reasoning where reliability and nuance are required to follow a logical chain. But it isn't obvious to me that deviation from the base model translates to meaningfully worse performance in other domains.
Even then, it's a matter of degree of deterioration, and perplexity scores don't seem to map straightforwardly onto the outcomes of actual use, even if they do suggest a trend (i.e. even if Q6 is detectably different from Q8, does the difference matter for tasks?). It isn't clear that the way a model deviates from base reliably produces the same kind of task failures across domains.
The methodology in this area is problematic on all sides.
PiaRedDragon@reddit
Already is. I work at a law firm and we use these guys (https://baa.ai/watchman.html).
They let us actually KNOW what the quality is before it goes to production. We can NOT allow ANY reduction in law capability; models already do poorly in this space, so any degradation of capability in this area is a huge red flag.
If we fine-tune a model we run this across it, and if we quantize down we specify that we want max capability on law. Works a treat.
maschayana@reddit
Blatant low effort advertisement. On top of that: Models only are as poor as your harness allows them to be.
PiaRedDragon@reddit
Ok Bob.
Some people can't be helped.
logTom@reddit
A few days ago, someone posted a diagram for Gemma 3 1B and was surprised that even Q8 already showed accuracy losses. This suggests it matters more for very dense models.
Lucis_unbra@reddit
Gemma 4 is special. I've been looking at the probabilities the model assigns vs the baseline, on text it generated, and it's uncertain to me whether the error is okay or not.
Most models are not as certain as Gemma, so the error is kinda spread out more. More chance of confabulation or swaps between options.
Gemma 4 is very certain at all times. It has a very clear idea, and its options appear to be cleaner.
It seems to me that quantization is making Gemma "snap" more. So it can register as a louder change in kld or perplexity, but is semantically not that different.
The question is if this means the model is more robust, that it is not meaningfully changing that much.
Or, if this seemingly "Innocent" change is actually very bad, and harder to detect without the right tests.
My knowledge benchmark, which tries to estimate the chance that the model knows something and how likely it is to express it, isn't done yet. The test I ran on Q8 vs Q4 Gemma (haven't tried BF16 yet) came back saying they're very similar.
Tldr: Gemma is different, and Gemma is complicated. Gemma is either quantizing like a champ, or it is silently catastrophic.
danielhanchen@reddit
For large MoEs, in general Q4 MoE weights + Q8 rest work reasonably well
For dense models, generally Q4 is the best.
Imo as models get trained on many more trillions of tokens, this will most likely shift over time - dense models might have to be Q5/Q6 in the future, since more trillions of tokens will force gradient descent to use more bits.
AutomataManifold@reddit
We're going to see per-model-component quantization, I think. Keep parts of it at full precision, tightly quantize specific layers, etc. We've seen some trends in this direction, keeping the KV cache unquantized, etc. Worth a research project, anyway.
danielhanchen@reddit
Oh haha that's partially what we do as part of Unsloth dynamic quants :))
sammcj@reddit
That's already the case with modern quantisation techniques (unless I'm misunderstanding what you're saying). Layers are quantised dynamically based on their importance / potential impact. We haven't used static quants (e.g. all INT8/INT4) in a long time.
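A toy sketch of the importance-based allocation idea being discussed (layer names, sensitivity scores, and the greedy strategy are all made up for illustration; real schemes like the dynamic quants mentioned above measure sensitivity from calibration data):

```python
def allocate_bits(sensitivity, budget_bits, choices=(2, 3, 4, 6, 8)):
    """Greedy toy allocator: start every layer at the lowest width, then
    repeatedly upgrade the most sensitive layer that can still go up,
    until the total bit budget is spent."""
    bits = {name: choices[0] for name in sensitivity}
    spent = choices[0] * len(sensitivity)
    while True:
        upgradable = [(s, name) for name, s in sensitivity.items()
                      if bits[name] < choices[-1]]
        if not upgradable:
            break
        _, name = max(upgradable)           # most sensitive layer first
        i = choices.index(bits[name])
        step = choices[i + 1] - choices[i]  # extra bits the upgrade costs
        if spent + step > budget_bits:
            break
        bits[name] = choices[i + 1]
        spent += step
    return bits

# Hypothetical per-layer sensitivity (higher = quantization hurts more).
sens = {"embed": 0.9, "attn.0": 0.6, "mlp.0": 0.2, "attn.1": 0.5, "mlp.1": 0.15}
print(allocate_bits(sens, budget_bits=24))
```

With this toy budget the sensitive layers end up at 8 bits while the MLPs stay at 2, which is the same shape of outcome you see in real mixed-precision GGUFs (embeddings and attention kept high, bulk weights low).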
a_beautiful_rhind@reddit
Below Q3 is where it starts getting obvious, especially after a decent bit of context. I have run some Q2 DeepSeek and after 10-16k it really begins to devolve.
For daily drivers I try to keep Q4+.
Lucis_unbra@reddit
depends on the task and the context length, as others have noted. And the model.
General rules. 8 bit quantization is generally fine. Some tasks might get hit more, but it's usually not going to be a problem.
6 bit quantization is also usually fine, some tasks take a bigger hit, some facts might confabulate more and long context might degrade more.
5 bit is usually where the standard measurements stop looking fine. Below 5 bits the personality is more likely to change a bit, and the model makes more and more errors.
4 bits is usually the limit. Below this it gets worse, very very fast.
Mind you, benchmarks will often say all is good, but they focus on tasks that tend to be more consistent. They don't catch as easily the inaccuracies and weird choices the model makes that might frustrate the user. The model needs more guidance and help when quantized. It's not any dumber, at least not until it genuinely collapses and forgets.
So I'd say a good Q4 is usually fine, but it's not ideal. The model will seem a bit rougher, not as polished and it will not feel as capable as it should be. At Q5 the model usually appears to be capable of more complex tasks, it is more attentive. Q6 is usually very close to Q8. At that level it shows up in specific tasks, it might make more errors but otherwise feels very sharp. The main difference here might be things like confabulation on certain facts, unfortunate choices in code, a few more glitches, worse long context vs the native precision variant.
The lower you go, the more it appears to not be working quite right, while still being just as smart otherwise.
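A rough way to see why each bit below ~5 matters so much: plain round-to-nearest quantization on synthetic Gaussian weights (illustrative only; real K-quants use per-group scales and do much better than this), where dropping a bit roughly halves the number of levels and so roughly doubles the error.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, 1_000_000)  # roughly LLM-scale weight values

def rtn_rel_error(w, bits):
    """Relative RMS error of symmetric round-to-nearest quantization."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    q = np.round(w / scale) * scale
    return float(np.sqrt(np.mean((w - q) ** 2)) / w.std())

for bits in (8, 6, 5, 4, 3):
    print(f"{bits}-bit relative RMS error: {rtn_rel_error(w, bits):.3f}")
```

The error curve is geometric, not linear, which lines up with the experience above: Q6 sits close to Q8, Q5 is where degradation starts to register, and each step below that is a noticeably bigger jump than the last.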
ketosoy@reddit
Q4_k_m to Q3_k_m is what I’ve come to refer to as the “lobotomy line” for qwen3.5-35b measured using a subset of lm-eval.
The decrease from 8 to 4 is fairly gradual, below q4 it’s a cliff.
PromptInjection_@reddit
Q8_0 if you are memory-rich.
Q6_K is a very good compromise; in everyday tasks (for me) it's like lossless.
Q4_K_M can degrade a little, but most users won't even notice the difference outside of very complex tasks.
Q3 and below show the first signs of degradation that people can mostly feel.
We wrote an article about this and compared different quants:
https://www.promptinjection.net/p/ai-llm-the-quantization-cliff-when-does-compression-break-code?utm_source=publication-search
kaeptnphlop@reddit
Predictable, but still an interesting experiment to see where things break down. Thanks for the write up!