at what point does quantization stop being a tradeoff and start being actual quality loss

[-]

substandard-tech@reddit

Isn’t a loss in quality the actual tradeoff? This question is nuts

[-]

Lesser-than@reddit

I notice an extreme drop in quality when the quant doesnt fit in my hardware /s .

[-]

the problem is the domain over which you’re quantizing is not actually the domain over which the model weights are distributed, there’s effectively a non-linear transformation between the weights and their projections, ie the error incurred by quantization is not bounded, in practice that’s somewhat mitigated by a layer normalization, but some models tend to have their weight distribution more spread out than others, which leads to some models quantizing fine down at q4, while others feel lobotomized, other than using some kind of attention aware quantization / similar process that actually compares outputs the success or failure of a quantization is down to luck of how well the training process has lead to the ideal of weights compressed to nearly orthogonal subspaces where you would only really need one bit per dimension to give you a unit vector.

[-]

Gringe8@reddit

I see noticeable quality loss under q4km

[-]

MaxKruse96@reddit

Anything below the original is actual quality loss. Unless you magically figure out how to perfectly compress 16 bits into less than 16 bits, there always will be.

[-]

Pleasant-Shallot-707@reddit

Witness the spawning the MP3 quality debates for this generation.

[-]

Borkato@reddit

I once bought $1000 speakers because people convinced me that it was soooooooooooo much better. Returned them two days later after I realized I literally didn’t give a fuck lol

[-]

lmamakos@reddit

If only you had oxygen-free speaker wires..... 😀

[-]

Nyghtbynger@reddit

That's so funny lol. I have poor music taste, but good palate. As such I favour good restaurants but my audio is aliexpress level

[-]

Nyghtbynger@reddit

The mp3 quality debate is defined by the output device quality, not the size of the brain of the person arguing. That's why it hasn't been solved as a problem.

[-]

MaxKruse96@reddit

if only people would try to give MP3s access to their life instead of LLMs at Q3 with openclaw 😭

[-]

No_easy_money@reddit

In audio, there is lossless compression with FLAC. I wonder if there could be an equivalent for LLMs?

[-]

segmond@reddit

There is no magic formula, it varies from model. But the smaller the model, the more likely the loss quality. The loss quality depends on what you are trying to do, for chat, writing, etc. It might not matter much. For precise work like maths, or low level programming think C/C++/Rust vs javascript it matters much. For precise vision work such as OCR, count objects vs describe the object it matters.

Say it after me kids.

Quality of Tokens beats Quantity of Tokens.

[-]

SexyAlienHotTubWater@reddit

You start to see perplexity meaningfully increase at around 5 bits. 4 bits is worse. 6 is about the limit before you start seeing quality loss.

[-]

fulgencio_batista@reddit

How does NVFP4 compare?

[-]

rm-rf-rm@reddit

has anyone compared perplexity/KL divergence metrics to benchmark scores? (and benchmark scores themselves are removed from real world performance - so I have no idea how these technical metrics relate to real world performance)

[-]

Septerium@reddit

There is an Youtube channel called "x-create" (or something) where the guy tests new models at different quant levels. He basically uses one-shot prompts for creating complex applications. This is not the way I like to use models for coding (I think we should always break projects into small tasks), but in this type of testing there is almost a noticeable difference of quality from q9 to q6... and the thing gets even worse at q4. I think complex one-shot prompts and long context are the situations where compressed models break hard.

[-]

Adventurous-Paper566@reddit

C'est très dependant di modèle, parfois Q4 donne de très bons résultats, parfois Q5 est lobotomisé.

Dans le doute, jamais en dessous de Q6.

[-]

Farmadupe@reddit

I'm gonna disagree slightly with some of the other commentors and say there there is no "best tradeoff quant". It completely depends on your hardware most of all.

So you'd pick whatever model/quant runs fast enough on your machine to be usable, and then see what tasks it's brainy enough to do.

For some stuff, you might have a very big model at a low quant (eg ai therapist or waifu or whatever). But if you're bulk classifying your emails then drop down to a tiny model and run at a high quant.

So imo it's completely horses for courses. You can't say "just use a 4bit quant"

[-]

Dabalam@reddit

People in these discussions are often using the base models the reference for quality but it feels not very informative to how useful quants actually are. Deviation from base is only relevant in so far as it impacts tasks.

This might be super relevant in Maths and Coding domains where you would expect deviation from base to equate to worse usefulness, or multistep reasoning where reliability or nuance is required to follow a logically train, but it isn't obvious to me that deviation from the base model translates to meaningfully worse performance in other domains.

Even then, it's a matter of degree of deterioration and perplexity scores don't seem to straightforwardly map on to the outcomes of actual use even if they do suggest a trend (i.e. even if Q6 is detectably different from Q8 does the difference matter for tasks). It isn't clear that the way a model deviates from base reliably produces the same kind of task failures across domains.

The methodology in this area is problematic on all sides:

Perplexity is a proxy but doesn't tell you how much the model deteriorates in the domain of interest with quantization
Benchmarks have over-fitting issues
Assessing the model in your own use case is inevitably plagued with confirmation bias and small sample size. Our assessment of differences are prone to error and bias, so at small samples the answer we get is often different than when we use the model for longer (the phenomena where people believe models have "deteriorated over time" might point to this).

[-]

PiaRedDragon@reddit

Already is, I work at a law firm and we use theses guys (https://baa.ai/watchman.html).

They allow us to actually KNOW what the quality is before it goes to production, we can now allow ANY reduction in law capability, models always do poor in this space already, so any degrading of capability in this topic is a huge red flag.

If we fine tune a model we run this across it, if we quantize down we specify we want max capability on Law. Works a treat.

[-]

maschayana@reddit

Blatant low effort advertisement. On top of that: Models only are as poor as your harness allows them to be.

[-]

PiaRedDragon@reddit

Ok Bob.

Some people can't be helped.

[-]

logTom@reddit

A few days ago, someone posted a diagram for Gemma 31B and was surprised that even Q8 already showed accuracy losses. This suggests it matters more for very dense models.

[-]

Lucis_unbra@reddit

Gemma 4 is special. I am looking at the probabilities the models have vs the baseline, on text it generated.

Gemma 4 is special, and it's uncertain to me if the error is ok or not.

Most models are not as certain as Gemma, so the error is kinda spread out more. More chance of confabulation or swaps between options.

Gemma 4 is very certain at all times. It has a very clear idea, and its options appear to be cleaner.

It seems to me that quantization is making Gemma "snap" more. So it can register as a louder change in kld or perplexity, but is semantically not that different.

The question is if this means the model is more robust, that it is not meaningfully changing that much.

Or, if this seemingly "Innocent" change is actually very bad, and harder to detect without the right tests.

My knowledge benchmark that tries to see what the chance is that the model knows something, and how likely it is to express it is not done yet. The test I ran on Q8 vs Q4 Gemma (didn't try BF16 yet) came up saying it's very similar.

Tldr: Gemma is different, and Gemma is complicated. Gemma is either quantizing like a champ, or it is silently catastrophic.

[-]

danielhanchen@reddit

For large MoEs, in general Q4 MoE weights + Q8 rest work reasonably well

For dense models, generally Q4 is the best.

Imo as models get trained with many more trillions of tokens, this will most likely shift over time - dense models might have to be Q5/Q6 in the future since more trillions of tokens will force gradient descent to use more bits.

[-]

AutomataManifold@reddit

We're going to see per-model-component quantization, I think. Keep parts of it at full precision, tightly quantize specific layers, etc. We've seen some trends in this direction, keeping the KV cache unquantized, etc. Worth a research project, anyway.

[-]

danielhanchen@reddit

Oh haha that's partially what we do as part of Unsloth dynamic quants :))

[-]

sammcj@reddit

That's already the case with modern quantisation techniques (unless I'm misunderstanding what you're saying). Layers are quantised dynamically based on their importance / potential impact. We haven't used static quants (e.g. all INT8/INT4) in a long time.

[-]

a_beautiful_rhind@reddit

Below Q3 is where it starts getting obvious, especially after a decent bit of context. I have ran some Q2 deepseek and after 10-16k it really begins to devolve.

For daily drivers I try to keep Q4+.

[-]

Lucis_unbra@reddit

depends on the task and the context length, as others have noted. And the model.

General rules. 8 bit quantization is generally fine. Some tasks might get hit more, but it's usually not going to be a problem.

6 bit quantization is also usually fine, some tasks take a bigger hit, some facts might confabulate more and long context might degrade more.

5 bit is usually where the usual measurements stop appearing fine. Below 5 bits the personality is more likely to change a bit, the model is making more and more errors.

4 bits is usually the limit. Below this it gets worse, very very fast.

Mind you, benchmarks will often say all is good, but they tend to focus on tasks that tend to be more consistent. They don't notice as easily the inaccuracies and weird choices the model makes that night frustrate the user. The model needs more guidance and help when quantized. It's not any dumber, at least not until it genuinely collapses and forgets.

So I'd say a good Q4 is usually fine, but it's not ideal. The model will seem a bit rougher, not as polished and it will not feel as capable as it should be. At Q5 the model usually appears to be capable of more complex tasks, it is more attentive. Q6 is usually very close to Q8. At that level it shows up in specific tasks, it might make more errors but otherwise feels very sharp. The main difference here might be things like confabulation on certain facts, unfortunate choices in code, a few more glitches, worse long context vs the native precision variant.

The lower you go, the more it appears to not be working quite right, while still being just as smart otherwise.

[-]

ketosoy@reddit

Q4_k_m to Q3_k_m is what I’ve come to refer to as the “lobotomy line” for qwen3.5-35b measured using a subset of lm-eval.

The decrease from 8 to 4 is fairly gradual, below q4 it’s a cliff.

[-]

PromptInjection_@reddit

Q8_0 if you are memory rich
Q6_K is a very good compromise, in everyday tasks (for me) it's like lossless.
Q4_K_M can degrade a little bit but most users won't even notice the difference beside very complex tasks
Q3 and below show first signs of degradation that people can mostly feel

We wrote an article about this and compared different quants:
https://www.promptinjection.net/p/ai-llm-the-quantization-cliff-when-does-compression-break-code?utm_source=publication-search

[-]

kaeptnphlop@reddit

Predictable, but still an interesting experiment to see where things break down. Thanks for the write up!