Let’s talk quants of Gemma and Qwen - 16 vs Q8 vs Q4 - any experiences?
Posted by Borkato@reddit | LocalLLaMA | 36 comments
Some people say they’d never go under Q8, and others say they find Q3 acceptable! What’s your take?
nickm_27@reddit
It really depends what you use it for. Personally I use 26B-A4B at Q4_K_XL (f16 cache) for voice assistant, chat, and some light coding (basic scripts), and I've had no issues. It has run very reliably since release.
alex20_202020@reddit
That is what I do not understand. If the weights are Q4, why use so much memory on an f16 cache? How can calculations with Q4 weights ever give useful f16 precision?
Middle_Bullfrog_6173@reddit
All the actual computations are higher precision. Q4 quants have (some) weights in 4-bit format but they use block-shared scale and offset that are (eventually) bf16. Every single KV value is the sum and product of many previous model weights so there is certainly scope for them to span the full 16-bit range.
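To make the "block-shared scale and offset" idea concrete, here is a minimal sketch of 4-bit block quantization. It is a simplified illustration, not the actual GGUF layout: the block size of 32, the function names, and the sample weights are assumptions for the example, and real K-quants add super-blocks and quantized scales on top of this.

```python
import math

BLOCK_SIZE = 32  # assumed block length for this sketch


def quantize_block(weights, bits=4):
    """Map one block of floats to small integer codes plus a shared scale and offset."""
    levels = (1 << bits) - 1                      # 15 steps for 4-bit codes
    offset = min(weights)                         # shared offset (block minimum)
    scale = (max(weights) - offset) / levels or 1.0  # shared step size
    codes = [round((w - offset) / scale) for w in weights]
    return codes, scale, offset


def dequantize_block(codes, scale, offset):
    """Reconstruct ordinary floats; all later math runs on these, not on the codes."""
    return [c * scale + offset for c in codes]


# Hypothetical weights, just to have something to quantize.
weights = [0.1 * math.sin(i) for i in range(BLOCK_SIZE)]
codes, scale, offset = quantize_block(weights)
restored = dequantize_block(codes, scale, offset)
worst = max(abs(w - r) for w, r in zip(weights, restored))
print(f"codes in [{min(codes)}, {max(codes)}], worst error {worst:.5f}, scale {scale:.5f}")
```

Only the codes are 4-bit. The scale and offset are stored at higher precision, and the reconstructed values are regular floats, so the activations computed from them are too.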
MironV@reddit
Q4 means the neuron weights are, at a basic level, mapped to a 4-bit (16 level) scale. These still produce normal floating point activations. KV cache quantizes the activations themselves, which is why quantizing K cache can be more damaging since it accumulates across the sequence and through softmax, changing where the attention goes.
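A toy sketch of that effect: round-trip the keys through a low-bit format and see how far the softmax attention weights move. The symmetric absmax scheme, the function names, and the tiny query/key vectors are all made up for illustration. This is not how any particular runtime stores its K cache.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fake_quantize(vec, bits):
    """Symmetric absmax round-trip: quantize to signed ints, then dequantize."""
    levels = (1 << (bits - 1)) - 1
    scale = (max(abs(v) for v in vec) / levels) or 1.0
    return [round(v / scale) * scale for v in vec]


def attention_weights(query, keys):
    """Scaled dot-product scores followed by softmax, for a single query."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    return softmax(scores)


def max_shift(query, keys, bits):
    """Largest change in any attention weight after quantizing the keys."""
    exact = attention_weights(query, keys)
    approx = attention_weights(query, [fake_quantize(k, bits) for k in keys])
    return max(abs(a - b) for a, b in zip(exact, approx))


# Hypothetical 4-dimensional query and three cached keys.
QUERY = [1.0, -0.5, 0.25, 2.0]
KEYS = [[0.9, 0.1, -0.4, 1.1], [-0.3, 0.8, 0.5, 0.2], [1.2, -0.7, 0.3, 0.5]]

for bits in (16, 8, 4):
    print(bits, max_shift(QUERY, KEYS, bits))
```

In a real model the same kind of shift happens at every layer and every cached position, which is where the accumulation over long contexts comes from.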
ttkciar@reddit
Unfortunately Gemma 4 is disproportionately impacted by K and V cache quantization, and I don't understand why either.
Quantizing Gemma 4's cache to Q8_0 isn't too bad, but there is some noticeable slight loss of competence. Meanwhile, quantizing Qwen3.5 or Mistral Medium 3.5 K and V cache to Q8_0 exhibits no noticeable quality loss at all.
There seems to be no rhyme or reason to it.
nickm_27@reddit
I’m not sure what you mean, it’s a compound effect and they’re entirely separate.
Regardless of what quant is used for the model, if you quantize the KV cache it makes KLD worse (higher) on top of that.
reto-wyss@reddit
8-bit is as low as I'd go, but BF16 is more solid for the 30b class. I tried the NVFP4 of gemma 26b-a4b and I shelved it in favor of E4B BF16. The 31b NVFP4 is fine, but it's really 8-bit on average.
FP8 works fine enough for the 122b, and Q4 was good for the 397b, but Q3 was utter garbage. I wouldn't run it over a 30b at 8-bit-ish.
Borkato@reddit (OP)
How does FP8 work vs Q8 GGUFs? I’m getting 48GB vram total soon and I’ve never used a real FP model or whatever
kevin_1994@reddit
Anecdotal, but I found a massive difference between q8 and even q6xl for qwen3.6 27b when it came to tool call consistency. At q6xl (or lower) it would fail tool calls around 60k-100k context almost every time in open code. At q8 it's more like 25% of chats, and usually only at 100k+ context.
I tried qwen 3.6 35ba3b bf16 and it's about as bad as q8. Runs into loops often either way.
suprjami@reddit
All KLD I've seen shows almost no difference between Q6 and Q8. So Q6 is my limit.
Look at test results from Unsloth and Ooba.
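For anyone unsure what the KLD number measures: it is the KL divergence between the full-precision model's next-token distribution and the quantized model's, usually averaged over many token positions. A minimal sketch for a single position follows. The logits are invented for illustration, and the function names are not from any specific tool.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(P || Q) in nats; P is the reference (full-precision) distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits over a 4-token vocabulary at one position.
reference_logits = [2.0, 1.0, 0.1, -1.0]   # e.g. the 16-bit model
quantized_logits = [1.9, 1.1, 0.0, -0.8]   # e.g. a quantized model

p = softmax(reference_logits)
q = softmax(quantized_logits)
print(f"KLD at this position: {kl_divergence(p, q):.6f}")
```

Zero means the quant predicts exactly what the reference does. A small average can still hide a few positions where the top token flips, which may be why long tool-calling sessions feel worse than the charts suggest.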
Icy_Butterscotch6661@reddit
Everyone says this but the unsloth diagram shared around doesn't show q8 at all? Only up to q6. I could be blind
nickm_27@reddit
https://unsloth.ai/docs/models/gemma-4#benchmarks it's there in the image
Borkato@reddit (OP)
There’s no Q8 in the qwen image tho
Icy_Butterscotch6661@reddit
Oh I haven't seen this one, only the Qwen 3.6 one: https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FHq98A18pHA2ePwlInrFG%252Fqwen36_mean_q6k_corrected_arrow_pareto_fixed.png%3Falt%3Dmedia%26token%3Da5190c8a-4d04-4d4d-be94-dd15214e6687&width=768&dpr=2&quality=100&sign=d0938e2f&sv=2
jikilan_@reddit
Original Blu-ray vs Netflix. If you know, you know.
ohhi23021@reddit
This. You can only know by testing yourself. A bench to see if the "movie" is the same isn't the same thing as resolution/compression...
Powerful_Evening5495@reddit
It's difficult to establish a universal guideline due to the multitude of variables involved.
Zc5Gwu@reddit
This. Larger models are generally more tolerant of quantization. Kimi might be usable at Q3 whereas a 4b model might only be usable at Q8.
Some models are also more sensitive to quantization than others. Dense models are more tolerant than MoE.
It also depends on the task. Creative writing might be more tolerant than coding because a more aggressive quant behaves like a higher temperature.
Far-Low-4705@reddit
No, 4b is absolutely usable at Q4.
I really just think it comes down to going with the largest model you can run while quantized.
A 20b model at Q4 will beat a 10b at Q8
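The reason that comparison is fair on memory: the two end up roughly the same size. A back-of-the-envelope sketch, using the bits per weight of the basic llama.cpp block formats (Q4_0 packs 32 4-bit weights plus a 16-bit scale, so 4.5 bits per weight; Q8_0 packs 32 8-bit weights plus a 16-bit scale, so 8.5). K-quant variants land at somewhat different averages, and the function name is just for this example.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB (10^9 bytes); ignores KV cache and runtime overhead."""
    return params_billion * bits_per_weight / 8


Q4_0_BPW = (32 * 4 + 16) / 32   # 4.5 bits per weight
Q8_0_BPW = (32 * 8 + 16) / 32   # 8.5 bits per weight

print(f"20b at Q4_0: {weight_gb(20, Q4_0_BPW):.2f} GB")
print(f"10b at Q8_0: {weight_gb(10, Q8_0_BPW):.2f} GB")
```

This only shows the footprint is comparable. Whether the bigger model actually wins at that size is the quality claim, and that still has to be tested.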
cleversmoke@reddit
Qwen3.6-27B Q4_K_M is the best I can do at the moment with q8_0 KV cache and 128k context. It has a hit rate of 95%+ on coding tasks as long as I keep the request within context range, 1-2 compactions are usually ok. Q5_K_S had phenomenal results, but MTP cuts into my vram too much.
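A rough sketch of why the cache type matters so much at 128k. The layer count, KV head count, and head dimension below are placeholder values, not the real Qwen3.6-27B config, and the formula assumes every layer keeps full-length K and V (models with sliding-window or linear-attention layers need less).

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_elem):
    """Approximate KV cache size in GB (10^9 bytes) for one sequence."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = one K and one V per position
    return elems * bits_per_elem / 8 / 1e9


# Hypothetical model shape, for illustration only.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 8, 128
N_CTX = 131072  # 128k tokens

f16_gb = kv_cache_gb(N_LAYERS, N_KV_HEADS, HEAD_DIM, N_CTX, 16)
q8_gb = kv_cache_gb(N_LAYERS, N_KV_HEADS, HEAD_DIM, N_CTX, 8.5)  # q8_0: 8 bits + shared scale
print(f"f16 cache: {f16_gb:.2f} GB, q8_0 cache: {q8_gb:.2f} GB")
```

With these placeholder numbers the q8_0 cache is a bit over half the size of the f16 one, which is the difference between fitting 128k and not on a single consumer card.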
Shinkai_I@reddit
Test based on your actual workflow and needs.
pmttyji@reddit
I refuse to go below Q4 even though I only have 8GB VRAM. IQ4_XS is my favorite quant; it's the smallest of the Q4s.
Mameiro@reddit
For me Q8 is the “don’t think about it” default.
Q4 is fine when I need speed or want to fit a bigger model into VRAM, but I wouldn’t use it for serious reasoning, long-context work, or anything where small errors matter.
Q3 can look okay in casual prompts, but it usually falls apart in edge cases, formatting, tool use, and multi-step tasks. I’d only use it for testing or very low-stakes stuff.
Gesha24@reddit
It all depends on your use case. If all you care about is nice fancy numbers in p/s and t/s, even Q1 would do.
BeautyxArt@reddit
Does Q1 still know 1+1=2, or does it have another answer? ...Yeah, after thinking.
BeautyxArt@reddit
Q8 loses a bit compared to 16. Imagine how much more you lose going lower: not a bit, but a lot.
Juan_Valadez@reddit
F16 highest quality, Q8 medium quality, Q4 low quality, less than Q4 lowest quality
ea_man@reddit
You are trying to compare different brands of models, and each brand has dense and MoE versions in multiple sizes...
Some scale better than others.
Advanced-Picture5016@reddit
> Some people say they’d never go under Q8, and others say they find Q3 acceptable!
and some... have no choice. sadge
Endurance_Beast@reddit
Q6 here
hidden2u@reddit
Miriel_z@reddit
Q4_K_M is one of the recommended quantizations, good tradeoff of quality vs size.
ttkciar@reddit
Q4_K_M is pretty awesome for Gemma-4-31B-it, even for codegen. No complaints here!
Gesha24@reddit
I have been running a Q6, and at context above 100K it starts to degrade severely, to the point where it would say "oh, I see the bug in the code, let me rewrite this function" and then write the new one and never delete the old one...
sophlogimo@reddit
Question is, what do you want to do? If you want complex tasks, every bit of precision helps, but for writing small scripts and doing search tasks, Q3 will probably suffice.
BitGreen1270@reddit
I think it's a matter of capacity at this point. With 32gb ram (integrated), my options are Q3 at best.