Let’s talk quants of Gemma and Qwen - 16 vs Q8 vs Q4 - any experiences?

[-]

nickm_27@reddit

It really depends on what you use it for. Personally I use 26B-A4B at Q4_K_XL (f16 cache) for voice assistant, chat, and some light coding (basic scripts), and it has run very reliably since release. I've had no issues.

[-]

alex20_202020@reddit

> Q4_K_XL (f16 cache)

That is what I do not understand. If weights are Q4, why use so much memory on f16 cache? How can calculations with Q4s ever give useful f16 precision?

[-]

Middle_Bullfrog_6173@reddit

All the actual computations are higher precision. Q4 quants have (some) weights in 4-bit format but they use block-shared scale and offset that are (eventually) bf16. Every single KV value is the sum and product of many previous model weights so there is certainly scope for them to span the full 16-bit range.
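
To make the "block-shared scale and offset" point concrete, here is a minimal sketch in plain Python. It is a simplified illustration, not the actual GGUF layout: the block size, the function names, and the sample values are all assumptions, and real Q4 formats are more elaborate. The point is only that 4-bit codes dequantize back to ordinary floats before any math happens.

```python
def quantize_block(values, bits=4):
    """Map one block of floats to integer codes sharing a single scale and offset."""
    levels = (1 << bits) - 1              # 15 steps for 4-bit, so codes run 0..15
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_block(codes, scale, offset):
    """Recover approximate floats; this is what the matmul actually consumes."""
    return [c * scale + offset for c in codes]


# Made-up weights for illustration only.
block = [0.12, -0.53, 0.98, 0.07, -0.91, 0.44, 0.30, -0.26]
codes, scale, offset = quantize_block(block)
restored = dequantize_block(codes, scale, offset)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(codes, scale, offset, max_err)
```

Each restored weight is off by at most half a quantization step, and everything downstream (activations, K and V values) is computed from those restored floats.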

[-]

MironV@reddit

Q4 means the neuron weights are, at a basic level, mapped to a 4-bit (16 level) scale. These still produce normal floating point activations. KV cache quantizes the activations themselves, which is why quantizing K cache can be more damaging since it accumulates across the sequence and through softmax, changing where the attention goes.

[-]

ttkciar@reddit

Unfortunately Gemma 4 is disproportionately impacted by K and V cache quantization, and I don't understand why either.

Quantizing Gemma 4's cache to Q8_0 isn't too bad, but there is a slight, noticeable loss of competence. Meanwhile, quantizing Qwen3.5 or Mistral Medium 3.5 K and V cache to Q8_0 exhibits no noticeable quality loss at all.

There seems to be no rhyme or reason to it.

[-]

nickm_27@reddit

I’m not sure what you mean, it’s a compound effect and they’re entirely separate.

Regardless of what quant is used for the model, if you quantize the KV cache it increases KLD further (that is, the output drifts further from the unquantized baseline).
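
For anyone unfamiliar with KLD: it is the KL divergence between the baseline model's next-token distribution and the quantized model's, so a larger value means the output has drifted further from the baseline. A rough sketch of the calculation follows. The logits are made-up numbers chosen for illustration, not measurements from any model.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q): how far distribution q is from the reference p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits over a 4-token vocabulary (illustrative only).
baseline_logits = [2.0, 1.0, 0.1, -1.0]      # unquantized reference
weights_only = [1.9, 1.1, 0.1, -0.9]         # quantized weights, f16 cache
weights_and_cache = [1.7, 1.3, 0.2, -0.8]    # quantized weights and cache

p = softmax(baseline_logits)
kld_weights = kl_divergence(p, softmax(weights_only))
kld_both = kl_divergence(p, softmax(weights_and_cache))
print(kld_weights, kld_both)
```

In practice this is averaged over many token positions in a test corpus rather than computed on a single distribution.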

[-]

reto-wyss@reddit

8-bit is as low as I'd go, but BF16 is more solid for the 30b class. I tried the NVFP4 of gemma 26b-a4b and I shelved it in favor of E4B BF16. The 31b NVFP4 is fine, but it's really 8-bit on average.

FP8 works fine enough for the 122b, and Q4 was good for the 397b, but Q3 was utter garbage; I wouldn't run it over a 30b at 8-bit-ish.

[-]

Borkato@reddit (OP)

How does FP8 work vs Q8 GGUFs? I’m getting 48GB vram total soon and I’ve never used a real FP model or whatever

[-]

kevin_1994@reddit

Anecdotal, but I found a massive difference between q8 and even q6xl for qwen3.6 27b when it came to tool call consistency. At q6xl (or lower) it would fail tool calls around 60k-100k context almost every time in open code. At q8 it's more like 25% of chats, and usually only at 100k+ context.

I tried qwen 3.6 35ba3b bf16 and it's about as bad as q8. It runs into loops often either way.

[-]

suprjami@reddit

All KLD I've seen shows almost no difference between Q6 and Q8. So Q6 is my limit.

Look at test results from Unsloth and Ooba.

[-]

Icy_Butterscotch6661@reddit

Everyone says this but the unsloth diagram shared around doesn't show q8 at all? Only up to q6. I could be blind

[-]

nickm_27@reddit

https://unsloth.ai/docs/models/gemma-4#benchmarks it's there in the image

[-]

Borkato@reddit (OP)

There’s no Q8 in the qwen image tho

[-]

Icy_Butterscotch6661@reddit

Oh I haven't seen this one, only the Qwen 3.6 one: https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FHq98A18pHA2ePwlInrFG%252Fqwen36_mean_q6k_corrected_arrow_pareto_fixed.png%3Falt%3Dmedia%26token%3Da5190c8a-4d04-4d4d-be94-dd15214e6687&width=768&dpr=2&quality=100&sign=d0938e2f&sv=2

[-]

jikilan_@reddit

Original Blu-ray vs Netflix. If you know, you know.

[-]

ohhi23021@reddit

This. You can only know by testing yourself; a bench that checks whether the "movie" is the same isn't measuring the same thing as resolution/compression.

[-]

Powerful_Evening5495@reddit

It's difficult to establish a universal guideline due to the multitude of variables involved.

[-]

Zc5Gwu@reddit

This. Larger models are generally more tolerant of quantization. Kimi might be usable at Q3, whereas a 4b model might only be usable at Q8.

Some models are also more sensitive to quantization than others. Dense models are more tolerant than MoE.

It also depends on the task. Creative writing might be more tolerant than coding because a more aggressive quant behaves like a higher temperature.

[-]

Far-Low-4705@reddit

No, 4b is absolutely usable at Q4.

I really just think it comes down to going with the largest model you can run while quantized.

A 20b model at Q4 will beat a 10b at Q8
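
The reason this comparison comes up is that the two options cost about the same memory. A back-of-the-envelope sketch, assuming nominal bit widths (real quant formats carry extra bits per weight for scales, so actual files are somewhat larger) and a hypothetical helper name:

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), ignoring format overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


size_20b_q4 = weight_gb(20, 4)   # 20b model at a nominal 4 bits per weight
size_10b_q8 = weight_gb(10, 8)   # 10b model at a nominal 8 bits per weight
print(size_20b_q4, size_10b_q8)
```

Both come out to roughly the same footprint, so the question is purely which one gives better output for that budget. This says nothing about quality by itself; it only shows why the trade is a fair one to test.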

[-]

cleversmoke@reddit

Qwen3.6-27B Q4_K_M is the best I can do at the moment with q8_0 KV cache and 128k context. It has a hit rate of 95%+ on coding tasks as long as I keep the request within context range, 1-2 compactions are usually ok. Q5_K_S had phenomenal results, but MTP cuts into my vram too much.
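
To see why the cache type matters so much at 128k context, here is a rough KV cache size estimate. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real configuration of any model mentioned in this thread, and the formula assumes plain full attention in every layer (sliding-window or hybrid layers would need less).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    """Bytes for K and V across all layers (the leading 2 covers K plus V)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32   # 32 int8 values plus one 16-bit scale per block

# Hypothetical model shape, chosen only to make the arithmetic concrete.
HYPO = dict(n_layers=48, n_kv_heads=8, head_dim=128)
CONTEXT = 128 * 1024

f16_gib = kv_cache_bytes(**HYPO, context_len=CONTEXT, bytes_per_elem=F16_BYTES) / 2**30
q8_gib = kv_cache_bytes(**HYPO, context_len=CONTEXT, bytes_per_elem=Q8_0_BYTES) / 2**30
print(f16_gib, q8_gib)
```

Under these assumptions the q8_0 cache takes a bit more than half the memory of the f16 one, which is the kind of saving that decides whether a larger weight quant fits.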

[-]

Shinkai_I@reddit

Test based on your actual workflow and needs.

[-]

pmttyji@reddit

I refuse to go below Q4 even though I have only 8GB VRAM. IQ4_XS is my favorite quant, which is the smallest Q4.

[-]

Mameiro@reddit

For me Q8 is the “don’t think about it” default.

Q4 is fine when I need speed or want to fit a bigger model into VRAM, but I wouldn’t use it for serious reasoning, long-context work, or anything where small errors matter.

Q3 can look okay in casual prompts, but it usually falls apart in edge cases, formatting, tool use, and multi-step tasks. I’d only use it for testing or very low-stakes stuff.

[-]

Gesha24@reddit

It all depends on your use case. If all you care about is nice fancy numbers in p/s and t/s, even Q1 would do.

[-]

BeautyxArt@reddit

Does Q1 still know 1+1=2? Or does it have another answer? ...Yeah, after thinking.

[-]

BeautyxArt@reddit

Q8 loses a bit compared to 16-bit. Go lower and the loss is not a bit, but much more.

[-]

Juan_Valadez@reddit

F16 highest quality, Q8 medium quality, Q4 low quality, less than Q4 lowest quality

[-]

ea_man@reddit

You are trying to compare different brands of models, and each brand has dense and MoE versions in multiple sizes...

Some scale better than others.

[-]

Advanced-Picture5016@reddit

> Some people say they’d never go under Q8, and others say they find Q3 acceptable!

and some... have no choice. sadge

[-]