Is using Q8 a waste of resources?
Posted by Spiderboyz1@reddit | LocalLLaMA | View on Reddit | 27 comments
I can run G4 31B Q8 XL with ctx 75k and Qwen's 27B and 35B Q8 XL with ctx 145k, but I'm wondering if I'm wasting GB of SSD and VRAM.
Is it worth switching to Q6_K to save disk space and gain a little more t/s and more context? Or does intelligence deteriorate significantly (in KLD/KL-divergence terms)?
Is vision affected by using Q6?
Is Q6_K_XL much better than plain Q6_K?
tmvr@reddit
There is no practical difference between Q6_K and Q8_0, so if you need to squeeze in more context then drop down to Q6_K. As for Qwen3.6, that's even less sensitive to quantization (which only really comes into play under Q5), so Q8_K_XL is definitely overkill; even just dropping down to Q8_0 gets you about 7GB of VRAM back and a considerable decode/tg speed improvement.
There are people swearing that "anything under Q8 is not OK" etc., but you have this in every field with everything. With any measurable metric like PPL or KLD, or even results from various benchmark runs, there is no difference between FP16, Q8_0 and even Q6_K. Yes, the values are very slightly different, but on a linear graph the line from FP16 to Q6_K is basically horizontal; you need some aggressive log scale to show any difference. It's like saying Usain Bolt lost a race because a bird shat on his shoulder and the added weight slowed him down. Did the shit add some weight? Yes. Was it a deciding factor in his performance? Definitely not.
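For what it's worth, KLD itself isn't magic: it just compares the token probability distributions of the reference model and the quant, position by position. Here's a toy Python sketch of the math, with random logits standing in for real ones; actual measurements (e.g. llama.cpp's perplexity tool with its KL-divergence option) use both models' logits over the same text:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kld(ref_logits, quant_logits):
    # Mean KL divergence D(P_ref || P_quant), averaged over token positions.
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

# Toy stand-in: 1000 positions over a 32k vocab, with a tiny perturbation playing
# the role of quantization noise. Real runs use actual model logits on the same text.
rng = np.random.default_rng(0)
ref = rng.normal(size=(1000, 32000))
quant = ref + rng.normal(scale=0.05, size=ref.shape)
print(mean_kld(ref, quant))   # small value; more noise -> larger divergence
```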
PhotographerUSA@reddit
Q8_0 is far more intelligent and can do far more. If you quant it then you're dumbing it down big time.
tmvr@reddit
See OP this is exactly what I meant.
This is a meaningless statement. Nothing quantifiable, just a big claim with "far more". What does that mean exactly? Where are the 1:1 comparisons of tasks done with Q8_0 and Q6_K where Q6_K fails miserably and Q8_0 is "far more intelligent"? Nowhere, as usual.
What do you mean by this? Q8_0 is already quantized. It's not the holy grail that then gets quantized; it is itself a quantized version, and the other quants are not created from the Q8_0 version. Also, same question as above: what does "dumbing it down big time" mean here? What is the difference between Q8_0 (which you seem to think is unquantized) and Q6_K? Not some empty trust-me-bro statement, but actual results, in the style of "I did these tasks and Q8_0 aces them every time, but Q6_K fails miserably, here are the details".
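If anyone actually wants to produce those details, a minimal sketch is enough, assuming you have two local OpenAI-compatible servers up (e.g. llama-server with the Q8_0 GGUF on port 8080 and the Q6_K on 8081); the ports, tasks and model field below are placeholders:

```python
import requests

ENDPOINTS = {
    "Q8_0": "http://localhost:8080/v1/chat/completions",
    "Q6_K": "http://localhost:8081/v1/chat/completions",
}
TASKS = [
    "Summarise the plot of Hamlet in three sentences.",
    "Write a Python function that merges two sorted lists into one sorted list.",
]

for task in TASKS:
    print(f"\n=== {task}")
    for name, url in ENDPOINTS.items():
        r = requests.post(url, json={
            "model": "local",        # single-model servers typically ignore this field
            "messages": [{"role": "user", "content": task}],
            "temperature": 0,        # greedy decoding, so differences come from the quant
        }, timeout=300)
        answer = r.json()["choices"][0]["message"]["content"]
        print(f"--- {name}\n{answer[:400]}")
```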
stoppableDissolution@reddit
Well, the problem is that you can go all the way down to q1 with that logic. Q8 is almost the same as q6_xl, q6_xl is almost the same as q6, q6 is almost the same as q5_xl...
I doubt there is a task that will show a fall-off-the-cliff difference between two adjacent quants, so the question of when a few stones become a pile lies with the end user and can't be properly quantified.
brown2green@reddit
Short-context benchmarks on common knowledge with greedy decoding won't show significant differences between one quantization and another; that is the main reason the idea that anything above 6-bit is lossless persists.
Due_Net_3342@reddit
If you can find benchmarks showing the accuracy of those quants is mostly the same as the original, then yes. But never assume all model quants behave the same, even within the same family.
a_beautiful_rhind@reddit
Quantizing the mmproj or vision tower below Q8 is not a good idea.
audioen@reddit
It is very difficult to say for certain. I am using FP8 and Q6_K right now, mostly because Q6_K is slightly faster than Q8_0 and shouldn't be any worse. For instance, here is Unsloth showing results for the 35b-a3b: https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs which suggests that mean K-L divergence is less than 0.01 from 5 bits onwards, and my guess is that a dense model is less sensitive to quantization than a 3b-active MoE.
Very roughly, it seems that each additional bit in the weights halves the mean K-L divergence, until we reach 6 bits and the improvement seems to stop, even though we are not yet near zero. If we extrapolate the early part of the graph, e.g. from 2 to 4 bits, we can see the rate of improvement slowing down: 5 bits improves less than that linear trend would suggest, 6 bits only very slowly, and assuming the trend continues, Q8_0 is much bigger but again only very slightly more faithful to the original.
It also behooves us to remember the implication of a logarithmic y-axis: the original model's point would sit at negative infinity on this scale, and if bf16 is used the divergence is zero, since that is exactly what you would get back. However, even the slight and random perturbations in the model weights due to quantization seem to cause enough error that K-L divergence can't get much better than somewhere between 0.001 and 0.002. I personally do not think these differences are very significant at 6 bits and beyond, and in fact task performance is usually reasonable even down to 4 bits, although with a 4-bit model I can see that the thinking output has become more confused, the model no longer seems able to accurately tell its own output apart from the user's commands, and it begins to make more tool-call errors, restates paths incorrectly, etc.
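The back-of-the-envelope version of that trend, with made-up numbers purely for illustration (roughly a halving per extra bit, flattening into a floor around 0.0015):

```python
# Illustrative only: not measured values, just the shape described above.
floor = 0.0015              # approximate plateau from quantization noise
kld_at_2bit = 0.05          # made-up starting point
for bits in range(2, 9):
    kld = max(kld_at_2bit * 2 ** -(bits - 2), floor)
    print(f"{bits}-bit: ~{kld:.4f}")
# Halving per bit drops below 0.01 around 5 bits and hits the floor near 7-8 bits,
# which is why Q8_0 buys very little extra fidelity over Q6_K.
```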
FoxiPanda@reddit
I think we can see some of this in the data that /u/danielhanchen @ Unsloth kindly produces. Check out some of his submitted KLD graphs and you can make up your own mind here:
https://old.reddit.com/user/danielhanchen/submitted/
My opinion? It varies model to model but Q6 is generally very good
Icy_Butterscotch6661@reddit
It doesn't show q8
Non-Technical@reddit
I've read that Q6 is nearly indistinguishable from Q8, so I use Q6 as my baseline.
Kahvana@reddit
Aye, Q8 is mostly important for non-Latin use or when you don't want imatrix quants.
ProfessionalSpend589@reddit
Seems about right.
I started playing with Mistral Small 4 119B in UD-Q5_K_XL, asking questions in my language, and some words were replaced by English-looking words (with mistakes) or had Latin letters mixed in.
When I switched to UD-Q8_K_XL the mistakes disappeared. But I was a bit put off by one inverted fact in a list of 10 things I asked about.
On the other hand, it appears Mistral Small 4 isn't biased against consumer rights the way Gemma 4 26B A4B is. While Gemma defended user-hostile language in one EULA and refused to call it out for what it was, Mistral Small 4 (Q5, as mentioned earlier) said right from the start that it's a user-hostile and very aggressive EULA.
Steam lists about 15 euros for that purchase.
Kahvana@reddit
How's Mistral Small 4 been for you?
ProfessionalSpend589@reddit
I downloaded the UD-Q8_K_XL quant yesterday and, apart from the EULA test and a few popular questions, I haven't tested it much.
I had trouble running it on one Strix Halo + eGPU (120GB + 32GB VRAM) and had to distribute the weights across a second machine to get it running. It may be a configuration mistake on my part that it couldn't run on one node.
I also tested it by asking it to summarise a 120k-token book, but it paused a few times during the summarisation and I had to prompt it with "continue" to get it going again. Not sure why.
ComfyUser48@reddit
Depends what you use it for. For coding for example, I'd go with Q8 if you can.
Karyo_Ten@reddit
If you can run Q8, just use FP8 in vLLM or SGLang instead: you get MTP and hardware-accelerated FP8 kernels, proper concurrency for 10+ simultaneous queries, and most of the time zero context reprocessing on follow-ups thanks to PagedAttention/RadixAttention.
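A minimal sketch of what that looks like with vLLM's offline Python API, assuming a GPU with FP8 support; the model name and settings are placeholders, and the MTP/speculative options are left out:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct",   # placeholder; swap in the model you actually run
    quantization="fp8",                   # on-the-fly FP8 weight quantization
    max_model_len=32768,                  # cap context to what your cards can hold
)
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain in two sentences why paged attention helps concurrency."], params)
print(outputs[0].outputs[0].text)
```

For serving with real concurrency you'd point clients at a `vllm serve` instance and its OpenAI-compatible endpoint instead, but the quantization setting is the same idea.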
jacek2023@reddit
You can only answer this question by trying multiple quants by yourself, on your own usecases.
Never trust "reddit experts", they just know the benchmarks or youtube slop.
jikilan_@reddit
You can probably compare it to a Blu-ray disc vs Netflix streaming of a 4K movie with Dolby Atmos. It doesn't matter until you actually need that level of quality for what you're doing.
jwpbe@reddit
It all comes down to your use case. You could go all the way down to Q4 if you're just using it as a chatbot. If you have the VRAM to fit it in memory at Q8_XL, you're better off looking at vLLM solutions and getting an FP8 implementation.
If I turn MTP on with vLLM, I can get FP8 Qwen 3.6 27B running between 50 and 80 tokens a second on two 3090s. From what I understand, that's legitimately as good as lossless.
Altruistic_Heat_9531@reddit
Welp, I've got an answer, and it essentially comes down to "needle in a haystack". For MoE models at contexts up to ~32K-ish tokens, Q6 and Q8 quantization are practically the same. However, as context grows past 72K, retrieval degradation becomes apparent: the model may start to forget earlier context or, in the worst cases, fall into repetitive loops. This is most prominent in Qwen MoE. Qwen 27B, however, is quite immune up to 128K~144K-ish contexts when using Q4-Q6, with noticeable degradation only appearing around 200K.
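A crude way to poke at this yourself, assuming a local OpenAI-compatible server running the quant you want to test; the filler text, needle and port are placeholders, and you scale the filler to hit the context length you care about:

```python
import requests

NEEDLE = "The secret launch code is 7431."
filler = "The sky was grey and the meeting ran long. " * 4000   # roughly 40k tokens of filler
haystack = filler[: len(filler) // 2] + NEEDLE + " " + filler[len(filler) // 2 :]

r = requests.post("http://localhost:8080/v1/chat/completions", json={
    "model": "local",
    "messages": [{"role": "user",
                  "content": haystack + "\n\nWhat is the secret launch code?"}],
    "temperature": 0,
}, timeout=600)
print(r.json()["choices"][0]["message"]["content"])   # a healthy quant should answer 7431
```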
false79@reddit
I wouldn't approach this problem from the angle of saving space or VRAM (given how close these models are in size).
Where it would be a legit factor is if you are remotely storing this and it's a fee every month for your VPS.
Otherwise be more liberal and stick with the model that is most reliable for your needs.
sine120@reddit
From what I've seen of KLD, Q6_K_XL seems to be as good as or better than Q8. I've read Q8_0 is a simpler format to dequantize, so it might run faster on certain hardware like CPU inference, but usually memory bandwidth is the bottleneck for me, and the larger file just goes slower.
Spiderboyz1@reddit (OP)
That's what makes me wonder whether Q6_K is now the optimal point that also saves resources. If it's almost the same as Q8, there's no point in keeping Q8.
overand@reddit
It's worth trying one then the other with several tasks that you tend to use, IMO.
There are a few posts about quality relative to quant here; I expect someone will chime in. I believe at least one of the Gemma-4 models (maybe all of them?) is pretty sensitive to quantization.
Spiderboyz1@reddit (OP)
Of course, because I remember reading in another post that the G4 26B and 31B, even at Q8, had somewhat high KLD, and I was surprised. Qwen, on the other hand, seems to be more resistant.
CircularSeasoning@reddit
No one knows. We don't have that much SSD and RAM or time and money to test these things extensively. We're living on borrowed GPUs. Vibe-coding is the future because it's been vibes all along.