Qwen 3.6 27B BF16 vs Q4_K_M vs Q8_0 GGUF evaluation
Posted by gvij@reddit | LocalLLaMA | 150 comments
Evaluated Qwen 3.6 27B across BF16, Q4_K_M, and Q8_0 GGUF quant variants with llama-cpp-python using Neo AI Engineer.
Benchmarks used:
- HumanEval: code generation
- HellaSwag: commonsense reasoning
- BFCL: function calling
Total samples:
- HumanEval: 164
- HellaSwag: 100
- BFCL: 400
Results:
BF16
- HumanEval: 56.10% 92/164
- HellaSwag: 90.00% 90/100
- BFCL: 63.25% 253/400
- Avg accuracy: 69.78%
- Throughput: 15.5 tok/s
- Peak RAM: 54 GB
- Model size: 53.8 GB
Q4_K_M
- HumanEval: 50.61% 83/164
- HellaSwag: 86.00% 86/100
- BFCL: 63.00% 252/400
- Avg accuracy: 66.54%
- Throughput: 22.5 tok/s
- Peak RAM: 28 GB
- Model size: 16.8 GB
Q8_0
- HumanEval: 52.44% 86/164
- HellaSwag: 83.00% 83/100
- BFCL: 63.00% 252/400
- Avg accuracy: 66.15%
- Throughput: 18.0 tok/s
- Peak RAM: 42 GB
- Model size: 28.6 GB
What stood out:
Q4_K_M looks like the best practical variant here. It keeps BFCL almost identical to BF16, drops about 5.5 points on HumanEval, and is still only 4 points behind BF16 on HellaSwag.
The tradeoff is pretty good:
- 1.45x faster than BF16
- 48% less peak RAM
- 68.8% smaller model file
- nearly identical function calling score
Q8_0 was a bit underwhelming in this run. It improved HumanEval over Q4_K_M by ~1.8 points, but used 42 GB RAM vs 28 GB and was slower. It also scored lower than Q4_K_M on HellaSwag in this eval.
For local/CPU deployment, I would probably pick Q4_K_M unless the workload is heavily code-generation focused. For maximum quality, BF16 still wins.
Evaluation setup:
- GGUF via llama-cpp-python
- n_ctx: 32768
- checkpointed evaluation
- HumanEval, HellaSwag, and BFCL all completed
- BFCL had 400 function calling samples
This evaluation was done using Neo AI Engineer, which built the GGUF eval setup, handled checkpointed runs, and consolidated the benchmark results. I manually reviewed the outcome as well.
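For reference, here is a minimal sketch of what a single sample through this kind of llama-cpp-python setup looks like. This is not the Neo AI Engineer pipeline itself (that's in the case study linked below); the model filename, thread count, and prompt are placeholders.

```python
# Minimal sketch of one eval sample through llama-cpp-python.
# Paths, thread count, and the prompt are placeholders, not OP's exact harness.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.6-27b-Q4_K_M.gguf",  # hypothetical local GGUF file
    n_ctx=32768,                            # matches the eval setup above
    n_threads=32,                           # OP's VM has 32 vCPUs
    verbose=False,
)

def run_sample(prompt: str) -> str:
    """Greedy decode one completion (temp=0 is the canonical HumanEval setting)."""
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=1024,
    )
    return out["choices"][0]["message"]["content"]

print(run_sample("Complete this Python function:\n\ndef add(a, b):\n"))
```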
Complete case study with benchmarking results, approach, and code snippets is mentioned in the comments below.
Party-Log-1084@reddit
Always appreciate the numbers. Q4_K_M still seems to be the sweet spot for daily driving. I literally can't tell the difference between Q8 and BF16 in normal use anyway, so BF16 is just a waste of VRAM unless you're fine-tuning.
AlwaysLateToThaParty@reddit
Really appreciate that data. Surprising results for me.
PassengerPigeon343@reddit
Really like seeing this kind of comparison across quants, I feel like we need more of that kind of analysis on here. Thanks for doing this!
klicker0@reddit
http://quanteval.ai
jirka642@reddit
They need better benchmarks. I doubt that Q2 of Gemma-4-31B has the same quality as Q8.
ioabo@reddit
This is such a great site, the work behind it is impressive.
Zestyclose_Piece_427@reddit
This is amazing !
CryptoUsher@reddit
so the 4-bit quant takes a small hit on code generation but barely budges on function calling, which is interesting given the size drop
have you tried running the same eval with dynamic batch sizes to see how latency scales under load, or is this strictly single-stream throughput
Some_thing_like_vr@reddit
This sounds oddly ai generated
CryptoUsher@reddit
yeah i get that, my writing does lean a bit too clean sometimes, probably from typing fast on mobile. fwiw i'm actually running these tests on a 4090 with vllm, batch sizes from 1 to 16. latency holds up pretty well up to batch 8, then starts crawling. was gonna push to 32 but memory starts getting tight with longer sequences.
EpicSpaniard@reddit
Please go outside and talk to someone real. It doesn't sound anything like the output produced by an AI.
Some_thing_like_vr@reddit
Bro you're literally in r/LocalLLaMA geeking out over gguf quants and telling people to go outside lol, you're completely oblivious if you can't spot the obvious chatgpt tone in that first comment. It reads exactly like some bot trying to act interested and I've seen many of them
Shawnj2@reddit
I'd like to see comparison across like 12B BF16 vs 32B IQ2 where different models take up the same memory but are different sizes and quants
DeepOrangeSky@reddit
Well, best of all would be to see more of what they did in this thread, a model vs smaller quants of itself, but going all the way down to Q2 or Q1 or whatever, not just down to Q4. That lets you know how much the model is affected by quantization relative to larger quants of the same exact model, and if you have those results for numerous main models, you also get to compare them against each other, which is what you're asking about. Two birds with one stone.
So I actually like this the most, I just wish they would go even further down, to something like Q2_XXS.
Glittering-Call8746@reddit
Why not Q6? I think Q6, if it can fit in VRAM, would be best. Compare imatrix quants (ik_llama.cpp) too?
ioabo@reddit
Yeah, would be nice to see how Q6 and imatrix quants perform. Though I assume if Q4 is so close to Q8, then Q6 shouldn't be so far away?
Glittering-Call8746@reddit
One question though: do benchmarks accurately capture the nuance of accuracy? One character makes a difference if I'm coding and the thing doesn't compile. I've been using Gemini Flash as of late and even it screws up command lines with one character missing, and then it goes into forever loops (not agentic).
MaCl0wSt@reddit
yeah I'd love a standardized, solid way to measure quant differences. those of us with very constrained systems need to make a lot of tests trying to figure out the tradeoffs of latency and quality/consistency (with every new model that comes out..), and it takes time to get reliable results
doc-acula@reddit
Gemma4 would be nice.
Far-Low-4705@reddit
Yes, Gemma 4 is much more sensitive to quantization iirc
stddealer@reddit
It's been shown that KLD increases a lot more with quantization for Gemma 4 models, but how much that actually degrades performance on real-world use cases remains to be seen.
IrisColt@reddit
This!
cleversmoke@reddit
Great work! I'm using the Qwen3.6-27B Q5_K_XL variant myself as it fits nicely on a RTX 3090 24G with 96k context and q8_0 KV cache. I'm quite blown away by its ability to follow directions, analyze data, and give solid output.
I've stopped using the Qwen3.6-35B-A3B due to it hallucinating even on the first 10k tokens! I do miss the speed though, but I rather wait for better output from 27B than having to run 35B-A3B multiple times.
YourNightmar31@reddit
You're using a headless 3090, I'm guessing? With Windows overhead I was only able to get the Q4 XL working with 131k context, k turbo3 and v turbo2. Had like 500MB VRAM left.
cleversmoke@reddit
Yea I'm on headless with no display and minimal overhead. Display and Windows is done by the iGPU. I haven't tried 27B Q4, but 27B Q5 is significantly better than 35B-A3B Q4.
Are you considering pairing a smaller card to serve your Windows overhead so you can go headless 3090?
YourNightmar31@reddit
Actually i lied about my setup, just checked, i ended up on Q4XL with 161k context with both k and v on turbo4.
I have thought about it but I am waiting for Forza Horizon 6, which I need my 3090 for. What I might do is dual boot Ubuntu: when I'm working from home I can boot into Windows and work with Qwen3.6 27B running in a docker container, and when I'm at work I can boot my PC into Ubuntu before I leave, run Q5 of Qwen3.6 27B instead of Q4 because of less VRAM overhead, and access it through Tailscale.
cleversmoke@reddit
Nice! Yea that sounds gnarly. I have Qwen 27B and OpenCode on docker containers to keep agents isolated too.
9r4n4y@reddit
Op you are doing good work. Keep comparing the different quants
90hex@reddit
Fantastic insights. Everybody wants to know exactly this: how are quants compared to the full-precision. Good job!
somerussianbear@reddit
Finally being poor pays off
Chance_Value_Not@reddit
Need to test more quants. Im pretty sure there is a sweet spot somewhere other than 4km
ZealousidealBadger47@reddit
Nice graph, hope to add more Q (e.g. 2, 5 & 6)
nathandreamfast@reddit
One thing I'd like to see is reasoning/CoT analysis. I've found when benchmarking different abliterated models, that sometimes CoT can get cut off with token limits and not produce a result which degrades the overall score.
Is this the case here? Some transparency to how these are handled, if the full CoT is provided or if reasoning was cut off would be good to see. Sometimes these smaller models get stuck in loops which exhaust the reasoning.
Some of these tests may not use CoT and use logliklihood too, so it may not apply to them. The coding benchmarks I imagine will need to use CoT.
One_Key_8127@reddit
Gemma 3 4B is over a year old and scores more than this on HumanEval. Llama3-8b also scores better on HumanEval. I think something is very wrong with these numbers... Qwen3.6 27b should be scoring 85%+, not ~50%.
https://evalplus.github.io/leaderboard.html
https://llm-stats.com/benchmarks/humaneval
AA72ON@reddit
Yea the ram is wrong too. Q8 is not 42gb
jeffwadsworth@reddit
Yeah, that was weird. Not to mention the Q8 scoring less than the 4. That would indicate a problem with the Q8.
Exciting_Garden2535@reddit
Yeah, another suspicious data point is Peak RAM vs. model size; I suppose the context overhead would be Peak minus Model size.
cyan2k2@reddit
Yes something is very wrong with whatever OP is doing. 50% HumanEval is proper "absolutely unusable" tier.
cosmicnag@reddit
How is Q8 worse than Q4 in some tests?
robertpro01@reddit
It can't be
Dabalam@reddit
Certainly can be. Depends on the test.
The base model being the best overall is a reasonable assumption to hold, but it doesn't follow that we should expect every individual kind of task to show measurable degradation. This opens up the possibility that, by pure chance, quantized models might appear slightly better on some individual benchmarks that aren't impacted much by quantization (there may be no statistically significant difference between them in reality).
This is why this kind of analysis is more meaningful than just testing deviations from BF16 model answers, because deviation from the original model's predictions isn't always harmful to a given task.
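For illustration (not part of OP's harness), a quick significance check on the HellaSwag gap between Q4_K_M (86/100) and Q8_0 (83/100) using a Fisher exact test:

```python
# Is 86/100 vs 83/100 distinguishable from noise? Fisher exact test on the 2x2 table.
from scipy.stats import fisher_exact

table = [[86, 100 - 86],   # Q4_K_M: passes, fails
         [83, 100 - 83]]   # Q8_0:   passes, fails
odds_ratio, p_value = fisher_exact(table)
print(f"p = {p_value:.2f}")   # well above 0.05
```

At this sample size the gap is nowhere near statistically significant, which is consistent with the "pure chance" reading above.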
suicidaleggroll@reddit
That just means it's a sampling error, and OP didn't perform enough iterations in their test to produce reliable results.
Dabalam@reddit
There are multiple kinds of randomness that could be impacting the results. If it were just sampling error, that could mean the results are straight up incorrect and that Q8 in reality is better at the benchmark than Q4. That's not the only possibility though.
It could also be that by chance quantization really has no effect on this particular benchmark even if you were to sample it multiple times. The Q4 model could really be equivalent because by chance quantization doesn't matter for this particular task. This combined with sampling error could give a false impression that quantization boosts performance, since all the separation is produced by error. It's also an edge possibility (although I find this less plausible as I can't explain why it should occur) that quantization degrades performance overall but happens to boost performance on some particular benchmarks.
Again, this is governed by randomness: the more tests you do, the more likely you are to see weird edge cases even if quantization is bad overall (i.e. we aren't adjusting for multiple comparisons).
suicidaleggroll@reddit
Right, equivalent, but not better. The fact that Q4 appears to be better means it's sampling error, and we really don't know what the real numbers are. Is Q4 similar to Q8 on this bench, or is it worse, and if so, how much worse? We don't know the error bars, all we can tell from this data is that they're obviously fairly large.
EmPips@reddit
There's always the near-zero chance that the reduced precision from what Alibaba targeted resulted in the weights of a totally different superior model through complete randomness.
About the same odds as me plugging integers into a text file and making a SOTA-competitor.. BUT NOT ZERO.
robertpro01@reddit
Well, yes, that makes sense. In that case I still think the test is wrong, as it should account for that case.
tavirabon@reddit
Easy, they weren't tested correctly or enough.
AppleBottmBeans@reddit
It's exactly the same reason some people still claim 3.6 is on par with Opus 4.7 lol
stddealer@reddit
Huge margins of error, because there weren't enough runs.
Farmadupe@reddit
The entire post is complete hallucinated slop. OP didn't provide any actual data to back up their claims, they didn't describe their test harness, and they didn't even tell us which tests they selected.
Eyelbee@reddit
These benchmarks are just q/a datasets, it's normal to have noise.
fgp121@reddit
My guess is that Q4 did 3 points better than Q8 in the 100-sample HellaSwag set; otherwise it would have been worse. I think it's a sampling thing. HellaSwag has thousands of samples in the complete dataset, but this is, I guess, a mini evaluation on it, where Q4 just happened to do better on some of the samples in that 100-sample set.
Embarrassed_Adagio28@reddit
Thanks! Could you please compare against q5 too? Since it is what qwen suggests to use at minimum for coding.
ClearApartment2627@reddit
The real question is, what are the differences on 32k, 64k or 128k contexts. Aggressive quants (<q6) do fine on benchmarks that often use 4k contexts. They fall apart in real life, when you need to handle 30k or more.
Far-Low-4705@reddit
THIS IS WHAT WE NEED!!!
llitz@reddit
Every time I downgrade to q8 the model makes more mistakes. BF16 still makes tons of mistakes too, but for whatever reason it takes longer for me to spot them.
Far-Low-4705@reddit
What model do you use? What is your use case?
llitz@reddit
At the moment, a mix of coding in a few languages and sysadmin configurations, which involves either reading and understanding documentation or researching.
Oficial 27b model, tried fp8, and some int8 variants.
Keeping KV at 16 helps, but I do have to make more manual corrections when going down. The int8 seems slightly better, but mtp is broken on it so I can't be bothered to use it for long.
Fp8 works faster (and allow more context), so I am already mentally prepared to manually steer mistakes in configurations.
It feels like... close to what we had with ChatGPT-5.2, or maybe the previous one. But qwen does tend to easily go off the rails above 128k context.
Splitting the work in smaller chunks and running a few extra fresh validation sessions catches most stuff, which is the tradeoff: qwen uses tons of "tokens" reasoning and trying to enable preserve_thinking gobbles up context window in a few minutes.
I was launching the same operation a few times in a row and qwen would, between reasoning and other stuff, advertise 300k tokens total in the session. For some reason, qwen got crazy in the main session and spawned a chatgpt-5.5 session: 40k total tokens, same prompt spot checking a plan.md and generating a better revised_plan.md (it is expected to be better, but using 14% of the budget was unexpected). Qwen is noisy
Btw, mine also goes into... Yelling spree? Like it was mostly trained on this one frustrated dev's data -_-
Far-Low-4705@reddit
uh, idk honestly.
to me it seems like the difference between fp16 and q8 is so niche that it could honestly just be the underlying small model, not an actual quality loss from quantization. you might be inclined to think it's there, and search for it more with q8, and therefore notice more mistakes. that would be my guess.
also, if you want performance, and use anything beyond 8k tokens, do not use KV cache quantization at all. if you are fine with an obvious performance hit, and need the extra context, go for it, but i wouldn't use it otherwise, id MUCH rather use a lower quant.
as for token usage, yeah it does use a lot of tokens. you might want to consider using the instruct mode to save on tokens. maybe use the thinking mode to generate a general plan, then instruct mode to execute it. you'd probably save a lot on tokens
Also your harness affects performance A LOT, so try out different harnesses.
For me, honestly 35b is good enough. i need the speed. it has less "raw intelligence" on really hard STEM like problems, but imo, it is just as good at agentic tasks. also 27b just runs too slow for me, 20 T/s aint it.
GCoderDCoder@reddit
I love unsloth's versions and they are the best. To be clear though, a 5-point difference on a score around 50% is a ~10% relative difference from the original, and that compounds as the context grows, so you can get a lot of divergence where Q4 does not feel like the model they reference in the initial benchmarks. We see how a few percent on benchmarks feels significant between models, and unsloth is the best at this, so I'd imagine Q4 from other providers would likely be 15% or more off.
audioen@reddit
No error bars in these measurements. We know that Q4_K_M is not likely to be better than Q8_0, and the fact that the benchmark ordered them this way at least once raises the question of how much of this is just sampling error.
ikkiho@reddit
Worth quantifying the error bar intuition because ranking actually depends on it. N=100 (HellaSwag) gives Wilson 95% CIs of roughly +/- 7-10pp, N=164 (HumanEval) is +/- 7-8pp, N=400 (BFCL) is +/- 4-5pp. Almost every gap between BF16, Q4_K_M and Q8_0 in the table sits inside the noise floor for HellaSwag and HumanEval, including the Q8_0 < Q4_K_M inversion everyone's flagging. One seed at N=100 isn't enough to call the order; you'd want 3-5 seeds per quant or much larger eval sets (HumanEval+ 2k augmented variants, full HellaSwag val at 10k).
Two related notes:
1. The bigger oddity is the HumanEval-vs-leaderboard gap. Qwen3.6-27B published at ~85% pass@1, OP is at 50-56% on BF16. That's not quant degradation, that's eval setup. Probable culprits: prompt template (Qwen instruct uses a specific chat schema; a raw completion-style HumanEval harness tanks scores), stop tokens (a missing <|im_end|> produces runaway generations that fail to compile), and sampling (canonical HumanEval is greedy temp=0 with the EvalPlus pass@1 protocol). Worth running against the EvalPlus canonical wrappers before publishing the BF16 number as the baseline.
2. Q8 < Q4 specifically: Q4_K_M quants are often built with an importance matrix that can upweight token patterns the eval happens to probe, while Q8_0 is uniform per-block. On a small-N HellaSwag this can flip the ranking purely from the imatrix calibration data choice, which is a calibration artifact rather than a precision artifact. Posting the imatrix dataset and KV cache type alongside the weight quant would explain a lot of variance.
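A quick way to reproduce the interval widths quoted above, purely for illustration:

```python
# Wilson 95% intervals for the sample sizes used in this eval.
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

for name, correct, n in [("HellaSwag Q4_K_M", 86, 100),
                         ("HumanEval BF16", 92, 164),
                         ("BFCL BF16", 253, 400)]:
    lo, hi = wilson_ci(correct, n)
    print(f"{name}: {correct}/{n} -> [{lo:.1%}, {hi:.1%}]")   # roughly +/-7pp at N=100
```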
llama-impersonator@reddit
yeah, hellaswag is an eval with known high variance.
Exodus124@reddit
How do you think error bars work?
wes_medford@reddit
Single pass through each eval with a single seed doesn't tell much unless gaps are large. In this case, a reversal in perf between Q8 and Q4 would need an explanation, and sample bias could explain it.
These tests are noisy, so a true comparison should use multiple seeds for each eval. A box and whisker for each, with median reported, would provide more signal. Basically right now we could be looking at p1 for q8 and p99 for q4 with no way to tell the difference.
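A tiny simulation of that point, assuming two quants whose true accuracies are 86% and 88% on a 100-question eval (the accuracies are made up for illustration, not taken from OP's run):

```python
# How often does the "worse" quant tie or beat the "better" one in a single 100-question pass?
import random

def one_run(true_acc: float, n: int = 100) -> int:
    """Number of correct answers in one pass over an n-question benchmark."""
    return sum(random.random() < true_acc for _ in range(n))

random.seed(0)
trials = 10_000
flips = sum(one_run(0.86) >= one_run(0.88) for _ in range(trials))
print(f"worse quant ties or wins in ~{100 * flips / trials:.0f}% of single runs")
```

With gaps this small and N=100, rank inversions in a single run are common, which is exactly why medians over multiple seeds are more informative.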
autonomousdev_@reddit
i run q4 on my 4090 for coding and honestly its basically as good as q8 for what i need. the bf16 model was way overkill like triple the ram for barely any difference. q4 is fine for local dev unless youre doing actual research
mister2d@reddit
This report is meaningless without stating which exact models were tested.
autonomousdev_@reddit
qwen 3.6 27b is the first time i got usable output from q4_k_m. ran it on 30 emails for edge case extraction and q8_0 caught 2 things q4 missed. fine for chatting but dont use it for actual data work.
dionysio211@reddit
This is very interesting. It surprises me that there is such a difference between Q8 and BF16, which I would normally consider close to lossless. I know that these are all small differences but a 3.7 point drop (5.5 point drop to Q4_K_M) seems considerable right? It's a 6%/10% loss in accuracy which is almost a generational difference it seems. For a dense model, in particular, this does seem surprising to me. Another surprising aspect of this is that BFCL uses about 10x more context than the other two per question and it has the smallest difference between quantizations. Some of this could come down to sample size too I suppose. Unsloth is obviously top of the game in these things and the information is very appreciated.
We have some spare compute currently. I may run a few quants through these and some other benchmarks to see how different types of quants fare.
Temporary-Mix8022@reddit
One thing I am still dying to know.. and Gemma might be a good one to do it on.
Does, say, a 4-bit quant of a large model beat a BF16 or 8-bit quant of a small model?
What about dense vs MOE on a similar basis?
A lot of us are RAM constrained, and/or compute.. and it'd be pretty interesting
PetToilet@reddit
Yup for me 35B A3B Q8 runs like 3-8x faster than 27B IQ3
iMakeSense@reddit
r/povertyLocalLLaMA
chr0n1x@reddit
thanks for this, I'd love to see something like this for the 35B-MoE-A3B!
Fedor_Doc@reddit
Q4_K_M being the best tradeoff is not the key insight, if you look at the data. It is common knowledge.
What is interesting, though, is that Q8_0 performs worse on HellaSwag than Q4_K_M.
Possible causes:
1. The benchmarks are run only once, which does not account for run-to-run variation. If this is the case, we do not know whether model quality has degraded or whether specific runs were just unlucky. Is it pass@1?
2. HellaSwag is a bad / contaminated benchmark that does not correlate with model quality.
3. Q8_0 quant / inference settings were not optimal.
4. Uniform Q8_0 can damage the model more than Q4_K_M.
Please, review data yourself before writing conclusions. You can ask LLMs about data points as well. Even big LLMs (e.g. Gemini 2.5 Pro in my experience) sometimes ignore data points that contradict initial or most common hypothesis.
hurdurdur7@reddit
I share your sentiment, probably testing was not done correctly. It makes very little mathematical sense how q8 can be behind any q4 in correctness/accuracy. I would guess q8 got unlucky and q4km got lucky in that testrun.
Dabalam@reddit
Lucky sampling is one explanation. It doesn't seem impossible that different kinds of compression might by chance result in predictions that are better for a given benchmark. HellaSwag is multiple-choice questions about common-sense scenarios. It isn't clear how we should expect different quantizations to impact this particular domain even if Q8 is better overall.
Dabalam@reddit
Contamination alone isn't a good explanation for why Q4 would perform better than Q8. There isn't a clear reason why contamination should favour one quant over the other. It might lead to ceiling effects, but you would still need one of the other causes to be true, e.g. ceiling effects from contamination amplifying problems arising from causes 1 (chance), 3 (biased settings) or 4 (a damaging quantization process).
misha1350@reddit
Incorrect comparison. There are various publishers on HuggingFace, and it's always better to use the weights from Bartowski and Unsloth and others. Unsloth usually publishes good graphs showing the KLD results for many of the newer models, and the weights from the likes of LM Studio consistently have the worst quality loss.
Try to compare not just Q4, but also Q5 quants as well. Q4_K_L and Q4_K_XL quants would be the better ones, and Q5_K_M / Q5_K_L / Q5_K_XL are the sweet spot, especially for MoE models with less than 5B active parameters.
iMakeSense@reddit
Could you give an example link to a decent comparison?
stddealer@reddit
Imatrix quants are inherently biased, and they're really only better if you're using them for tasks similar to the ones in the imatrix calibration set. If you're working in a different language, for example, they will be worse.
ivoras@reddit
Where's the 2.3x throughput increase (the "key insight" from the image), if BF16 runs at 15.5 tps, and Q4_K_M runs at 22.5 TPS? That's about 45% increase, as it says on the lower-right box in the image?
Would it be correct to state that the quant-derived performance improvement is almost entirely because of memory footprint reduction?
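One rough way to sanity-check that, using only the numbers from the post: if decode were purely memory-bandwidth-bound, the speedup should roughly track the ratio of bytes streamed per token (approximately the model file size). This is a back-of-envelope illustration, not a claim about OP's hardware.

```python
# Compare ideal bandwidth-bound speedup (file-size ratio) against observed speedup.
sizes_gb = {"BF16": 53.8, "Q8_0": 28.6, "Q4_K_M": 16.8}   # model sizes from the table
toks = {"BF16": 15.5, "Q8_0": 18.0, "Q4_K_M": 22.5}       # tok/s from the table

for quant in ("Q8_0", "Q4_K_M"):
    predicted = sizes_gb["BF16"] / sizes_gb[quant]   # ideal speedup if purely bandwidth-bound
    observed = toks[quant] / toks["BF16"]
    print(f"{quant}: predicted {predicted:.2f}x vs observed {observed:.2f}x")
# Q4_K_M: predicted ~3.2x vs observed ~1.45x, so something other than weight streaming
# (compute, dequant overhead, or the harness itself) appears to limit throughput here.
```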
Eisenstein@reddit
also, a change from 56.1% correct to 50.6% correct on HumanEval is not a 5.5% accuracy drop, it is a roughly 10% (relative) accuracy drop.
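The arithmetic, using OP's HumanEval numbers:

```python
# Absolute (percentage-point) drop vs relative drop.
bf16, q4 = 56.10, 50.61
print(f"absolute drop: {bf16 - q4:.1f} points")       # ~5.5 points
print(f"relative drop: {(bf16 - q4) / bf16:.1%}")     # ~9.8%, i.e. roughly 10%
```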
llitz@reddit
That was bothering me so much...
Iory1998@reddit
It's about time you add Q6_K_M to the mix, please.
dpenev98@reddit
Thanks for this experiment! I've been looking for exactly this type of benchmark.
Can you share your full hardware setup?
spaceman_@reddit
Very interesting to see these kinds of evals. Kind of surprised at the "damage" done to the Q8_0 model.
Are you guys planning to run these against other models as well? (other Qwen3.6 sizes or just a different model family, curious about either)
stddealer@reddit
Q8_0 being that much worse than Q4_K_M at HellaSwag just shows that the margins of error are huge for this test.
Content_Bite_4191@reddit
ofc margins would be huge on 1x pass. this comparison is totally unreliable
Monad_Maya@reddit
The blog doesn't really have any additional details, not even the prompt given to "Neo Engineer".
MelodicRecognition7@reddit
this thread is a smart advertisement, crafted to avoid deletion for breaking the "limit self promotion" rule.
Monad_Maya@reddit
Certainly but the results are interesting.
I'm checking https://openbenchmarking.org/test/pts/llama-cpp&eval=db257a7755adbe206b4709bcdd5c5fdb23ba90fa#metrics and the OP's results are interesting to say the least.
Closest CPU on that chart is TR Pro 9975wx but it has 8ch memory and the OP's Epyc 9965 has 12ch memory. He's using a VM with 32vCPUs, maybe there's some overhead.
vulcan4d@reddit
This is very nice testing. Those UD quants seriously need testing to see if they are all they're cracked up to be.
Maheidem@reddit
This is great, really liked seeing it. But imagine if it kept going all the way down to a Q2 or something.
ArugulaAnnual1765@reddit
I wonder how much better iq4_nl is than q4_k_m
SosirisTseng@reddit
Would like to know that, too. Unsloth only has the results of the MoE model: https://unsloth.ai/docs/models/qwen3.6#unsloth-gguf-benchmarks
ai_without_borders@reddit
useful benchmark but it's evaluating only one dimension of the quality-cost tradeoff. in practice the decision isn't just which weight quant, it's the joint allocation of your VRAM budget across weights, KV cache, and context. a Q4_K_M model with Q8_0 KV at 32k context has a very different quality profile than Q8_0 weights with Q4_0 KV at 16k context -- same hardware, wildly different operating points. the weight quant is usually the smaller quality hit compared to aggressive KV compression, which most evals skip. would be curious to see this extended with kv quant as a variable, especially at longer context lengths where the KV budget starts dominating.
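A rough calculator for that KV-side budget; the layer/head numbers below are placeholders rather than the real Qwen 3.6 27B config, so treat the absolute values as illustrative only:

```python
# KV cache memory as a function of context length and cache precision.
# Architecture numbers are hypothetical placeholders, NOT the actual Qwen 3.6 27B config.
def kv_cache_gib(n_ctx: int, n_layers: int = 48, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    # K and V tensors: one (n_kv_heads * head_dim) vector per layer per token
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 2**30

for ctx in (16_384, 32_768, 131_072):
    print(f"{ctx:>7} ctx: f16 KV ~{kv_cache_gib(ctx):.1f} GiB, "
          f"8-bit KV ~{kv_cache_gib(ctx, bytes_per_elem=1.0):.1f} GiB")
```

Even with made-up layer counts, the scaling is the point: the KV side grows linearly with context, so at long contexts it can rival or exceed what a smaller weight quant saved.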
MrMisterShin@reddit
Now include AWQ and FP8
Quirky_Inflation@reddit
That's just garbage in a graph
Equivalent-Ear-8016@reddit
Finally someone did the tests instead of guessing. I was tired of reading opinions on this sub without any substance behind them.
chitown160@reddit
and yet MXFP8 and MXFP4 are still slept on by Blackwell owners, and by this benchmark too.
xrvz@reddit
OP, do you also use the YYYY/DD/MM date format?
Eyelbee@reddit
How did you do the humaneval? Scores seem low
daily_spiderman@reddit
Agreed!
Pretend_Engineer5951@reddit
Those are very strange results.
What KV cache quant did you use with llama.cpp? FYI: the default f16 has an issue: https://github.com/ggml-org/llama.cpp/issues/20035 . Unsloth recommends using bf16 or q8.
And did you use base models or unsloth?
himefei@reddit
No one is questioning how many years it took to complete these tests???
Look_0ver_There@reddit
What was the KV cache quantization used for each test?
Tagedieb@reddit
A similar comparison with different kv quants would also be useful.
ea_man@reddit
What bothers me the most with this release is the model size:
Now with Qwen3.6 you can't fit a Q4_K_M on a 16GB GPU, and IQ3_XXS is borderline usable on a 12GB. Those are the smallest ones btw: https://huggingface.co/mradermacher/Qwen3.6-27B-i1-GGUF , unsloth's quants are bigger.
3.5 was slightly smaller; I'd hope that next time they make something like a ~24B version.
Ki1o@reddit
I'd love to see a benchmark that shows actual complex task completion with multi step + tool calls for these different models.
My instinct, which I'd love to get data to prove, is that minor reductions in quality from quantization are more than made up for by increased token generation speed. Ultimately, faster token output plus rework feels like it would end up quicker than slower token output (with higher bench scores) plus rework.
SmartCustard9944@reddit
Bad tokens in previous conversations pollute the context
bnolsen@reddit
q8_k_xl vs q8_0 ?
someone383726@reddit
Thanks for providing this service!
SmartCustard9944@reddit
You don't mention who provides the quants. Also, it would be interesting to measure hallucination rate and tool calling accuracy, because those feel like some of the first things to go with quants.
Intelligent_Ice_113@reddit
this should be even better with dynamic quants and a blinded model (to take up less RAM for code-only tasks)
WhoRoger@reddit
Yass, this is much more useful than the synthetic KLD number. Q4 doing better than Q8 in some evals is interesting.
But I'd be careful about generalising the conclusions, especially since only Q4 and Q8 are compared here. Q6 may be the sweet spot with other models (especially the smaller ones). And then there's imatrix.
gvij@reddit (OP)
Complete Qwen 3.6 27B evaluation case study with benchmarking results, approach and code snippets are mentioned here: https://heyneo.com/blog/evaluating-qwen-3-6-27b-benchmarking-case-study
MelodicRecognition7@reddit
bullshit link: no code snippets, no plain text results, just an advertisement of that "Neo Engineer".
spaceman_@reddit
I was very interested but also couldn't find any code to reproduce the results of the study. Kind of defeats the purpose, because it's so easy to make procedural mistakes. Or have the LLM generated llama.cpp flags be wrong for one of the tests.
MelodicRecognition7@reddit
the first thing I wanted to do is to run this test on my hardware to verify the results because Q4 quant performing better than Q8 smells like AI hallucination.
Farmadupe@reddit
Yeah 100% this looks completely hallucinated. Q8_0 taking 42G vram is totally ballpark... BF16 taking 54G? No. Not even close. The weights on HF are bigger than 54G.
Also needs to tell us what version of llama.cpp and what subset of tests was chosen. And to not synthesize "Q4 is best" garbage right next to the charts.
Absolutely totally hallucinated.
spaceman_@reddit
Wow I didn't even notice that.
gvij@reddit (OP)
Mind your language. The blog might have missed adding the github repo link. Here it is: https://github.com/gauravvij/slm_eval_harness
Has the complete evaluation pipeline and steps.
Fedor_Doc@reddit
Qwen 3.5 35B A3B results from the github README show that Q4_K_M gets the best results in benchmarks compared to Q8_0 and BF16.
Have you analysed those results? Has quantization increased model performance?
pepedombo@reddit
Usually q8 is a bit slow, though the gap here seems very small. I've found q5/q6 can lose some detail compared to q8 in coding. We need stronger benchmarks that make the differences more visible.
magnus-m@reddit
are these benchmarks multi-turn agent like?
estrafire@reddit
You should try quantized KV caches too: q4_0, q4_1, q5_1 and q8_0.
autonomousdev_@reddit
ran qwen 2.5 32b q4 vs q8 on my 3090 last week. q8 was like 10% better at keeping track of stuff over long context but it ate an extra 3gb of vram. for coding they felt the same honestly. just stuck with q4 and saved the headroom for tool calls
CheatCodesOfLife@reddit
bots love qwen 2.5 lmao //add 3 minute sleep to avoid rate limit
No_Dig_7017@reddit
This is awesome! Thanks for sharing!
UncleRedz@reddit
I'm missing the source of those quants, was it unsloth? Something else?
What's becoming very clear is that the old method of applying a quant across the board is not the way to do it anymore; some parameters are more important than others. This also means that how a quant was made is very important for determining actual quality after quantisation.
Also, the test samples here are unfortunately too small; 100 questions for each benchmark is not enough, you need to run the full benchmark. As an example, MMLU has something like 14,000 questions.
Last feedback: you are missing a failure counter. Not just pass/fail on a test, but a third state: did the model answer but fail to comply with instructions, answering in the wrong format or going off the rails? As a model is more heavily quantized this error state goes up and can cause all sorts of unexpected issues, so it's good to keep track of in any benchmarking.
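A minimal sketch of that three-way outcome tracking; the format check and sample data below are placeholders, not anyone's actual harness:

```python
# Track pass / fail / noncompliant separately so format breakage is visible.
from collections import Counter

def classify(answer: str, is_correct: bool, well_formed: bool) -> str:
    """Three-way outcome: pass, fail, or answered-but-noncompliant."""
    if not well_formed:
        return "noncompliant"   # wrong format or off the rails, counted on its own
    return "pass" if is_correct else "fail"

# toy stand-ins for real (answer, correct?, well_formed?) tuples from an eval loop
results = [
    ("def add(a, b): return a + b", True, True),
    ("Sure! Here's an essay about addition...", False, False),
    ("def add(a, b): return a - b", False, True),
]
outcomes = Counter(classify(a, c, w) for a, c, w in results)
print(dict(outcomes))   # e.g. {'pass': 1, 'noncompliant': 1, 'fail': 1}
```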
PiaRedDragon@reddit
Try it again with t=0.3
Healthy-Nebula-3603@reddit
Nice
Thanks
ggGeorge713@reddit
Would love to see SWE Bench verified in there.
Any chance you tested that as well?
2Norn@reddit
all i need in my life is prismml to do the same ternary shit on 3.6 27b
fgp121@reddit
So I guess Q4_k_m is the best one in terms of hardware efficiency vs quality trade off?
One_Key_8127@reddit
Gemma 3 4B, released over 1 year ago, scores 71% on HumanEval. I think something is very much off with this result on HumanEval.
nunodonato@reddit
are these unsloth's quants?
Monad_Maya@reddit
What's the hardware setup other than the generic 32 vCPU and 125GB RAM? There are no details about how you measured throughput/TTFT etc. and at what context size.
Additionally was the KV cache quantized?
gvij@reddit (OP)
No explicit KV cache quantization was applied.
Monad_Maya@reddit
That's helpful, thanks.
Current_Ferret_4981@reddit
Would be very curious to see how Q8, Q6, Q5, Q4, Q3 compare to see when the drop off really waterfalls. Seems like there is another nominal hit around Q5 or Q4 and then falls off at Q3?
LeonidasTMT@reddit
Could you also test IQ3_XXS?
Farmadupe@reddit
There's some really dubious results in your graphs.
The VRAM footprint for one thing. The weights alone for the 27b model are bigger than 27G at Q8 and 54G at BF16. Can you give us actual data behind this?
Constandinoskalifo@reddit
Thanks for posting! Could you test the 35B one as well?
gvij@reddit (OP)
Yes, we have evaluated it already. Results should be out soon.
sagiroth@reddit
Basically, run the highest possible quant with the desired context you can fit, and don't dip below Q4 if possible. Only consider a higher quant if you have leftover VRAM.
elswamp@reddit
You did a great comparison!
CaptBrick@reddit
Thank you good sir! Could you also include results with and without cache quantization q8?
gvij@reddit (OP)
Sure, would be happy to experiment on the same and share the results soon on this.
TheCat001@reddit
Haha, nice, and after this Qwen fanboys are still gonna say my Q4 quants suck? xD