Qwen 3.6 27B BF16 vs Q4_K_M vs Q8_0 GGUF evaluation

Posted by gvij@reddit | LocalLLaMA | View on Reddit | 150 comments

Evaluated Qwen 3.6 27B across BF16, Q4_K_M, and Q8_0 GGUF quant variants with llama-cpp-python using Neo AI Engineer.

Benchmarks used:

HumanEval: code generation
HellaSwag: commonsense reasoning
BFCL: function calling

Total samples:

HumanEval: 164
HellaSwag: 100
BFCL: 400

Results:

BF16

HumanEval: 56.10% 92/164
HellaSwag: 90.00% 90/100
BFCL: 63.25% 253/400
Avg accuracy: 69.78%
Throughput: 15.5 tok/s
Peak RAM: 54 GB
Model size: 53.8 GB

Q4_K_M

HumanEval: 50.61% 83/164
HellaSwag: 86.00% 86/100
BFCL: 63.00% 252/400
Avg accuracy: 66.54%
Throughput: 22.5 tok/s
Peak RAM: 28 GB
Model size: 16.8 GB

Q8_0

HumanEval: 52.44% 86/164
HellaSwag: 83.00% 83/100
BFCL: 63.00% 252/400
Avg accuracy: 66.15%
Throughput: 18.0 tok/s
Peak RAM: 42 GB
Model size: 28.6 GB

What stood out:

Q4_K_M looks like the best practical variant here. It keeps BFCL almost identical to BF16, drops about 5.5 points on HumanEval, and is still only 4 points behind BF16 on HellaSwag.

The tradeoff is pretty good:

1.45x faster than BF16
48% less peak RAM
68.8% smaller model file
nearly identical function calling score

Q8_0 was a bit underwhelming in this run. It improved HumanEval over Q4_K_M by \~1.8 points, but used 42 GB RAM vs 28 GB and was slower. It also scored lower than Q4_K_M on HellaSwag in this eval.

For local/CPU deployment, I would probably pick Q4_K_M unless the workload is heavily code-generation focused. For maximum quality, BF16 still wins.

Evaluation setup:

GGUF via llama-cpp-python
n_ctx: 32768
checkpointed evaluation
HumanEval, HellaSwag, and BFCL all completed
BFCL had 400 function calling samples

This evaluation was done using Neo AI Engineer, which built the GGUF eval setup, handled checkpointed runs, and consolidated the benchmark results. I manually reviewed the outcome as well.

Complete case study with benchmarking results, approach and code snippets in mentioned in the comments below 👇

[-]

Party-Log-1084@reddit

Always appreciate the numbers. Q4_K_M still seems to be the sweet spot for daily driving. I literally can't tell the difference between Q8 and BF16 in normal use anyway, so BF16 is just a waste of VRAM unless you're fine-tuning.

[-]

AlwaysLateToThaParty@reddit

Really appreciate that data. Surprising results for me.

[-]

PassengerPigeon343@reddit

Really like seeing this kind of comparison across quants, I feel like we need more of that kind of analysis on here. Thanks for doing this!

[-]

klicker0@reddit

http://quanteval.ai

[-]

jirka642@reddit

They need better benchmarks. I doubt that Q2 of Gemma-4-31B has the same quality as Q8.

[-]

ioabo@reddit

This is such a great site, the work behind it is impressive.

[-]

Zestyclose_Piece_427@reddit

This is amazing !

[-]

CryptoUsher@reddit

so the 4-bit quant takes a small hit on code generation but barely budges on function calling, which is interesting given the size drop

have you tried running the same eval with dynamic batch sizes to see how latency scales under load, or is this strictly single-stream throughput

[-]

Some_thing_like_vr@reddit

This sounds oddly ai generated

[-]

CryptoUsher@reddit

yeah i get that, my writing does lean a bit too clean sometimes, probably from typing fast on mobile. fwiw i'm actually running these tests on a 4090 with vllm, batch sizes from 1 to 16. latency holds up pretty well up to batch 8, then starts crawling. was gonna push to 32 but memory starts getting tight with longer sequences.

[-]

EpicSpaniard@reddit

Please go outside and talk to someone real..it doesn't sound anything like the output produced from an AI.

[-]

Some_thing_like_vr@reddit

Bro you're literally in rlocallama geeking out over gguf quants and telling people to go outside lol, you're completely oblivious if you cant spot the obvious chatgpt tone in that first comment. It reads exactly like some bot trying to act interested and I've seen many of them

[-]

Shawnj2@reddit

I’d like to see comparison across like 12B BF16 vs 32B IQ2 where different models take up the same memory but are different sizes and quants

[-]

DeepOrangeSky@reddit

Well, best of all would be to just see more of the kind they did in this thread, of a model vs small quants of itself (but going all the way down to Q2 or Q1 or whatever, not just down to Q4), and then that let's you both know how much it is affected by the quantization relative to larger quants of the same exact model and let's you compare to other models like what you're asking about (so, two birds with one stone) since if you have the results on numerous main models in this way then you also get to compare them vs each other, too (as well as against themselves, too).

So, I actually like this the most, but I just wish they would go even further down to like Q2_xxs with it

[-]

Glittering-Call8746@reddit

Why not q6.. I think q6 if can fit on vram would be best ? Compare imatrix quants (ik_llama.cpp) too ?

[-]

ioabo@reddit

Yeah, would be nice to see how Q6 and imatrix quants perform. Though I assume if Q4 is so close to Q8, then Q6 shouldn't be so far away?

[-]

Glittering-Call8746@reddit

One question though do benchmarks accurately capture the nuance of the accuracy ? One character makes a difference if I'm coding and the things doesn't compile ie. Been using gemini flash as of late even it screws up on command line with one character missing.. and it goes on forever loops.. (not agentic)

[-]

MaCl0wSt@reddit

yeah I'd love a standardized, solid way to measure quant differences. those of us with very constrained systems need to make a lot of tests trying to figure out the tradeoffs of latency and quality/consistency (with every new model that comes out..), and it takes time to get reliable results

[-]

doc-acula@reddit

Gemma4 would be nice.

[-]

Far-Low-4705@reddit

Yes, Gemma 4 is much more sensitive to quantization iirc

[-]

stddealer@reddit

It's be shown KLD increses a lot more with quantization for gemma4 model, but how much does it actually degrade performance on real world use cases remains to be shown.

[-]

IrisColt@reddit

This!

[-]

cleversmoke@reddit

Great work! I'm using the Qwen3.6-27B Q5_K_XL variant myself as it fits nicely on a RTX 3090 24G with 96k context and q8_0 KV cache. I'm quite blown away by its ability to follow directions, analyze data, and give solid output.

I've stopped using the Qwen3.6-35B-A3B due to it hallucinating even on the first 10k tokens! I do miss the speed though, but I rather wait for better output from 27B than having to run 35B-A3B multiple times.

[-]

YourNightmar31@reddit

You're using a headless 3090 im guessing? With Windows overhead i was only able to get the Q4 XL working with 131k k turbo3 and v turbo2 context. Had like 500mb vram left.

[-]

cleversmoke@reddit

Yea I'm on headless with no display and minimal overhead. Display and Windows is done by the iGPU. I haven't tried 27B Q4, but 27B Q5 is significantly better than 35B-A3B Q4.

Are you considering pairing a smaller card to serve your Windows overhead so you can go headless 3090?

[-]

YourNightmar31@reddit

Actually i lied about my setup, just checked, i ended up on Q4XL with 161k context with both k and v on turbo4.

I have thought about it but i am waiting for Forza Horizon 6 which i need my 3090 for soo 😃 what i might do is dual boot Ubuntu so when im working from home i can boot into Windows and work like that with Qwen3.6 27B running in a docker container, and then when im at work i can boot my pc into Ubuntu before i leave and run Q5 of Qwen3.6 27B instead of Q4 because of less vram overhead and then access it through Tailscale.

[-]

cleversmoke@reddit

Nice! Yea that sounds gnarly. I have Qwen 27B and OpenCode on docker containers to keep agents isolated too.

[-]

9r4n4y@reddit

Op you are doing good work. Keep comparing the different quants

[-]

90hex@reddit

Fantastic insights. Everybody wants to know exactly this: how are quants compared to the full-precision. Good job!

[-]

somerussianbear@reddit

Finally being poor pays off

[-]

Chance_Value_Not@reddit

Need to test more quants. Im pretty sure there is a sweet spot somewhere other than 4km

[-]

ZealousidealBadger47@reddit

Nice graph, hope to add more Q (e.g. 2, 5 & 6)

[-]

nathandreamfast@reddit

One thing I'd like to see is reasoning/CoT analysis. I've found when benchmarking different abliterated models, that sometimes CoT can get cut off with token limits and not produce a result which degrades the overall score.

Is this the case here? Some transparency to how these are handled, if the full CoT is provided or if reasoning was cut off would be good to see. Sometimes these smaller models get stuck in loops which exhaust the reasoning.

Some of these tests may not use CoT and use logliklihood too, so it may not apply to them. The coding benchmarks I imagine will need to use CoT.

[-]

One_Key_8127@reddit

Gemma 3 4B is over a year old and scores more than this on HumanEval. Llama3-8b also scores better on HumanEval. I think something is very wrong with these numbers... Qwen3.6 27b should be scoring 85%+, not \~50%.

https://evalplus.github.io/leaderboard.html

https://llm-stats.com/benchmarks/humaneval

[-]

AA72ON@reddit

Yea the ram is wrong too. Q8 is not 42gb

[-]

jeffwadsworth@reddit

Yeah, that was weird. Not to mention the Q8 scoring less than the 4. That would indicate a problem with the Q8.

[-]

Exciting_Garden2535@reddit

Yeah, another suspicious data is Peak RAM and Model side, so I suppose, the context size will be Peak minus Model:

Quant	Peak RAM	Model size	Context size
BF16	54 GB	53.8 GB	0.2 GB
Q4	28 GB	16.8 GB	11.2 GB
Q8	42 GB	28.6 GB	13.4 GB

[-]

cyan2k2@reddit

Yes something is very wrong with whatever OP is doing. 50% humaneval is proper “absolutely unusable” tier.

[-]

cosmicnag@reddit

How is Q8 worse than Q4 in some tests?

[-]

robertpro01@reddit

It can't be

[-]

Dabalam@reddit

Certainly can be. Depends on the test.

The base model being the best overall is a reasonable assumption to hold, but it doesn't hold that we should expect every individual kind of task to have measurable degradation. This opens up a possibility that by pure chance quantized models might appear slightly better on some individual benchmarks that aren't impacted by much by quantization (there may be no statistically significant difference between them in reality).

This is why this kind of analysis is more meaningful than just testing deviations from BF16 model answers, because deviation from the original models predictions isn't always harmful to a given task.

[-]

suicidaleggroll@reddit

That just means it’s a sampling error, and OP didn’t perform enough iterations in their test to produce reliable results.

[-]

Dabalam@reddit

There's multiple kinds of randomness that could be impacting the results. If it were just sampling error could mean that the results are straight up incorrect and that Q8 in reality is better at the benchmark than Q4. That's not the only possibility though.

It could also be that by chance quantization really has no affect on this particular benchmark even if you were to sample it multiple times. The Q4 model could really be equivalent because by chance quantization doesn't matter for this particular task. This combined with sampling error could give a false impression that quantization boosts performance since all the separation is produced by error. It's also an edge possibility (although I find this less plausible as I can't explain why it should occur) that quantization degrades performance overall but happens to boost performance on some particular benchmarks.

Again this governed by randomness because the more tests you do the more likely you are to see weird edge cases even if quantization is bad overall (i.e. we aren't adjusting for multiple comparisons).

[-]

suicidaleggroll@reddit

It could also be that by chance quantization really has no affect on this particular benchmark even if you were to sample it multiple times. The Q4 model could really be equivalent because by chance quantization doesn't matter for this particular task.

Right, equivalent, but not better. The fact that Q4 appears to be better means it's sampling error, and we really don't know what the real numbers are. Is Q4 similar to Q8 on this bench, or is it worse, and if so, how much worse? We don't know the error bars, all we can tell from this data is that they're obviously fairly large.

[-]

EmPips@reddit

There's always the near-zero chance that the reduced precision from what Alibaba targeted resulted in the weights of a totally different superior model through complete randomness.

About the same odds as me plugging integers into a text file and making a SOTA-competitor.. BUT NOT ZERO.

[-]

robertpro01@reddit

Well, yes, that makes sense, in that case I still dunno the test is wrong, as it should consider that case

[-]

tavirabon@reddit

Easy, they weren't tested correctly or enough.

[-]

AppleBottmBeans@reddit

Its exactly the same reason why some people still claim 3.6 is on par with Opus 4.7 lol

[-]

stddealer@reddit

Huge margin of errors, because not enough runs.

[-]

Farmadupe@reddit

The entire post is complete hallucinated slop. OP didn't provide any actual data to back up their claims, they didn't describe their test harness. they didn't even tell us which tests they selected.

[-]

Eyelbee@reddit

These benchmarks are just q/a datasets, it's normal to have noise.

[-]

fgp121@reddit

My guess is that Q4 did 3 points better than Q8 in hellaswag 100 sample set otherwise it was poor than that. I think it's a sampling thing. Hellaswag has 1000s of samples as the complete dataset. But this is Ig a mini evaluation on that where Q4 performed better in some of the samples in that 100 sample set.

[-]

Embarrassed_Adagio28@reddit

Thanks! Could you please compare against q5 too? Since it is what qwen suggests to use at minimum for coding.

[-]

ClearApartment2627@reddit

The real question is, what are the differences on 32k, 64k or 128k contexts. Aggressive quants (<q6) do fine on benchmarks that often use4k contexts. They fall apart in real life, when you need to handle 30k or more.

[-]

Far-Low-4705@reddit

THIS IS WHAT WE NEED!!!

[-]

llitz@reddit

Every time I downgrade to q8 the model makes more mistakes. Bf16 still makes tons of mistakes butr whatever reason, it takes longer for me to spot them.

[-]

Far-Low-4705@reddit

What model do you use? What is your use case?

[-]

llitz@reddit

At the moment, a mix of coding in a few languages and sysadmin configurations, which involves either reading and understanding documentation or researching.

Oficial 27b model, tried fp8, and some int8 variants.

Keeping KV at 16 helps, but I do have to make more manual corrections when going down. The int8 seems slightly better, but mtp is broken on it so I can't be bothered to use it for long.

Fp8 works faster (and allow more context), so I am already mentally prepared to manually steer mistakes in configurations.

It feels like... Close to what we had with ChatGPT-5.2, or maybe the previous one. But qwen does tend to easily go out the rails above 128k context.

Splitting the work in smaller chunks and running a few extra fresh validation sessions catches most stuff, which is the tradeoff: qwen uses tons of "tokens" reasoning and trying to enable preserve_thinking gobbles up context window in a few minutes.

I was launching the same operation a few times in a row and qwen would, between reasoning and other stuff, advertise 300k tokens total in the session. For some reason, qwen got crazy in the main session and spawned a chatgpt-5.5 session: 40k total tokens, same prompt spot checking a plan.md and generating a better revised_plan.md (it is expected to be better, but using 14% of the budget was unexpected). Qwen is noisy

Btw, mine also goes into... Yelling spree? Like it was mostly trained on this one frustrated dev's data -_-

[-]

Far-Low-4705@reddit

uh, idk honestly.

to me it seems like the difference between fp16 and q8 is so niche that it could honestly just be the underlying small model, not a actual quality loss from quantization. you might be inclined to think it's there, and search for it more with q8, and therefore notice more mistakes. that would be my guess.

also, if you want performance, and use anything beyond 8k tokens, do not use KV cache quantization at all. if you are fine with an obvious performance hit, and need the extra context, go for it, but i wouldn't use it otherwise, id MUCH rather use a lower quant.

as for token usage, yeah it does use a lot of tokens. you might want to consider using the instruct mode to save on tokens. maybe use the thinking mode to generate a general plan, then instruct mode to execute it. you'd probably save a lot on tokens

Also your harness affects performance A LOT, so try out different harnesses.

For me, honestly 35b is good enough. i need the speed. it has less "raw intelligence" on really hard STEM like problems, but imo, it is just as good at agentic tasks. also 27b just runs too slow for me, 20 T/s aint it.

[-]

GCoderDCoder@reddit

I love unsloth's versions and they are the best. However to be clear though, a 5% difference in the score with around a 50% rating is a 10% difference from the original compounded over the life of the context growing exponentially and you can have a lot of divergence where q4 does not feel like the model that they reference in the initial benchmarks. For models we see how a few percent in benchmarks feels significant and unsloth is the best so imagine q4 from other providers would likely be 15% or more I'd imagine.

[-]

audioen@reddit

No error bars in these measurements. We know that Q4_K_M is not likely to better than Q8_0 and the fact benchmark ordered them in this order at least once raises the question of how much this is just sampling error, then.

[-]

ikkiho@reddit

Worth quantifying the error bar intuition because ranking actually depends on it. N=100 (HellaSwag) gives Wilson 95% CIs of roughly +/- 7-10pp, N=164 (HumanEval) is +/- 7-8pp, N=400 (BFCL) is +/- 4-5pp. Almost every gap between BF16, Q4_K_M and Q8_0 in the table sits inside the noise floor for HellaSwag and HumanEval, including the Q8_0 < Q4_K_M inversion everyone's flagging. One seed at N=100 isn't enough to call the order; you'd want 3-5 seeds per quant or much larger eval sets (HumanEval+ 2k augmented variants, full HellaSwag val at 10k).

Two related notes:

The bigger oddity is the HumanEval-vs-leaderboard gap. Qwen3.6-27B published at ~85% pass@1, OP is at 50-56% on BF16. That's not quant degradation, that's eval setup. Probable culprits: prompt template (Qwen instruct uses a specific chat schema, a raw completion-style HumanEval harness tanks scores), stop tokens (missing <|im_end|> produces runaway generations that fail compile), sampling (canonical HumanEval is greedy temp=0 with eval+ pass@1 protocol). Worth running against eval+ canonical wrappers before publishing the BF16 number as the baseline.

Q8 < Q4 specifically: Q4_K_M uses an importance matrix during quantization that can upweight token frequencies the eval happens to probe, while Q8_0 is uniform per-block. On a small-N HellaSwag this can flip ranking purely from imatrix calibration data choice, which is a calibration artifact rather than a precision artifact. Posting the imatrix dataset and KV cache type alongside weight quant would explain a lot of variance.

[-]

llama-impersonator@reddit

yeah, hellaswag is an eval with known high variance.

[-]

Exodus124@reddit

How do you think error bars work?

[-]

wes_medford@reddit

Single pass through each eval with a single seed doesn’t tell much unless gaps are large. In this case, a reversal in perf between Q8 and Q4 would need an explanation, and sample bias could explain it.

These tests are noisy, so a true comparison should use multiple seeds for each eval. A box and whisker for each, with median reported, would provide more signal. Basically right now we could be looking at p1 for q8 and p99 for q4 with no way to tell the difference.

[-]

autonomousdev_@reddit

i run q4 on my 4090 for coding and honestly its basically as good as q8 for what i need. the bf16 model was way overkill like triple the ram for barely any difference. q4 is fine for local dev unless youre doing actual research

[-]

mister2d@reddit

This report is meaningless without stating which exact models were tested.

[-]

autonomousdev_@reddit

qwen 3.6 27b is the first time i got usable output from q4_k_m. ran it on 30 emails for edge case extraction and q8_0 caught 2 things q4 missed. fine for chatting but dont use it for actual data work.

[-]

dionysio211@reddit

This is very interesting. It surprises me that there is such a difference between Q8 and BF16, which I would normally consider close to lossless. I know that these are all small differences but a 3.7 point drop (5.5 point drop to Q4_K_M) seems considerable right? It's a 6%/10% loss in accuracy which is almost a generational difference it seems. For a dense model, in particular, this does seem surprising to me. Another surprising aspect of this is that BFCL uses about 10x more context than the other two per question and it has the smallest difference between quantizations. Some of this could come down to sample size too I suppose. Unsloth is obviously top of the game in these things and the information is very appreciated.

We have some spare compute currently. I may run a few quants through these and some other benchmarks to see how different types of quants fare.

[-]

Temporary-Mix8022@reddit

One thing I am still dying to know.. and Gemma might be a good one to do it on.

Does say, a 4bit quant of a large model, beat an bf16 or 8bit quant of a small model?

What about dense vs MOE on a similar basis?

A lot of us are RAM constrained, and/or compute.. and it'd be pretty interesting

[-]

PetToilet@reddit

Yup for me 35B A3B Q8 runs like 3-8x faster than 27B IQ3

[-]

iMakeSense@reddit

r/povertyLocalLLaMA

[-]

chr0n1x@reddit

thanks for this, Id love to see something like this for the 35B-MoE-A3B!

[-]

Fedor_Doc@reddit

Q4_K_M as best tradeoff is not the key insight, if you look at the data. It is a common knoweledge.

What is interesting though, is that Q8_0 perform worse on HellaSwag than Q4_K_M.

Possible causes: 1. The benchmarks are run only once, is does not account for run-to-run variations. If this is the case, we do not know if model quality has degraded or specific runs were just not lucky enough. Is it pass 1? 2. HellaSwag is a bad / contaminated benchmark that does not correlate with the model quality. 3. Q_8 quant / inference settings were not optimal 4. Uniform Q_8 can damage model more than Q4_K_M

Please, review data yourself before writing conclusions. You can ask LLMs about data points as well. Even big LLMs (e.g. Gemini 2.5 Pro in my experience) sometimes ignore data points that contradict initial or most common hypothesis.

[-]

hurdurdur7@reddit

I share your sentiment, probably testing was not done correctly. It makes very little mathematical sense how q8 can be behind any q4 in correctness/accuracy. I would guess q8 got unlucky and q4km got lucky in that testrun.

[-]

Dabalam@reddit

Lucky sampling is one explanation. It doesn't seem impossible that different kinds of compression might by chance result in predictions that are better for a given benchmark. HellaSwag are multiple choice questions on common sense scenarios. It isn't clear we can say how we should expect different quantization to impact this particular domain even if Q8 is better overall.

[-]

Dabalam@reddit

Contamination alone isn't a good explanation for why Q4 would perform better than Q8. There isn't a clear reason why contamination should favour one quant over the other. It might lead to ceiling effects, but you would still 1 of the other causes to be true e.g. ceiling effects from contamination amplify problems arising from causes 1 (chance), 3 (biased setting) or 4 (damaging quantization process).

[-]

misha1350@reddit

Incorrect comparison. There are various publishers on HuggingFace, and it's always better to use the weights from Bartowski and Unsloth and others. Unsloth usually publishes good graphs showing the KLD results for many of the newer models, and the weights from the likes of LM Studio consistently have the worst quality loss.

Try to compare not just Q4, but also Q5 quants as well. Q4_K_L and Q4_K_XL quants would be the better ones, and Q5_K_M/ Q5_K_L/Q5_K_XL are the sweetspot, especially for MoE models with less than 5B active parameters.

[-]

iMakeSense@reddit

Could you give an example link to a decent comparison?

[-]

stddealer@reddit

Imatrix quants are inherently biased, and they're really only better if you're using them for tasks that are similar to the one on the imatrix training set. If you're using a different language, it will be worse for example.

[-]

ivoras@reddit

Where's the 2.3x throughput increase (the "key insight" from the image), if BF16 runs at 15.5 tps, and Q4_K_M runs at 22.5 TPS? That's about 45% increase, as it says on the lower-right box in the image?

Would it be correct to state that the quant-derived performance improvement is almost entirely because of memory footprint reduction?

[-]

Eisenstein@reddit

also, a change from 56.1% correct to 50.6% correct on human eval is not a 5.5% accuracy drop, it is a 10% accuracy drop.

[-]

llitz@reddit

That was bothering me so much...

[-]

Iory1998@reddit

It's about time you add Q6_K_M to the mix, please.

[-]

dpenev98@reddit

Thank for this experiment! I've been looking for this exact type of benchmarks.

Can you share your full hardware setup?

[-]

spaceman_@reddit

Very interesting to see these kinds of evals. Kind of surprised at the "damage" done to the Q8_0 model.

Are you guys planning to run these against other models as well? (other Qwen3.6 sizes or just a different model family, curious about either)

[-]

stddealer@reddit

Q8_0 being that much worse than Q4_K_M at hellswag just shows that the margins of error are huge for this test.

[-]

Content_Bite_4191@reddit

ofc margins would be huge on 1x pass. this comparison is totally unreliable

[-]

Monad_Maya@reddit

The blog doesn't really have any additional details, not even the prompt given to "Neo Engineer".

[-]

MelodicRecognition7@reddit

this thread is a smart advertisement to avoid deletion for breaking "limit self promotion" rule.

[-]

Monad_Maya@reddit

Certainly but the results are interesting.

I'm checking https://openbenchmarking.org/test/pts/llama-cpp&eval=db257a7755adbe206b4709bcdd5c5fdb23ba90fa#metrics and the OP's results are interesting to say the least.

Closest CPU on that chart is TR Pro 9975wx but it has 8ch memory and the OP's Epyc 9965 has 12ch memory. He's using a VM with 32vCPUs, maybe there's some overhead.

[-]

vulcan4d@reddit

This is very nice testimg. Those UD quants seriously need testing to see if they are all they are cracked up to be.

[-]

Maheidem@reddit

This great, really liked seeing. But imagine if it kept going all the way to a Q2 or something

[-]

ArugulaAnnual1765@reddit

I wonder how much better iq4_nl is than q4_k_m

[-]

SosirisTseng@reddit

Would like to know that, too. Unsloth only has the results of the MoE model: https://unsloth.ai/docs/models/qwen3.6#unsloth-gguf-benchmarks

[-]

ai_without_borders@reddit

useful benchmark but it's evaluating only one dimension of the quality-cost tradeoff. in practice the decision isn't just which weight quant, it's the joint allocation of your VRAM budget across weights, KV cache, and context. a Q4_K_M model with Q8_0 KV at 32k context has a very different quality profile than Q8_0 weights with Q4_0 KV at 16k context -- same hardware, wildly different operating points. the weight quant is usually the smaller quality hit compared to aggressive KV compression, which most evals skip. would be curious to see this extended with kv quant as a variable, especially at longer context lengths where the KV budget starts dominating.

[-]

MrMisterShin@reddit

Now include AWQ and FP8

[-]

Quirky_Inflation@reddit

That's just garbage in a graph

[-]

Equivalent-Ear-8016@reddit

Finally someone did the tests instead of guessing. I was tired of reading opinions on this sub without any substance behind them.

[-]

chitown160@reddit

and yet MXP8 and MXP4 are still slept on by Blackwell owners and also this benchmark.

[-]

xrvz@reddit

OP, do you also use the YYYY/DD/MM date format?

[-]

Eyelbee@reddit

How did you do the humaneval? Scores seem low

[-]

daily_spiderman@reddit

Agreed!

[-]

Pretend_Engineer5951@reddit

That's very strange results.

What kv cache quant did you use with llama.cpp? FYI: default f16 have an issue https://github.com/ggml-org/llama.cpp/issues/20035 . Unsloth recommends to use bf16 or q8.

And did you use base models or unsloth?

[-]

himefei@reddit

No one is questioning about how many years it took to complete these tests???

[-]

Look_0ver_There@reddit

What was the KV cache quantization used for each test?

[-]

Tagedieb@reddit

A similar comparison with different kv quants would also be useful.

[-]

ea_man@reddit

What bothers me the most with this release it the model size:

Now with QWEN3.6 you can't fit a Q4_k_m on a 16GB gpu and IQ3_XSS is borderline usable on a 12GB. Those are the smallest ones btw: https://huggingface.co/mradermacher/Qwen3.6-27B-i1-GGUF , unsloths quants are bigger.

3.5 was slightly smaller, I'd hope that next time they make like a \~24B version.

[-]

Ki1o@reddit

I'd love to see a benchmark that shows actual complex task completion with multi step + tool calls for these different models.

My instinct that I'd love to get data to prove is that minor reductions in quality from quantisations are more than made up for in increased token generation speed. Ultimately faster token output and rework feels like it would end up faster than slower token output (but high bench scores) plus rework

[-]

SmartCustard9944@reddit

Bad tokens in previous conversations pollute the context

[-]

bnolsen@reddit

q8_k_xl vs q8_0 ?

[-]

someone383726@reddit

Thanks for providing this service!

[-]

SmartCustard9944@reddit

You don’t mention who provides the quant. Also, would be interesting to measure hallucination rate, and tool calling accuracy, because it feels like these are some of the first things to go with quants.

[-]

Intelligent_Ice_113@reddit

this should be even better with dynamic quants and blinded model (to take up less RAM for code only tasks) 🤔

[-]

WhoRoger@reddit

Yass, this is much more useful than the synthetic KLD number. Q4 doing better than Q8 in some evals is interesting.

But I'd be careful about generalising the conclusions, especially since only Q4 and Q8 are compared here. Q6 may be the sweet spot with other models (especially the smaller ones). And then there's imatrix.

[-]

gvij@reddit (OP)

Complete Qwen 3.6 27B evaluation case study with benchmarking results, approach and code snippets are mentioned here: https://heyneo.com/blog/evaluating-qwen-3-6-27b-benchmarking-case-study

[-]

MelodicRecognition7@reddit

bullshit link: no code snippets, no plain text results, just an advertisement of that "Neo Engineer".

[-]

spaceman_@reddit

I was very interested but also couldn't find any code to reproduce the results of the study. Kind of defeats the purpose, because it's so easy to make procedural mistakes. Or have the LLM generated llama.cpp flags be wrong for one of the tests.

[-]

MelodicRecognition7@reddit

the first thing I wanted to do is to run this test on my hardware to verify the results because Q4 quant performing better than Q8 smells like AI hallucination.

[-]

Farmadupe@reddit

Yeah 100% this looks completely hallucinated. Q8_0 taking 42G vram is totally ballpark... BF16 taking 54G? No. Not even close. The weights on HF are bigger than 54G.

Also needs to tell us what version of llama.cpp and the subset of tests that was chosen. And to not synthesize "Q4" is best garbage right next to the charts.

Absolutely totally hallucinated.

[-]

spaceman_@reddit

Wow I didn't even notice that.

[-]

gvij@reddit (OP)

Mind your language. The blog might have missed adding the github repo link. Here it is: https://github.com/gauravvij/slm_eval_harness

Has the complete evaluation pipeline and steps.

[-]

Fedor_Doc@reddit

Qwen 3.5 35B A3B results fron github README show that Q4_K_M shows the best results in benchmarks compared to Q8_0 and BF16.

Have you analysed the results? Have quantization increased model performance?

[-]

pepedombo@reddit

Usually q8 is a bit slow, the gap seems to be very low though. I've found q5/q6 can loose some detail when prompted against q8 in coding. We need stronger benchmark which makes the differences more visible.

[-]

magnus-m@reddit

are these benchmarks multi-turn agent like?

[-]

estrafire@reddit

You should try kv caches too, q4_0, q4_1, q5_1 and q8_0

[-]

autonomousdev_@reddit

ran qwen 2.5 32b q4 vs q8 on my 3090 last week. q8 was like 10% better at keeping track of stuff over long context but it ate an extra 3gb of vram. for coding they felt the same honestly. just stuck with q4 and saved the headroom for tool calls

[-]

CheatCodesOfLife@reddit

bots love qwen 2.5 lmao //add 3 minute sleep to avoid rate limit

[-]

No_Dig_7017@reddit

This is awesome! Thanks for sharing!

[-]

UncleRedz@reddit

I'm missing the source of those quants, was it unsloth? Something else?

What's becoming very clear is that the old method of applying a quant across the board is not the way to do it anymore, some parameters are more important than others. This also means that how this quant was made, is very important for determining actual quality after quantisation.

Also the test samples here are unfortunately too small, 100 questions for each benchmark is not enough, you need to run the full benchmark. As an example, MMLU has something like 14.000 questions.

Last feedback, you are missing a failure counter, not just pass / fail on a test, but a third state, on did the model answer but it did not comply with instructions and answered in the wrong format or went off the rails? As a model is more heavily quantized this error state goes upp and can cause all sorts of unexpected issues, so its good to keep track of in any benchmarking.

[-]

PiaRedDragon@reddit

Try it again with t=0.3

[-]

Healthy-Nebula-3603@reddit

Nice

Thanks

[-]

ggGeorge713@reddit

Would love to see SWE Bench verified in there.

Any chance you tested that as well?

[-]

2Norn@reddit

all i need in my life is prismml to do the same ternary shit on 3.6 27b

[-]

fgp121@reddit

So I guess Q4_k_m is the best one in terms of hardware efficiency vs quality trade off?

[-]

One_Key_8127@reddit

Gemma 3 4B, released over 1 year ago, scores 71% on HumanEval. I think something is very much off with this result on HumanEval.

[-]

nunodonato@reddit

are these unsloth's quants?

[-]

Monad_Maya@reddit

What's the hardware setup other than the generic 32 vCPU and 125GB RAM? There are no details about how you measured throughout/TTFT etc and at what context size.

Additionally was the KV cache quantized?

[-]

gvij@reddit (OP)

Processor: 
    AMD EPYC 9655 (Genoa/Bergamo generation)
    vCPUs: 32 Cores
    Clock Speed: 2.0 GHz (Base)
    Virtualization: KVM Hypervisor (QEMU)
    Key Instruction Sets: AVX-512 (F, DQ, BW, VL, VNNI), SHA-NI, VAES.
    RAM: 125.8 GB (125 GiB)


Software/Inference Context
   OS: Linux (Ubuntu 24.04 LTS "Noble")
   Quantization: Q4_K_M (4-bit GGUF)
   Backend: llama-cpp-python (v0.3.20)
   Optimization: Compiled with AVX-512 support

No explicit KV cache quantization was applied.

[-]

Constandinoskalifo@reddit

Thanks for posting! Could you test the 35B one as well?

[-]

gvij@reddit (OP)

Yes, we have evaluated it already. Results should be out soon.

[-]

gvij@reddit (OP)

Sure, would be happy to experiment on the same and share the results soon on this 😄

[-]

TheCat001@reddit

Haha, nice, and after this Qwen funboys still gonna say my Q4 quants sucks? xD