Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...)
Posted by bobaburger@reddit | LocalLLaMA | 170 comments
The following is a non-comprehensive test I came up with to measure the quality difference (a.k.a. degradation) between different quantizations of Qwen 3.6 27B. I want to figure out the best quant to run on my 16 GB VRAM setup.
WHAT WE ARE TESTING
First, the prompt:
Given this PGN string of a chess game:
1. b3 e5 2. Nf3 h5 3. d4 exd4 4. Nxd4 Nf6 5. f4 Ke7 6. Qd3 d5 7. h4 *
Figure out the current state of the chessboard, create an image in SVG code, also highlight the last move.
I want to see if the models can:
- Track the state of the board after each move to reach the final state (the first half of move 7)
- Generate the right SVG image of the board, correctly place the pieces, and highlight the last move
And yes, in case you're wondering: it's possible the model was trained to do exactly this on existing chess games, so I came up with some random moves, the kind of moves that no player above 300 Elo would ever play.
For those who are not chess players, this is how the board is supposed to look after move 7. h4. Btw, you're supposed to look at the piece positions and the board orientation, not the image quality, because this is just a screenshot from Lichess.
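For anyone who wants to double-check the expected position mechanically, here's a minimal sketch using the python-chess library (this wasn't part of the test itself, just a convenient way to derive the ground truth):

```python
import io
import chess.pgn

# Replay the test PGN move by move to get the ground-truth final position.
pgn = io.StringIO(
    "1. b3 e5 2. Nf3 h5 3. d4 exd4 4. Nxd4 Nf6 5. f4 Ke7 6. Qd3 d5 7. h4 *"
)
game = chess.pgn.read_game(pgn)
board = game.board()
for move in game.mainline_moves():
    board.push(move)

print(board)        # ASCII diagram of the final position
print(board.fen())  # FEN string, handy for diffing against a model's answer
```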

CAN OTHER MODELS SOLVE IT?
Before we get to the main part, let me show the results from some other models. I find it interesting that not many models were able to figure out the board state, let alone render it correctly.
Qwen 3.5 27B
It mostly figured out the final position of the pieces, but still rendered the original board state on top. It highlighted the wrong squares, and the board orientation is wrong.

Gemma 4 31B
Nice chess.com flagship board style. I would say it figured out the board state, but it failed to render it correctly. The square pattern is also messed up.

Qwen3 Coder Next
I don't know what to say, quite disappointed.

Qwen3.6 35B A3B
As expected, the 35B is always the fastest Qwen model, but at the same time, it managed to fail the task successfully in many different ways. This is why I decided to find a way to squeeze the 27B into my 16 GB card. The speed alone just isn't worth it.

HOW DOES QWEN3.6 27B SOLVE IT?
All the models here are tested with the same set of llama.cpp parameters:
- temp 0.6
- top-p 0.95
- top-k 20
- min-p 0.0
- presence_penalty 1.0
- context window 65536
The BF16 version was run via OpenRouter, the Q8 to Q4_K_XL versions on an L40S server, and the rest on my RTX 5060 Ti.
The SVG code was generated directly in the llama.cpp Web UI without any tools or MCP enabled (I originally ran this test in Pi agent, only to find out that the model tried to peek into the parent folders, found the existing SVG diagrams from higher quants, and copied most of them).
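For reference, each run boils down to a single chat request. Here's a minimal sketch of the equivalent raw call against llama-server's OpenAI-compatible endpoint (default port 8080 assumed; top_k and min_p are llama.cpp extension fields, so adjust if your build differs):

```python
import requests

PROMPT = (
    "Given this PGN string of a chess game:\n"
    "1. b3 e5 2. Nf3 h5 3. d4 exd4 4. Nxd4 Nf6 5. f4 Ke7 6. Qd3 d5 7. h4 *\n"
    "Figure out the current state of the chessboard, create an image in SVG "
    "code, also highlight the last move."
)

# Same sampling parameters as the web UI runs described above.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": PROMPT}],
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "min_p": 0.0,
        "presence_penalty": 1.0,
    },
)
print(resp.json()["choices"][0]["message"]["content"])  # contains the SVG
```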
BF16 - Full precision
This is the baseline of this test. It has everything I needed: right position, right board orientation, right piece colors, right highlight. The dotted blue line was unexpected, but it's also interesting because, as you will see later, not many of the higher quants generate it.

Q8_0
As expected, Q8 retains pretty much everything from full precision except the line.

Q6_K
We start to see some quality loss here, namely the placement of the rank-5 pawns. The look of the pieces is mostly because Q6 decided to use a different font; none of the models here tried to draw their own pieces in this test.

Q5_K_XL
Looks very similar to Q8, but it's worth noting that the SVG code of the Q5 version is 7.1 KB, while Q8's is 4.7 KB.

Q4_K_XL and IQ4_XS
If you ignore the font choice, you will see that Q4_K_XL is the more complete solution, because it includes the board coordinates.

Q3_K_XL and Q3_K_M

IQ3_XXS
Now here's the interesting part: everything was mostly correct, the piece placements and the highlight, and there's the line on the last move!
But IQ3_XXS gets the board orientation wrong; see the light square on the bottom left?

Q2_K_XL
This is just a waste of time. But hey, it got all the piece positions right. The board is just not aligned at all.

SO, WHAT DO I USE?
I know a single test is not enough to draw any conclusions here. But personally, I will never go for anything below IQ4_XS after this test (I had bad experiences with Q3_K_XL and below in other tries).
On my RTX 5060 Ti, I got around pp 100 tps and tg 8 tps for IQ4_XS with vanilla llama.cpp (q8 for both ctk and ctv, fit on). But with TheTom's TurboQuant fork, I managed to get up to pp 760 tps and tg 22 tps by forcing GPU offload for all layers (`-ngl 99`), which is quite usable.
llama-cpp-turboquant/build/bin/llama-server -fa 1 -c 75000 -np 1 --no-mmap --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence_penalty 1.0 -ctk turbo4 -ctv turbo2 -ub 128 -b 256 -m Qwen3.6-27B-IQ4_XS.gguf -ngl 99
The only downside is that I have to keep the context window below 75k and use turbo4/turbo2 for the KV cache quant.
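If you want a rough feel for why the context cap moves with the KV cache quant, here's a back-of-envelope sketch. The layer/head numbers below are placeholders, not the real Qwen3.6-27B config (read the real values from the model's config.json), and I'm approximating turbo4/turbo2 as 4-bit and 2-bit block quants with typical GGUF-style overhead:

```python
# Illustrative KV cache sizing -- architecture numbers are ASSUMED, not real.
layers, kv_heads, head_dim = 48, 8, 128   # placeholder config values
ctx = 75_000

# Approximate bytes per element including block overhead:
# f16 = 2.0, q8_0 = 34/32, 4-bit ~ 18/32, 2-bit ~ 10/32 (rough guesses).
bpe = {"f16": 2.0, "q8_0": 34 / 32, "turbo4": 18 / 32, "turbo2": 10 / 32}

for k, v in [("f16", "f16"), ("q8_0", "q8_0"), ("turbo4", "turbo2")]:
    per_token = layers * kv_heads * head_dim * (bpe[k] + bpe[v])
    print(f"K={k:6} V={v:6} -> {per_token * ctx / 2**30:5.2f} GiB at {ctx} ctx")
```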
Below are some examples of different KV cache quants.


You can see all the results directly here: https://qwen3-6-27b-benchmark.vercel.app/
Client_Hello@reddit
Gemma4 31B, Q4_K_M, and Q8_0 kv cache
5060 Ti 16GB + 2070 Super 8GB; llama.cpp with fit-target 256 gives 43k context, gen 16.5 tps, pulls 290 watts at the wall during gen.
uhuge@reddit
It has the top right army man wrong?
Insomniac1000@reddit
I also use Q4_K_M, but with Qwen 3.6 27B, Q8 K, Turbo 3 V, so I can squeeze in a 262,000 context limit (though I make sure to summarize the context once it reaches about 180,000 tokens). I need to try out this test.
slavap_@reddit
https://huggingface.co/DavidAU/Qwen3.6-27B-NEO-CODE-Di-IMatrix-MAX-GGUF/blob/main/Qwen3.6-27B-NEO-CODE-2T-OT-Q6_K.gguf
Shoddy-Tutor9563@reddit
interesting test, but I'm missing details:
- you compared different MODELs without saying a word about which quants you used
- you mention that your tests weren't just one-off observations, so I assume you did a series of tests - but again, no further details
bobaburger@reddit (OP)
Thanks for the feedback, but what made you think this post is about comparing different models, and which part of the post made you think I did not mention what quants were tested?
Yes, I did not list every generation result of each test. As others already mentioned, since the temperature is 0.6, the results are not always the same; they vary in different ways, and in this post I only recorded the screenshot of the most complete output for each. Think of it like a best-of-5 result.
Shoddy-Tutor9563@reddit
Hey! First of all, thanks for your reply.
> what made you think this post is about comparing different models
Your words :) You're saying "models A, B and C didn't pass my test, but D did it!" (not exactly, but that's the idea). And then you're showing us how various quants of model D perform.
My question is - when you abandoned models A, B and C as "not smart enough" - what quants did you use?
> think of it like best-of-5 result
Good! This is already something! How were the other 4 you didn't include? Were they passing your criteria or not? Ideally I wanted to see something like this: "out of 10 runs of quant XXX, 2 were brilliant - as below, 3 formally passed but had issues, 5 didn't pass at all".
Because if you just include the best run, it doesn't tell how consistent the model is - does it produce the same good/bad result all the time, or does it have consistency issues and produce good results (even on your one test) in only 20% of cases?
Joozio@reddit
IQ4_XS vs Q5_K_M is usually where I land for coding tasks. Q4_K_M saves VRAM but the formatting degradation shows up more in agentic use than in chat benchmarks. Anything requiring consistent structured output gets noticeably less reliable at Q4.
keen23331@reddit
daddywookie@reddit
Thanks for giving me a really interesting Saturday exploring all of this. I'm pretty low-spec (8GB VRAM), so I'm fairly limited. All of the smaller models (4-9B) were pretty useless. The best result came from Qwen3.6-35B-A3B, but with the temperature turned down to 0.2.
Basically, a lot of the other models were just overthinking everything. I tried rewriting the prompt in ChatGPT, using OpenCode, and even trying to create scripts instead of direct SVG generation, and I now have a lot of very weird chess boards.
ilikesmellytofu@reddit
> BF16 has everything I needed: right position...
Missing the f7 pawn. And this roughly lines up with my own testing with this exact prompt, as I was trying to verify your results. Every quant will sometimes output a great result, but also, a lot of the time, make one or two mistakes that aren't immediately obvious. For example, unsloth Q4_K_XL spat out a perfectly correct result in 4/10 runs in my testing. Q6_K_XL was marginally better at 5/10. But even BF16 had issues with consistently outputting a perfect result.
DOAMOD@reddit
My test: 3.6 27B, 3 runs each for Q4XL / Q5XL / Q6XL / Q8
Q4XL Run1 (not final run)
DOAMOD@reddit
Q8 Run 1
DOAMOD@reddit
Q8 Run 2
DOAMOD@reddit
Q8 Run 3
DOAMOD@reddit
Q6XL Run1
DOAMOD@reddit
Q6XL Run 2
DOAMOD@reddit
Q6XL Run3
DOAMOD@reddit
Q5XL RUN1 (not final)
DOAMOD@reddit
Q5XL Run2
DOAMOD@reddit
Q5XL Run3
DOAMOD@reddit
Q5XL Run3 runerrorsfix
DOAMOD@reddit
Q5XL RUN1 Autothinkfix1
DOAMOD@reddit
Q4XL Run1 autothinkfix final
DOAMOD@reddit
Q4XL Run2
DOAMOD@reddit
Q4XL Run3
Address-Street@reddit
Gemma 4 31B NVFP4, KV cache Q8_0, temp 1.0, top_k 64, top_p 0.95. Here’s my first try. All piece positions are correct, but the square pattern is messed up.
GGUF: CISCai/gemma-4-31B-it-NVFP4-turbo-GGUF.
fgp121@reddit
This is a really thorough and helpful eval - the chess puzzle approach is clever because it tests both reasoning and generation in a way that's hard to game. I recently came across a similar exhaustive benchmark of Qwen 3.6 27B done by Neo at heyneo.com if you want another data point to compare against.
Kaioh_shin@reddit
Qwen3.6-27B-NEO-CODE-HERE-2T-OT-IQ4_XS.gguf
kayox@reddit
What’s the hugging face url for this one?
Kaioh_shin@reddit
https://huggingface.co/DavidAU/Qwen3.6-27B-Heretic-Uncensored-FINETUNE-NEO-CODE-Di-IMatrix-MAX-GGUF/blob/main/Qwen3.6-27B-NEO-CODE-HERE-2T-OT-IQ4_XS.gguf
Pablo_the_brave@reddit
For me, turbo3/turbo3 is much better in this test than turbo4/turbo2... At least for https://huggingface.co/cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF
megakilo13@reddit
Very cool benchmark. And wow, Gemini Pro did it perfectly with the styling
wkernel@reddit
For experiment's sake, tried it with a coding agent (pi-agent, Qwen3.6-27B-UD-Q5_K_XL.gguf).
2 variants:
Thinking off: produced a Python script to figure out the board position first.
Thinking high: figured out the board position itself using reasoning. Found a mistake, then corrected itself.
The results were pretty similar. And they both gave a terminal output at the end with a nice Unicode rendering of the board, along with key observations about it.
mncharity@reddit
And a tastefully missing pawn at f7?
bobaburger@reddit (OP)
ohhh, good catch, I missed that. seems like i did not grab the best one from the run for this version.
MatthKarl@reddit
Nice test. I was trying to replicate that and ran it on 3 local models I have.
- GPT-OSS-120B failed. The SVG didn't load as some comments were malformed. Board orientation is fine though.
- Gemma-4-31B got the SVG correct with all figures right, including the highlighting. However, the figures are a bit small in the squares.
- Qwen-3.6-35B produced the nicest SVG, with nice figures filling the squares nicely. The pawn on e2 is missing though, and the numbering of the squares is offset by one. And it states "After 7. h4* - White to move".
Guess I should be using Gemma-4 a bit more now then, although it was the slowest at some 5.5 t/s.
mncharity@reddit
I tried one run of Qwen3.6-35B-A3B-UD-Q6_K with coding parameters and a 65k cache under pi, but suffixed the prompt with encouragement to be careful and double-check its work. The result was pretty, though verbose at 7.6k (CSS classes for square color, but not for size or positions). And it initially forgot a knight, only catching and fixing that when going back over the file to check it.
bobaburger@reddit (OP)
That sounds exactly like what I experienced with 35B; the results are nice and beautiful but always have errors.
Address-Street@reddit
What quants are you using for weights and KV cache? From what I know, Gemma is very sensitive to quantization.
MatthKarl@reddit
Uhm, I'm not 100% sure. I use the unsloth/gemma-4-31B-it-GGUF:UD-Q8_K_XL.gguf file from huggingface with q8_0 cache (I believe). Is that possible?
LocalAI_Amateur@reddit
Try https://github.com/spiritbuun/buun-llama-cpp you'll get more context out of it.
Interesting test. Thanks for sharing.
FatheredPuma81@reddit
https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4150413357 I would personally recommend that everyone wait until someone provides real benchmarks (like those shown in the link, not PPL or KLD) showing this fork's implementation is better than Q4_1.
draconic_tongue@reddit
https://gist.github.com/Enferlain/30f3aa5e7e94b0696276b492fa190529
FatheredPuma81@reddit
:o Any chance you'd be willing to run Turbo3 and Q4_1?
draconic_tongue@reddit
Yeah, I'll run turbo3, but this model is very annoying with unlimited context, it takes hours. Idk about q4_1 though, don't really see the point. q4 should perform similarly to tq4.
FatheredPuma81@reddit
Sorry, I didn't have time to look at more than the AIME results. The purpose would mostly be a sanity check? q8 and tq4 managing to appear lossless is one thing, but if tq3 and q4_1 show the same lossless result in the benchmark, then something is wrong (with what, idk).
It's surprising that turbo4 is worse than q4_0 in the KLD and PPL tests though.
LocalAI_Amateur@reddit
Sound advice, for sure. But if we were people of patience, we would not be here compiling llama.cpp forks and trying to squeeze out every last bit of room for context.
I say, use it and test it. No amount of benchmarks can replace how it performs in the real world.
bobaburger@reddit (OP)
thanks, i will take a look. i wonder what's the difference between this and TheTom's fork. i've seen this mentioned a couple of times before.
Ell2509@reddit
How did you get it to generate images?
I am trying to use it on Linux, and multimodal will not work. Using the text-only version from hauhau works, and I have not asked about images because it supposedly does not have the image part.
Also, what UI are you using? I don't think I could find it. Are you doing straight curl from the command line? Or using something like Open WebUI?
bobaburger@reddit (OP)
it’s llama.cpp web UI. the model generate SVG code and then i copy it to save as a file, its not generating images directly
ElChupaNebrey@reddit
It's crazy how bad the 35B MoE model is.
pftbest@reddit
It's not bad, I think the OP set the temperature a bit low.
uptsi@reddit
Tried to one-shot this in Claude using Opus 4.7 and 4.6... The results are worse than the Qwen MoE 35B.
moahmo88@reddit
mudler/Qwen3.6-35B-A3B-APEX-GGUF I-Balanced can SOLVE IT.
External_Dentist1928@reddit
Even the full precision model couldn’t according to OP…
pftbest@reddit
not true, I get a correct result with 35B-A3B even at 4 bits, every time. Maybe there is some problem with the temperature parameters set by OP. For example, for Gemma4 the manual says the temperature must be set to 1.0; I suspect that's why it failed the test.
pftbest@reddit
The moe model generated the board correctly, even at 4 bits
unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_XL
Running on integrated graphics 780M at 14 tg/s
MotokoAGI@reddit
deepseekv4flash-UD-Q2
Evgeny_19@reddit
Very interesting test, thank you!
I think something is off with unsloth Q8. Here is the result of Q8_K_XL
Evgeny_19@reddit
This one is from Q5_K_XL
bobaburger@reddit (OP)
yeah, looks like Q5 always produces results with the same appearance as Q8, i cannot tell why.
Evgeny_19@reddit
And Q6_K_XL looks almost perfect
MyOtherBodyIsACylon@reddit
If you’re able to run vllm, I’d be very curious to know how the cyankiwi AWQ BF16 INT4 does:
https://huggingface.co/cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4
bobaburger@reddit (OP)
Thanks. I will try. But I'm afraid it's gonna be too big for my card :D
MyOtherBodyIsACylon@reddit
Now that we have a result for this model, would you be willing to add it to your original post? This comparison is super helpful :-)
bobaburger@reddit (OP)
i'll try to get some time to update the post accordingly
dondiegorivera@reddit
Here you go:
Details: 2xRTX3090, 128GB DDR4, Ryzen 9 5950x - MTP=3, ~78 tok/s
vllm serve /data/models/Qwen3.6-27B-AWQ-BF16-INT4 --host 0.0.0.0 --port 8000 --served-model-name qwen3.6-27b-awq --tensor-parallel-size 2 --max-model-len 131072 --gpu-memory-utilization 0.92 --max-num-seqs 2 --max-num-batched-tokens 8192 --enable-chunked-prefill --enable-prefix-caching --mm-encoder-tp-mode data --mm-processor-cache-type shm --trust-remote-code --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
MyOtherBodyIsACylon@reddit
Thanks for running this.
Looks like it skipped adding the board coordinates? Otherwise it appears like a combination of BF16 and Q8_0.
dondiegorivera@reddit
Yes, it seems to be very strong. I also tried with thinking disabled; the output degraded a lot.
dondiegorivera@reddit
I just deployed that exact model on my rig. Will give this test a try.
audioen@reddit
Single-shot tests are not very useful for grading models, except in the coarsest terms. The model's output is probabilistic, and you would need to get its "average output" in order to truly measure what the quantization damage is. This involves making a dozen or so outputs per quant per model, somehow grading them to identify what the "average" is, then comparing the average output of every model against the others.
With a single shot, you might randomly get a high-quality output that is somewhere in, say, the 90th percentile of the model's ability spread and end up comparing it against a 10th-percentile output of another quant; this is probably enough to flip the ordering and render the results misleading. Single-shot tests like these can reliably tell apart only very different quality or ability levels, and there is no obvious way of ordering the results other than inspecting them visually and checking whether things are arranged properly, appropriately sized, and all the requested features are present.
I'd recommend that you instead make the model just do math, like compute arithmetic that involves summing twenty 1-2 digit integers. This is something where you can repeat the test many times, can grade it automatically for correctness, as the answer is easy to verify, and the difficulty can easily be adjusted by making the numbers bigger and the number of terms larger, in case all models are scoring 100%.
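A minimal sketch of what that auto-graded arithmetic test could look like (assuming a local OpenAI-compatible llama-server endpoint; the helper names and grading rule here are illustrative, not audioen's actual setup):

```python
import random
import re
import requests

URL = "http://localhost:8080/v1/chat/completions"  # local llama-server assumed

def one_trial(n_terms=20, max_val=99):
    # Generate a fresh random sum so the answer can't be memorized.
    nums = [random.randint(1, max_val) for _ in range(n_terms)]
    prompt = f"Compute {' + '.join(map(str, nums))}. Reply with the final number only."
    resp = requests.post(URL, json={
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
    })
    text = resp.json()["choices"][0]["message"]["content"]
    # Grade automatically: the last integer in the reply must equal the true sum.
    found = re.findall(r"-?\d+", text.replace(",", ""))
    return bool(found) and int(found[-1]) == sum(nums)

n = 50
print(f"{sum(one_trial() for _ in range(n))}/{n} correct")
```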
bobaburger@reddit (OP)
thanks for the feedback. maybe the wording in the post makes it confusing. this is a single test, but for each model i generated about 5 different results. so it's like 5 shots.
anobfuscator@reddit
How was the test/retest consistency?
bobaburger@reddit (OP)
the styling of the pieces varies for the same model between runs, but the placement and board patterns are always the same.
FatheredPuma81@reddit
Tbh all this post has done is reinforce my belief that 4-bit is the sweet spot, that 3-bit is very usable (despite what many say), and that beyond 5-bit you're better off upgrading your model (if possible).
I'm sure this won't do anything about those who get upset when you compare much larger models at 3-bit (122B UD-Q3_K_XL) to smaller models at 4-bit (35B IQ4_NL).
Monad_Maya@reddit
This can be model dependent I believe. Some models don't respond too well to quantization.
FatheredPuma81@reddit
I know very little about the creation of LLMs, but surely models that respond poorly to quantization are a thing of the past, right? At this point model quantization is so huge that all they're doing is shooting themselves in the foot by not testing for this at a smaller scale before committing to training a new model.
MerePotato@reddit
Gemma 4 doesn't take well to quantization below 6 bit, particularly the MoE
Monad_Maya@reddit
Not always, for example - https://x.com/SkylerMiao7/status/2004887155395756057
Gemma4 also feels quite sensitive to quantisation. I haven't performed any exhaustive testing, but you can notice the difference in tool calls, especially if the KV cache is also heavily quantized.
gpalmorejr@reddit
This is part (but admittedly only part) of the reason I didn't stick with Gemma 4 very long. MoE doesn't offload as well, and the lower quants of Gemma 4 that I can use act weird and different. Whereas I can quantize Qwen3.5/3.6 down to IQ2 and, for most simple stuff, get an almost identical answer. For more complex stuff I can bump quants but keep 35B for the faster speed. If I need something mega accurate, I can run 27B, but because I'm mostly on CPU offload with it, I either have to use 27B at like IQ2 to get even 3.9 t/s, or I have to use a smaller model anyway. But the smaller Qwen models also show identical responses for simple prompts and stray from each other only by a small amount that seems to grow roughly linearly with the complexity of the request. So Qwen provides an ecosystem that makes it super easy to just load the model you need for a particular project and run it, like grabbing a CD from a shelf and putting it into your stereo. So I just always stuck with Qwen, even though I tried the others.
kombersninja2@reddit
Deep.
halbritt@reddit
Interestingly, I got better throughput with IQ4_XS than with Q3_K_M. I thought the 3-bit quant would be faster and I could give up some quality, but with ik_llama.cpp, I got ~167 tok/s with IQ4_XS and ~100 tok/s with Q3_K_M.
Fit_Split_9933@reddit
Here's a pure version of IQ4, smaller than the regular IQ4. Perhaps you could test it:
https://huggingface.co/Ununnilium/Qwen3.6-27B-IQ4_XS-pure-GGUF
bobaburger@reddit (OP)
Thanks! I will take a look.
Pablo_the_brave@reddit
Could you also look at this? Also smaller, but it should be much better than pure: https://huggingface.co/cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF
TheBlueMatt@reddit
BF16 is missing the pawn on f7
hidden2u@reddit
am i crazy or did q3_k_m totally ace it?
bobaburger@reddit (OP)
for this test, kinda. in my other tries, it's not always as good. but if VRAM is tight, Q3_K_M could be a decent choice.
kombersninja2@reddit
This was very insightful, thank you very much.
Monad_Maya@reddit
Nice work, IQ4_XS is a good balance I feel. Works fine with q8 KV cache.
bobaburger@reddit (OP)
i feel the same way. even at turbo4 kv, it's still very usable.
Monad_Maya@reddit
Never tried lower than q8 KV or any turbo quant. I saw some tool-call failures even at q8, but no empirical testing was performed, just vibes.
Ok-Measurement-1575@reddit
Needs a tldr
tavirabon@reddit
TLDR: only Qwen 3.6 is good at drawing SVG, and somehow only bf16, q8, or q4, not q5/q6.
-Ellary-@reddit
Right now I like to use IQ4_XS and IQ3_XXS for simple tasks that need speed and context.
IQ4_XS is a nice balance of size/performance.
IQ3_XXS is basically a Q2-size quant, but the performance is way better.
So it is like the `Daniel and cooler Daniel` meme.
bobaburger@reddit (OP)
I think it’s fair to put IQ3_XXS somewhere with Q3_K_S rather than Q2 :D
-Ellary-@reddit
I mean that Q2K is 12.6 GB and IQ3XXS is 13 GB.
Q3KS is 14.3 GB, way bigger.
DeepV@reddit
Way more unique than the pelican svg test.
Any plans on testing Prismascout? https://huggingface.co/rdtand/Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm
Azurasy@reddit
Qwen3.6-27B-int4-AutoRound
taoyx@reddit
I'm working on chess and LLMs, so this is very interesting, thanks. I didn't even think about asking for SVG output.
-ScaTteRed-@reddit
Good job dude👍
bobaburger@reddit (OP)
Thanks!
ahmcode@reddit
Very cool way to test; in my opinion it's relevant! I will use the SVG generation idea to complement my "sudoku test" 😁
bobaburger@reddit (OP)
that could work! it's way more complex than this anyway. i bet there will be a lot of fan noise and hundreds of thousands of thinking tokens spent. :D
marscarsrars@reddit
This is amazing thank you
bobaburger@reddit (OP)
Thank you much
marscarsrars@reddit
Really, this may not seem like much to you, or maybe you are just humble, but there is a lack of dynamic testing.
Current benchmarks have become nearly useless, as they have become targets rather than something to aspire to.
Consumerbot37427@reddit
I really like how the evaluation is completely objective: are the pieces in the right place? Is the board oriented correctly?
Probably not a great benchmark to compare completely different models, but for the purpose of comparing quants, it's great!
marscarsrars@reddit
Nevertheless a unique approach perhaps this can become a new benchmark.
NineThreeTilNow@reddit
That's a great quote.
You're looking for something that falls totally out of distribution.
bobaburger@reddit (OP)
exactly :)) why would any LLM train for anything like that on purpose? so it must be safe.
mfudi@reddit
That's awesome, thank you! Would be interesting to see the same for gemma4 variants
bobaburger@reddit (OP)
thank you!
cleversmoke@reddit
Amazing work! I really love this type of analysis, thank you! With this, I'll stick with Q5_K_M at 112k ctx and Q5_K_CL at 96k ctx. I noticed anything after ~90k ctx degrades so much with q8_0 KV cache.
bobaburger@reddit (OP)
thank you! and at around 90k and up, i saw the speed drop a lot too.
RIP26770@reddit
That's the kind of benchmark we are all craving 😂! Thanks for sharing, bro.
bobaburger@reddit (OP)
thank you!
NoPresentation7366@reddit
Brilliant post, thank you so much for this!
bobaburger@reddit (OP)
tysm!
Address-Street@reddit
Which quant did you use for Gemma 31B?
bobaburger@reddit (OP)
I used the API from OpenRouter. Could be BF16.
My_Unbiased_Opinion@reddit
I've been using UD IQ3XXS with 262K context. It's been great. It's far better than IQ4XS 35B with the same context. Q3 dynamic quants are pretty damn good.
bobaburger@reddit (OP)
Yes, I was using it for a month until I realized I can still get around 20 tps with IQ4_XS on my card.
Happythen@reddit
I bet that took some time to setup and run, thanks for that! Really interesting challenge for the different quants.
bobaburger@reddit (OP)
Thanks. I had to rent a cloud GPU for the big quants, but it’s fun. :D
marscarsrars@reddit
Mate, I can hook you up with some speed if you want to run some more tests like this.
As long as you keep posting results, and you can do a few fine-tunes for me, I won't charge anything :)
bobaburger@reddit (OP)
thank you so much!!! tbh i don’t do this regularly but the next time, i will definitely reach out!!!
Tartarus116@reddit
Awesome! We need more quant-level comparisons; KLD scores alone are not enough.
autonomousdev_@reddit
used q6_k for my coding agent setup, and honestly the speed difference from q4 was barely there, but it handled complex multi-step prompts way better. iq3_xxs just hallucinates function calls nonstop in my experience. went back to q5_k_xl for the agent pipeline i put together at agentblueprint.guide and it's a good middle ground.
Consumerbot37427@reddit
Thanks for this!
Tried Qwen 3.5 397B @ IQ2_XXS and it had all kinds of mistakes.
Qwen 3.6 27B GGUF @ 8-bit was good, but the exact same model in MLX had multiple mistakes.
I've always suspected MLX models have quality issues and have avoided using them. This test seems to confirm that, albeit I've only run each once so far. With this model, MLX is a bit slower, too (15 tps vs 17), so it's lose-lose.
ClearApartment2627@reddit
I wonder why Q6K fails to render the e2 pawn while lower quants get it right. Sure, the model is probabilistic, but OP wrote that he ran the tests several times.
flarenz@reddit
I used GPT Image 2.0:
Eyelbee@reddit
Great test, honestly. I'd be interested in making a spatial chess understanding benchmark; it might be a good idea. We could create a chess-moves dataset and have the model generate the final board state for every task, then score the accuracy. We could request an ASCII diagram or FEN notation to see if the models can understand the final board state from the moves alone, then check it deterministically. Could be a useful benchmark.
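A quick sketch of that idea, using random legal games generated with the python-chess library and graded by exact FEN match (illustrative only):

```python
import random
import chess

def random_game(n_plies=14, seed=None):
    # Play random legal moves from the start position; random games are
    # very unlikely to appear in training data, like OP's sub-300-Elo line.
    rng = random.Random(seed)
    board = chess.Board()
    moves = []
    for _ in range(n_plies):
        legal = list(board.legal_moves)
        if not legal:
            break
        move = rng.choice(legal)
        moves.append(board.san(move))  # record SAN before pushing
        board.push(move)
    return moves, board.fen()

moves, true_fen = random_game(seed=42)
prompt = f"Moves: {' '.join(moves)}. Output only the final FEN."
# Deterministic grading: score = (model_answer.strip() == true_fen)
print(prompt)
print("Expected:", true_fen)
```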
FrozenFishEnjoyer@reddit
As someone with a 5070 Ti, what do you suggest I use? Also, that turbo quant looks interesting, but can't you do that `-ngl 99` flag with normal llama.cpp?
bobaburger@reddit (OP)
yes, you can do it with q4/q4 for the KV cache instead of turboquant, but I found the quality was worse than turbo4/turbo2 (you can see the last screenshots in the post). the 5070 Ti is faster than mine, so I think you will get 25 or 30 tps.
Fedor_Doc@reddit
What llama.cpp version was that? Rotation (same principle as turboquant) was recently added to k/v cache quantization by default. I was under the impression that it should be roughly comparable.
grumd@reddit
Attention rotation is only a part of what makes turboquant work, so theoretically those turboquant forks can still be better than default llama.cpp.
bobaburger@reddit (OP)
i recompiled 5 days ago. so yeah, with attn rotation.
Flylink2@reddit
Normal llama.cpp doesn't support turboquant, if I'm not wrong... so -ngl 99 will give an out-of-memory error when you try to load qwen27b into your 16 GB VRAM GPU with this context length.
bobaburger@reddit (OP)
you can decrease the batch size and ubatch size to get some more room too.
mission_tiefsee@reddit
Interesting. But why didn't you test Q4_K_M?
bobaburger@reddit (OP)
mainly because there's no way i can run anything larger than IQ4_XS on my 5060 Ti. and on my cloud L40S node, it's faster to just try Q4_K_XL and up.
INT_21h@reddit
Whose quants did you use? Unsloth, Bartowski? This IQ4_XS popped up the other day and it's what I use on my 5060 Ti. https://huggingface.co/cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF
bobaburger@reddit (OP)
i use Unsloth. Bartowski's was faster than Unsloth on my M2 laptop, but on the 5060, Unsloth is faster for me.
INT_21h@reddit
Here's what cHunter789's IQ4_XS managed.
bobaburger@reddit (OP)
wow. that looks great
twack3r@reddit
Thanks for putting in the work!
Did you test model quantisation vs KV cache quantisation? I have personally become far more reluctant to use anything other than 16-bit for the KV cache. I keep that as a constant and treat the quant as the variable to match my context demand and VRAM constraint.
bobaburger@reddit (OP)
thanks. these were done with a bf16 KV cache. additionally, for the 4-bit and 3-bit quants, I did try different KV cache quants as well.
roofkid@reddit
I love this! It's so cool to see everything so visually.
One thing I have been wondering: what would happen if you had a control/QA loop in place? I mean a prompt a little more elaborate than "look at this screenshot and fix any deviation from the original requirements". I would be very curious if there are quants that cannot arrive at the correct solution even with a feedback loop.
My thought is that one-shotting is awesome, but at the same time, with enough speed I would also be OK if it just takes a little longer, especially if you're VRAM constrained. Even on big-VRAM systems the lower quants are a lot faster, so I wonder whether the total time taken will actually be higher or lower in the end.
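A minimal sketch of such a repair loop (the prompt wording and loop logic here are illustrative, and it assumes the same local llama-server chat endpoint used in the post):

```python
import requests

URL = "http://localhost:8080/v1/chat/completions"  # local llama-server assumed

def chat(messages):
    r = requests.post(URL, json={"messages": messages, "temperature": 0.6})
    return r.json()["choices"][0]["message"]["content"]

task = (
    "Given this PGN string of a chess game:\n"
    "1. b3 e5 2. Nf3 h5 3. d4 exd4 4. Nxd4 Nf6 5. f4 Ke7 6. Qd3 d5 7. h4 *\n"
    "Figure out the current state of the chessboard, create an image in SVG "
    "code, also highlight the last move."
)
history = [{"role": "user", "content": task}]
svg = chat(history)

for _ in range(3):  # a few self-repair rounds instead of pure one-shotting
    history += [
        {"role": "assistant", "content": svg},
        {"role": "user", "content":
            "Re-derive the position from the PGN move by move, compare it to "
            "your SVG, and fix any deviation. Output the corrected SVG only."},
    ]
    svg = chat(history)

print(svg)
```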
bobaburger@reddit (OP)
yeah i’ve thought about this. for this particular example, maybe the models will eventually fix it correctly, depends on how you prompt it. maybe lower quants will take more turns, and it will break something along the way when fixing stuff.
but the quality difference will be reflected better if we use it for tasks like researching, planning. it’s likely they will miss out some important details or something similar.
MrPecunius@reddit
Very cool test and results presentation, thank you!
bobaburger@reddit (OP)
thank you!
Blues520@reddit
Great test to illustrate the accuracy visually
bobaburger@reddit (OP)
thanks!
jacek2023@reddit
Great work, congratulations on testing a real use case and various quants. I just hope you tested them multiple times.
bobaburger@reddit (OP)
thank you! yes. i did run multiple times for each test.
moahmo88@reddit
Good job! Thanks for sharing!
bobaburger@reddit (OP)
thanks!
rm_rf_all_files@reddit
Try bigcodebench. Wonder if you can go above 33.
bobaburger@reddit (OP)
33 tps? that’s the speed i got for iq3_xxs, but not for any quants above that.
rm_rf_all_files@reddit
no, the score
mr_Owner@reddit
Amazing test! Could you make a ranked list of your findings?
From what i understood, iq3_xxs is the best small one without breaking too much?
bobaburger@reddit (OP)
thanks. i tried to avoid drawing any conclusions because it's just one test.
iq3_xxs actually gets the board orientation wrong (the bottom-left square should be dark, not light), so i would not recommend it.
Raredisarray@reddit
Very interesting!! Thanks for sharing. I'll definitely stick with q8.
FoxiPanda@reddit
Full disclosure: I skimmed this because it's super long.
Did you run each test only once, or did you do multiple takes to get a sense of whether any one run was an outlier? I've found in general that one run is not enough to determine actual quality - you end up with statistical noise that can make you believe a result that is just not true (though I will say, looking through the images, there is the trend line in quality degradation that one would expect).
bobaburger@reddit (OP)
Yeah, I ran each test multiple times; that's why I noted the font choice in the post, because it varies, but things like piece positions and placement on the board are pretty much the same between runs.
FoxiPanda@reddit
Cool, just making sure. It's a neat experiment, thanks for running it (I am sure it took some time to do lol)
bobaburger@reddit (OP)
Thanks! took me about 3 days or so for this, but it was fun 😂