Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...)
Posted by bobaburger@reddit | LocalLLaMA | 170 comments
The following is a non-comprehensive test I came up with to measure the quality difference (a.k.a. degradation) between different quantizations of Qwen 3.6 27B. I want to figure out the best quant to run on my 16 GB VRAM setup.
WHAT WE ARE TESTING
First, the prompt:
Given this PGN string of a chess game:
1. b3 e5 2. Nf3 h5 3. d4 exd4 4. Nxd4 Nf6 5. f4 Ke7 6. Qd3 d5 7. h4 *
Figure out the current state of the chessboard, create an image in SVG code, also highlight the last move.
I want to see if the models can:
- Track the state of the board after each move to reach the final state (the first half of move 7)
- Generate the right SVG image of the board, correctly place the pieces, and highlight the last move
And yes, in case you're wondering: it's possible the model was trained to do exactly this on existing chess games, so I came up with some random moves, the kind of moves that no player above 300 Elo would ever play.
For those who are not chess players, this is how the board is supposed to look after move 7. h4. Btw, you're supposed to look at the piece positions and the board orientation, not the image quality, because this is just a screenshot from Lichess.
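For anyone who wants to double-check the expected position mechanically, here's a minimal sketch using the python-chess library (this wasn't part of the test itself, just a convenient way to derive the ground truth):

```python
import io
import chess.pgn

# Replay the test PGN move by move to get the ground-truth final position.
pgn = io.StringIO(
    "1. b3 e5 2. Nf3 h5 3. d4 exd4 4. Nxd4 Nf6 5. f4 Ke7 6. Qd3 d5 7. h4 *"
)
game = chess.pgn.read_game(pgn)
board = game.board()
for move in game.mainline_moves():
    board.push(move)

print(board)        # ASCII diagram of the final position
print(board.fen())  # FEN string, handy for diffing against a model's answer
```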

CAN OTHER MODELS SOLVE IT?
Before we get to the main part, let me show the results from some other models. I find it interesting that not many models were able to figure out the board state, let alone render it correctly.
Qwen 3.5 27B
It mostly figured out the final position of the pieces, but still rendered the original board state on top. It highlighted the wrong squares, and the board orientation is wrong.

Gemma 4 31B
Nice chess.com flagship board style. I would say it figured out the board state, but it failed to render it correctly. The square pattern is also messed up.

Qwen3 Coder Next
I don't know what to say, quite disappointed.

Qwen3.6 35B A3B
As expected, the 35B is always the fastest Qwen model, but at the same time, it managed to fail the task successfully in many different ways. This is why I decided to find a way to squeeze the 27B into my 16 GB card. The speed alone just isn't worth it.

HOW DOES QWEN3.6 27B SOLVE IT?
All the models here are tested with the same set of llama.cpp parameters:
- temp 0.6
- top-p 0.95
- top-k 20
- min-p 0.0
- presence_penalty 1.0
- context window 65536
The BF16 version was run via OpenRouter, the Q8 to Q4_K_XL versions on an L40S server, and the rest on my RTX 5060 Ti.
The SVG code was generated directly in the llama.cpp Web UI without any tools or MCP enabled (I originally ran this test in Pi agent, only to find out that the model tried to peek into the parent folders, found the existing SVG diagrams from higher quants, and copied most of them).
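For reference, each run boils down to a single chat request. Here's a minimal sketch of the equivalent raw call against llama-server's OpenAI-compatible endpoint (default port 8080 assumed; top_k and min_p are llama.cpp extension fields, so adjust if your build differs):

```python
import requests

PROMPT = (
    "Given this PGN string of a chess game:\n"
    "1. b3 e5 2. Nf3 h5 3. d4 exd4 4. Nxd4 Nf6 5. f4 Ke7 6. Qd3 d5 7. h4 *\n"
    "Figure out the current state of the chessboard, create an image in SVG "
    "code, also highlight the last move."
)

# Same sampling parameters as the web UI runs described above.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": PROMPT}],
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "min_p": 0.0,
        "presence_penalty": 1.0,
    },
)
print(resp.json()["choices"][0]["message"]["content"])  # contains the SVG
```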
BF16 - Full precision
This is the baseline of this test. It has everything I needed: right position, right board orientation, right piece colors, right highlight. The dotted blue line was unexpected, but it's also interesting because, as you will see later, not many of the higher quants generate it.

Q8_0
As expected, Q8 retains pretty much everything from full precision except the line.

Q6_K
We start to see some quality loss here, namely the placement of the rank-5 pawns. The look of the pieces is mostly because Q6 decided to use a different font; none of the models here tried to draw their own pieces in this test.

Q5_K_XL
Looks very similar to Q8, but it's worth noting that the SVG code of the Q5 version is 7.1 KB, while Q8's is 4.7 KB.

Q4_K_XL and IQ4_XS
If you ignore the font choice, you will see that Q4_K_XL is the more complete solution, because it includes the board coordinates.

Q3_K_XL and Q3_K_M

IQ3_XXS
Now here's the interesting part: everything was mostly correct, the piece placements and the highlight, and there's the line on the last move!
But IQ3_XXS gets the board orientation wrong; see the light square on the bottom left?

Q2_K_XL
This is just a waste of time. But hey, it got all the piece positions right. The board is just not aligned at all.

SO, WHAT DO I USE?
I know a single test is not enough to draw any conclusions here. But personally, I will never go for anything below IQ4_XS after this test (I had bad experiences with Q3_K_XL and below in other tries).
On my RTX 5060 Ti, I got around pp 100 tps and tg 8 tps for IQ4_XS with vanilla llama.cpp (q8 for both ctk and ctv, fit on). But with TheTom's TurboQuant fork, I managed to get up to pp 760 tps and tg 22 tps by forcing GPU offload for all layers (`-ngl 99`), which is quite usable.
llama-cpp-turboquant/build/bin/llama-server -fa 1 -c 75000 -np 1 --no-mmap --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence_penalty 1.0 -ctk turbo4 -ctv turbo2 -ub 128 -b 256 -m Qwen3.6-27B-IQ4_XS.gguf -ngl 99
The only downside is that I have to keep the context window below 75k and use turbo4/turbo2 for the KV cache quant.
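If you want a rough feel for why the context cap moves with the KV cache quant, here's a back-of-envelope sketch. The layer/head numbers below are placeholders, not the real Qwen3.6-27B config (read the real values from the model's config.json), and I'm approximating turbo4/turbo2 as 4-bit and 2-bit block quants with typical GGUF-style overhead:

```python
# Illustrative KV cache sizing -- architecture numbers are ASSUMED, not real.
layers, kv_heads, head_dim = 48, 8, 128   # placeholder config values
ctx = 75_000

# Approximate bytes per element including block overhead:
# f16 = 2.0, q8_0 = 34/32, 4-bit ~ 18/32, 2-bit ~ 10/32 (rough guesses).
bpe = {"f16": 2.0, "q8_0": 34 / 32, "turbo4": 18 / 32, "turbo2": 10 / 32}

for k, v in [("f16", "f16"), ("q8_0", "q8_0"), ("turbo4", "turbo2")]:
    per_token = layers * kv_heads * head_dim * (bpe[k] + bpe[v])
    print(f"K={k:6} V={v:6} -> {per_token * ctx / 2**30:5.2f} GiB at {ctx} ctx")
```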
Below are some examples of different KV cache quants.


You can see all the results directly here: https://qwen3-6-27b-benchmark.vercel.app/
Client_Hello@reddit
Gemma4 31B, Q4_K_M, and Q8_0 kv cache
5060 Ti 16GB + 2070 Super 8GB; llama.cpp with fit-target 256 gives 43k context, gen 16.5 tps, pulls 290 watts at the wall during gen.
uhuge@reddit
It has the top right army man wrong?
Insomniac1000@reddit
I also use Q4_K_M, but with Qwen 3.6 27B, Q8 K, Turbo 3 V, so I can squeeze in a 262,000 context limit (though I make sure to summarize the context once it reaches about 180,000 tokens). I need to try out this test.
slavap_@reddit
https://huggingface.co/DavidAU/Qwen3.6-27B-NEO-CODE-Di-IMatrix-MAX-GGUF/blob/main/Qwen3.6-27B-NEO-CODE-2T-OT-Q6_K.gguf
Shoddy-Tutor9563@reddit
interesting test, but I'm missing details:
- you compared different MODELs without saying a word about which quants you used
- you mention that your tests weren't just one-off observations, so I assume you did a series of tests - but again, no further details
bobaburger@reddit (OP)
Thanks for the feedback, but what made you think this post is about comparing different models, and which part of the post made you think I did not mention what quants were tested?
Yes, I did not list every generation result of each test. As others already mentioned, since the temperature is 0.6, the results are not always the same; they vary in different ways, and in this post I only recorded the screenshot of the most complete output for each. Think of it like a best-of-5 result.
Shoddy-Tutor9563@reddit
Hey! First of all, thanks for your reply.
> what made you think this post is about comparing different models
Your words :) You're saying "models A, B and C didn't pass my test, but D did it!" (not exactly, but that's the idea). And then you're showing us how various quants of model D perform.
My question is - when you abandoned models A, B and C as "not smart enough" - what quants did you use?
> think of it like best-of-5 result
Good! This is already something! How were the other 4 you didn't include? Were they passing your criteria or not? Ideally I wanted to see something like this: "out of 10 runs of quant XXX, 2 were brilliant - as below, 3 formally passed but had issues, 5 didn't pass at all".
Because if you just include the best run, it doesn't tell how consistent the model is - does it produce the same good/bad result all the time, or does it have consistency issues and produce good results (even on your one test) in only 20% of cases?
Joozio@reddit
IQ4_XS vs Q5_K_M is usually where I land for coding tasks. Q4_K_M saves VRAM but the formatting degradation shows up more in agentic use than in chat benchmarks. Anything requiring consistent structured output gets noticeably less reliable at Q4.
keen23331@reddit
daddywookie@reddit
Thanks for giving me a really interesting Saturday exploring all of this. I'm pretty low-spec (8GB VRAM), so I'm fairly limited. All of the smaller models (4-9B) were pretty useless. The best result came from Qwen3.6-35B-A3B, but with the temperature turned down to 0.2.
Basically, a lot of the other models were just overthinking everything. I tried rewriting the prompt in ChatGPT, using OpenCode, and even trying to create scripts instead of direct SVG generation, and I now have a lot of very weird chess boards.
ilikesmellytofu@reddit
> BF16 has everything I needed: right position...
Missing the f7 pawn. And this roughly lines up with my own testing with this exact prompt, as I was trying to verify your results. Every quant will sometimes output a great result, but also, a lot of the time, make one or two mistakes that aren't immediately obvious. For example, unsloth Q4_K_XL spat out a perfectly correct result in 4/10 runs in my testing. Q6_K_XL was marginally better at 5/10. But even BF16 had issues with consistently outputting a perfect result.
DOAMOD@reddit
My test: 3.6 27B, 3 runs each for Q4XL / Q5XL / Q6XL / Q8
Q4XL Run1 (not final run)
DOAMOD@reddit
Q8 Run 1
DOAMOD@reddit
Q8 Run 2
DOAMOD@reddit
Q8 Run 3
DOAMOD@reddit
Q6XL Run1
DOAMOD@reddit
Q6XL Run 2
DOAMOD@reddit
Q6XL Run3
DOAMOD@reddit
Q5XL RUN1 (not final)
DOAMOD@reddit
Q5XL Run2
DOAMOD@reddit
Q5XL Run3
DOAMOD@reddit
Q5XL Run3 runerrorsfix
DOAMOD@reddit
Q5XL RUN1 Autothinkfix1
DOAMOD@reddit
Q4XL Run1 autothinkfix final
DOAMOD@reddit
Q4XL Run2
DOAMOD@reddit
Q4XL Run3
Address-Street@reddit
Gemma 4 31B NVFP4, KV cache Q8_0, temp 1.0, top_k 64, top_p 0.95. Here’s my first try. All piece positions are correct, but the square pattern is messed up.
GGUF: CISCai/gemma-4-31B-it-NVFP4-turbo-GGUF.
fgp121@reddit
This is a really thorough and helpful eval - the chess puzzle approach is clever because it tests both reasoning and generation in a way that's hard to game. I recently came across a similar exhaustive benchmark of Qwen 3.6 27B done by Neo at heyneo.com if you want another data point to compare against.
Kaioh_shin@reddit
Qwen3.6-27B-NEO-CODE-HERE-2T-OT-IQ4_XS.gguf
kayox@reddit
What’s the hugging face url for this one?
Kaioh_shin@reddit
https://huggingface.co/DavidAU/Qwen3.6-27B-Heretic-Uncensored-FINETUNE-NEO-CODE-Di-IMatrix-MAX-GGUF/blob/main/Qwen3.6-27B-NEO-CODE-HERE-2T-OT-IQ4_XS.gguf
Pablo_the_brave@reddit
For me, turbo3/turbo3 is much better in this test than turbo4/turbo2... At least for https://huggingface.co/cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF
megakilo13@reddit
Very cool benchmark. And wow, Gemini Pro did it perfectly with the styling
wkernel@reddit
For experiment's sake, tried it with a coding agent (pi-agent, Qwen3.6-27B-UD-Q5_K_XL.gguf).
2 variants:
Thinking off: produced a Python script to figure out the board position first.
Thinking high: figured out the board position itself using reasoning. Found a mistake, then corrected itself.
The results were pretty similar. And they both gave a terminal output at the end with a nice Unicode rendering of the board, along with key observations about it.
mncharity@reddit
And a tastefully missing pawn at f7?
bobaburger@reddit (OP)
ohhh, good catch, I missed that. seems like i did not grab the best one from the run for this version.
MatthKarl@reddit
Nice test. I was trying to replicate that and ran it on 3 local models I have.
- GPT-OSS-120B failed. The SVG didn't load as some comments were malformed. Board orientation is fine though.
- Gemma-4-31B got the SVG correct with all figures right, including the highlighting. However, the figures are a bit small in the squares.
- Qwen-3.6-35B produced the nicest SVG, with nice figures filling the squares nicely. The pawn on e2 is missing though, and the numbering of the squares is offset by one. And it states "After 7. h4* - White to move".
Guess I should be using Gemma-4 a bit more now then, although it was the slowest at some 5.5 t/s.
mncharity@reddit
I tried one run of Qwen3.6-35B-A3B-UD-Q6_K with coding parameters and a 65k cache under pi, but suffixed the prompt with encouragement to be careful and double-check its work. The result was pretty, though verbose at 7.6k (CSS classes for square color, but not for size or positions). And it initially forgot a knight, only catching and fixing that when going back over the file to check it.
bobaburger@reddit (OP)
That sounds exactly like what I experienced with 35B; the results are nice and beautiful but always have errors.
Address-Street@reddit
What quants are you using for weights and KV cache? From what I know, Gemma is very sensitive to quantization.
MatthKarl@reddit
Uhm, I'm not 100% sure. I use the unsloth/gemma-4-31B-it-GGUF:UD-Q8_K_XL.gguf file from huggingface with q8_0 cache (I believe). Is that possible?
LocalAI_Amateur@reddit
Try https://github.com/spiritbuun/buun-llama-cpp you'll get more context out of it.
Interesting test. Thanks for sharing.
FatheredPuma81@reddit
https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4150413357 I would personally recommend that everyone wait until someone provides real benchmarks (like those shown in the link, not PPL or KLD) showing this fork's implementation is better than Q4_1.
draconic_tongue@reddit
https://gist.github.com/Enferlain/30f3aa5e7e94b0696276b492fa190529
FatheredPuma81@reddit
:o Any chance you'd be willing to run Turbo3 and Q4_1?
draconic_tongue@reddit
Yeah, I'll run turbo3, but this model is very annoying with unlimited context, it takes hours. Idk about q4_1 though, don't really see the point. q4 should perform similarly to tq4.
FatheredPuma81@reddit
Sorry, I didn't have time to look at more than the AIME results. The purpose would mostly be a sanity check? q8 and tq4 managing to appear lossless is one thing, but if tq3 and q4_1 show the same lossless result in the benchmark, then something is wrong (with what, idk).
It's surprising that turbo4 is worse than q4_0 in the KLD and PPL tests though.
LocalAI_Amateur@reddit
Sound advice, for sure. But if we were people of patience, we would not be here compiling llama.cpp forks and trying to squeeze out every last bit of room for context.
I say, use it and test it. No amount of benchmarks can replace how it performs in the real world.
bobaburger@reddit (OP)
thanks, i will take a look. i wonder what's the difference between this and TheTom's fork. i've seen this mentioned a couple of times before.
Ell2509@reddit
How did you get it to generate images?
I am trying to use it on Linux, and multimodal will not work. Using the text-only version from hauhau works, and I have not asked about images because it supposedly does not have the image part.
Also, what UI are you using? I don't think I could find it. Are you doing straight curl from the command line? Or using something like Open WebUI?
bobaburger@reddit (OP)
it’s llama.cpp web UI. the model generate SVG code and then i copy it to save as a file, its not generating images directly
ElChupaNebrey@reddit
It's crazy how bad the 35B MoE model is.
pftbest@reddit
It's not bad, I think the OP set the temperature a bit low.
uptsi@reddit
Tried to one-shot this in Claude using Opus 4.7 and 4.6... The results are worse than the Qwen MoE 35B.
moahmo88@reddit
mudler/Qwen3.6-35B-A3B-APEX-GGUF I-Balanced can SOLVE IT.
External_Dentist1928@reddit
Even the full precision model couldn’t according to OP…
pftbest@reddit
not true, I get a correct result with 35B-A3B even at 4 bits, every time. Maybe there is some problem with the temperature parameters set by OP. For example, for Gemma4 the manual says the temperature must be set to 1.0; I suspect that's why it failed the test.
pftbest@reddit
The moe model generated the board correctly, even at 4 bits
unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_XL
Running on integrated graphics 780M at 14 tg/s
MotokoAGI@reddit
deepseekv4flash-UD-Q2
Evgeny_19@reddit
Very interesting test, thank you!
I think something is off with unsloth Q8. Here is the result of Q8_K_XL
Evgeny_19@reddit
This one is from Q5_K_XL
bobaburger@reddit (OP)
yeah, looks like Q5 always produces results with the same appearance as Q8, i cannot tell why.
Evgeny_19@reddit
And Q6_K_XL looks almost perfect
MyOtherBodyIsACylon@reddit
If you’re able to run vllm, I’d be very curious to know how the cyankiwi AWQ BF16 INT4 does:
https://huggingface.co/cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4
bobaburger@reddit (OP)
Thanks. I will try. But I'm afraid it's gonna be too big for my card :D
MyOtherBodyIsACylon@reddit
Now that we have a result for this model, would you be willing to add it to your original post? This comparison is super helpful :-)
bobaburger@reddit (OP)
i'll try to get some time to update the post accordingly
dondiegorivera@reddit
Here you go:
Details: 2xRTX3090, 128GB DDR4, Ryzen 9 5950x - MTP=3, ~78 tok/s
vllm serve /data/models/Qwen3.6-27B-AWQ-BF16-INT4 --host 0.0.0.0 --port 8000 --served-model-name qwen3.6-27b-awq --tensor-parallel-size 2 --max-model-len 131072 --gpu-memory-utilization 0.92 --max-num-seqs 2 --max-num-batched-tokens 8192 --enable-chunked-prefill --enable-prefix-caching --mm-encoder-tp-mode data --mm-processor-cache-type shm --trust-remote-code --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
MyOtherBodyIsACylon@reddit
Thanks for running this.
Looks like it skipped adding the board coordinates? Otherwise it appears like a combination of BF16 and Q8_0.
dondiegorivera@reddit
Yes, it seems to be very strong. I also tried with thinking disabled; the output degraded a lot.
dondiegorivera@reddit
I just deployed that exact model on my rig. Will give this test a try.
audioen@reddit
Single-shot tests are not very useful for grading models, except in the coarsest terms. The model's output is probabilistic, and you would need to get its "average output" in order to truly measure what the quantization damage is. This involves making a dozen or so outputs per quant per model, somehow grading them to identify what the "average" is, then comparing the average output of every model against the others.
With a single shot, you might randomly get a high-quality output that is somewhere in, say, the 90th percentile of the model's ability spread and end up comparing it against a 10th-percentile output of another quant; this is probably enough to flip the ordering and render the results misleading. Single-shot tests like these can reliably tell apart only very different quality or ability levels, and there is no obvious way of ordering the results other than inspecting them visually and checking whether things are arranged properly, appropriately sized, and all the requested features are present.
I'd recommend that you instead make the model just do math, like compute arithmetic that involves summing twenty 1-2 digit integers. This is something where you can repeat the test many times, can grade it automatically for correctness, as the answer is easy to verify, and the difficulty can easily be adjusted by making the numbers bigger and the number of terms larger, in case all models are scoring 100%.
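A minimal sketch of what that auto-graded arithmetic test could look like (assuming a local OpenAI-compatible llama-server endpoint; the helper names and grading rule here are illustrative, not audioen's actual setup):

```python
import random
import re
import requests

URL = "http://localhost:8080/v1/chat/completions"  # local llama-server assumed

def one_trial(n_terms=20, max_val=99):
    # Generate a fresh random sum so the answer can't be memorized.
    nums = [random.randint(1, max_val) for _ in range(n_terms)]
    prompt = f"Compute {' + '.join(map(str, nums))}. Reply with the final number only."
    resp = requests.post(URL, json={
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
    })
    text = resp.json()["choices"][0]["message"]["content"]
    # Grade automatically: the last integer in the reply must equal the true sum.
    found = re.findall(r"-?\d+", text.replace(",", ""))
    return bool(found) and int(found[-1]) == sum(nums)

n = 50
print(f"{sum(one_trial() for _ in range(n))}/{n} correct")
```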
bobaburger@reddit (OP)
thanks for the feedback. maybe the wording in the post makes it confusing. this is a single test, but for each model i generated about 5 different results. so it's like 5 shots.
anobfuscator@reddit
How was the test/retest consistency?
bobaburger@reddit (OP)
the styling of the pieces varies for the same model between runs, but the placement and board patterns are always the same.
FatheredPuma81@reddit
Tbh all this post has done is reinforce my belief that 4-bit is the sweet spot, that 3-bit is very usable (despite what many say), and that beyond 5-bit you're better off upgrading your model (if possible).
I'm sure this won't do anything about those who get upset when you compare much larger models at 3-bit (122B UD-Q3_K_XL) to smaller models at 4-bit (35B IQ4_NL).
Monad_Maya@reddit
This can be model dependent I believe. Some models don't respond too well to quantization.
FatheredPuma81@reddit
I know very little about the creation of LLMs, but surely models that respond poorly to quantization are a thing of the past, right? At this point model quantization is so huge that all they're doing is shooting themselves in the foot by not testing for this at a smaller scale before committing to training a new model.
MerePotato@reddit
Gemma 4 doesn't take well to quantization below 6 bit, particularly the MoE
Monad_Maya@reddit
Not always, for example - https://x.com/SkylerMiao7/status/2004887155395756057
Gemma4 also feels quite sensitive to quantisation. I haven't performed any exhaustive testing, but you can notice the difference in tool calls, especially if the KV cache is also heavily quantized.
gpalmorejr@reddit
This is part (but admittedly only part) of the reason I didn't stick with Gemma 4 very long. MoE doesn't offload as well, and the lower quants of Gemma 4 that I can use act weird and different. Whereas I can quantize Qwen3.5/3.6 down to IQ2 and, for most simple stuff, get an almost identical answer. For more complex stuff I can bump quants but keep 35B for the faster speed. If I need something mega accurate, I can run 27B, but because I'm mostly on CPU offload with it, I either have to use 27B at like IQ2 to get even 3.9 t/s, or I have to use a smaller model anyway. But the smaller Qwen models also show identical responses for simple prompts and stray from each other only by a small amount that seems to grow roughly linearly with the complexity of the request. So Qwen provides an ecosystem that makes it super easy to just load the model you need for a particular project and run it, like grabbing a CD from a shelf and putting it into your stereo. So I just always stuck with Qwen, even though I tried the others.
kombersninja2@reddit
Deep.
halbritt@reddit
Interestingly, I got better throughput with IQ4_XS than with Q3_K_M. I thought the 3-bit quant would be faster and I could give up some quality, but with ik_llama.cpp, I got ~167 tok/s with IQ4_XS and ~100 tok/s with Q3_K_M.
Fit_Split_9933@reddit
Here's a pure version of IQ4, smaller than the regular IQ4. Perhaps you could test it:
https://huggingface.co/Ununnilium/Qwen3.6-27B-IQ4_XS-pure-GGUF
bobaburger@reddit (OP)
Thanks! I will take a look.
Pablo_the_brave@reddit
Could you also look at this? Also smaller, but it should be much better than pure: https://huggingface.co/cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF
TheBlueMatt@reddit
BF16 is missing the pawn on f7
hidden2u@reddit
am i crazy or did q3_k_m totally ace it?
bobaburger@reddit (OP)
for this test, kinda. in my other tries, it's not always as good. but if VRAM is tight, Q3_K_M could be a decent choice.
kombersninja2@reddit
This was very insightful, thank you very much.
Monad_Maya@reddit
Nice work, IQ4_XS is a good balance I feel. Works fine with q8 KV cache.
bobaburger@reddit (OP)
i feel the same way. even at turbo4 kv, it's still very usable.
Monad_Maya@reddit
Never tried lower than q8 KV or any turbo quant. I saw some tool-call failures even at q8, but no empirical testing was performed, just vibes.
Ok-Measurement-1575@reddit
Needs a tldr
tavirabon@reddit
TLDR: only Qwen 3.6 is good at drawing SVG, and somehow only bf16, q8, or q4, not q5/q6.
-Ellary-@reddit
Right now I like to use IQ4_XS and IQ3_XXS for simple tasks that need speed and context.
IQ4_XS is a nice balance of size/performance.
IQ3_XXS is basically a Q2-size quant, but the performance is way better.
So it is like the `Daniel and cooler Daniel` meme.
bobaburger@reddit (OP)
I think it’s fair to put IQ3_XXS somewhere with Q3_K_S rather than Q2 :D
-Ellary-@reddit
I mean that Q2K is 12.6 GB and IQ3XXS is 13 GB.
Q3KS is 14.3 GB, way bigger.
DeepV@reddit
Way more unique than the pelican svg test.
Any plans on testing Prismascout? https://huggingface.co/rdtand/Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm
Azurasy@reddit
Qwen3.6-27B-int4-AutoRound
taoyx@reddit
I'm working on chess and LLMs, so this is very interesting, thanks. I didn't even think about asking for SVG output.
-ScaTteRed-@reddit
Good job dude👍
bobaburger@reddit (OP)
Thanks!
ahmcode@reddit
Very cool way to test; in my opinion it's relevant! I will use the SVG generation idea to complement my "sudoku test" 😁
bobaburger@reddit (OP)
that could work! it's way more complex than this anyway. i bet there will be a lot of fan noise and hundreds of thousands of thinking tokens spent. :D
marscarsrars@reddit
This is amazing thank you
bobaburger@reddit (OP)
Thank you much
marscarsrars@reddit
Really, this may not seem like much to you, or maybe you are just humble, but there is a lack of dynamic testing.
Current benchmarks have become nearly useless, as they have become targets rather than something to aspire to.
Consumerbot37427@reddit
I really like how the evaluation is completely objective: are the pieces in the right place? Is the board oriented correctly?
Probably not a great benchmark to compare completely different models, but for the purpose of comparing quants, it's great!
marscarsrars@reddit
Nevertheless a unique approach perhaps this can become a new benchmark.
NineThreeTilNow@reddit
That's a great quote.
You're looking for something that falls totally out of distribution.
bobaburger@reddit (OP)
exactly :)) why would any LLM train for anything like that on purpose? so it must be safe.
mfudi@reddit
That's awesome, thank you! Would be interesting to see the same for gemma4 variants
bobaburger@reddit (OP)
thank you!
cleversmoke@reddit
Amazing work! I really love this type of analysis, thank you! With this, I'll stick with Q5_K_M at 112k ctx and Q5_K_CL at 96k ctx. I noticed anything after ~90k ctx degrades so much with q8_0 KV cache.
bobaburger@reddit (OP)
thank you! and at around 90k and up, i saw the speed drop a lot too.
RIP26770@reddit
That's the kind of benchmark we are all craving 😂! Thanks for sharing, bro.
bobaburger@reddit (OP)
thank you!
NoPresentation7366@reddit
Brilliant post, thank you so much for this!
bobaburger@reddit (OP)
tysm!
Address-Street@reddit
Which quant did you use for Gemma 31B?
bobaburger@reddit (OP)
I used the API from OpenRouter. Could be BF16.
My_Unbiased_Opinion@reddit
I've been using UD IQ3XXS with 262K context. It's been great. It's far better than IQ4XS 35B with the same context. Q3 dynamic quants are pretty damn good.
bobaburger@reddit (OP)
Yes, I was using it for a month until I realized I can still get around 20 tps with IQ4_XS on my card.
Happythen@reddit
I bet that took some time to setup and run, thanks for that! Really interesting challenge for the different quants.
bobaburger@reddit (OP)
Thanks. I had to rent a cloud GPU for the big quants, but it’s fun. :D
marscarsrars@reddit
Mate, I can hook you up with some speed if you want to run some more tests like this.
As long as you keep posting results, and you can do a few fine-tunes for me, I won't charge anything :)
bobaburger@reddit (OP)
thank you so much!!! tbh i don’t do this regularly but the next time, i will definitely reach out!!!
Tartarus116@reddit
Awesome! We need more quant-level comparisons; KLD scores alone are not enough.
autonomousdev_@reddit
used q6_k for my coding agent setup, and honestly the speed difference from q4 was barely there, but it handled complex multi-step prompts way better. iq3_xxs just hallucinates function calls nonstop in my experience. went back to q5_k_xl for the agent pipeline i put together at agentblueprint.guide and it's a good middle ground.
Consumerbot37427@reddit
Thanks for this!
Tried Qwen 3.5 397B @ IQ2_XXS and it had all kinds of mistakes.
Qwen 3.6 27B GGUF @ 8-bit was good, but the exact same model in MLX had multiple mistakes.
I've always suspected MLX models have quality issues and have avoided using them. This test seems to confirm that, albeit I've only run each once so far. With this model, MLX is a bit slower, too (15 tps vs 17), so it's lose-lose.
ClearApartment2627@reddit
I wonder why Q6K fails to render the e2 pawn while lower quants get it right. Sure, the model is probabilistic, but OP wrote that he ran the tests several times.
flarenz@reddit
I used GPT Image 2.0:
Eyelbee@reddit
Great test, honestly. I'd be interested in making a spatial chess understanding benchmark; it might be a good idea. We could create a chess-moves dataset and have the model generate the final board state for every task, then score the accuracy. We could request an ASCII diagram or FEN notation to see if the models can understand the final board state from the moves alone, then check it deterministically. Could be a useful benchmark.
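A quick sketch of that idea, using random legal games generated with the python-chess library and graded by exact FEN match (illustrative only):

```python
import random
import chess

def random_game(n_plies=14, seed=None):
    # Play random legal moves from the start position; random games are
    # very unlikely to appear in training data, like OP's sub-300-Elo line.
    rng = random.Random(seed)
    board = chess.Board()
    moves = []
    for _ in range(n_plies):
        legal = list(board.legal_moves)
        if not legal:
            break
        move = rng.choice(legal)
        moves.append(board.san(move))  # record SAN before pushing
        board.push(move)
    return moves, board.fen()

moves, true_fen = random_game(seed=42)
prompt = f"Moves: {' '.join(moves)}. Output only the final FEN."
# Deterministic grading: score = (model_answer.strip() == true_fen)
print(prompt)
print("Expected:", true_fen)
```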
FrozenFishEnjoyer@reddit
As someone with a 5070 Ti, what do you suggest I use? Also, that turbo quant looks interesting, but can't you do that `-ngl 99` flag with normal llama.cpp?
bobaburger@reddit (OP)
yes, you can do it with q4/q4 for the KV cache instead of turboquant, but I found the quality was worse than turbo4/turbo2 (you can see the last screenshots in the post). the 5070 Ti is faster than mine, so I think you will get 25 or 30 tps.
Fedor_Doc@reddit
What llama.cpp version was that? Rotation (same principle as turboquant) was recently added to k/v cache quantization by default. I was under the impression that it should be roughly comparable.
grumd@reddit
Attention rotation is only a part of what makes turboquant work, so theoretically those turboquant forks can still be better than default llama.cpp.
bobaburger@reddit (OP)
i recompiled 5 days ago. so yeah, with attn rotation.
Flylink2@reddit
Normal llama.cpp doesn't support turboquant, if I'm not wrong... so -ngl 99 will give an out-of-memory error when you try to load qwen27b into your 16 GB VRAM GPU with this context length.
bobaburger@reddit (OP)
you can decrease the batch size and ubatch size to get some more room too.
mission_tiefsee@reddit
Interesting. But why didn't you test Q4_K_M?
bobaburger@reddit (OP)
mainly because there's no way i can run anything larger than IQ4_XS on my 5060 Ti. and on my cloud L40S node, it's faster to just try Q4_K_XL and up.
INT_21h@reddit
Whose quants did you use? Unsloth, Bartowski? This IQ4_XS popped up the other day and it's what I use on my 5060 Ti. https://huggingface.co/cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF
bobaburger@reddit (OP)
i use Unsloth. Bartowski's was faster than Unsloth on my M2 laptop, but on the 5060, Unsloth is faster for me.
INT_21h@reddit
Here's what cHunter789's IQ4_XS managed.
bobaburger@reddit (OP)
wow. that looks great
twack3r@reddit
Thanks for putting in the work!
Did you test model quantisation vs KV cache quantisation? I have personally become far more reluctant to use anything other than 16-bit for the KV cache. I keep that as a constant and treat the quant as the variable to match my context demand and VRAM constraint.
bobaburger@reddit (OP)
thanks. these were done with a bf16 KV cache. additionally, for the 4-bit and 3-bit quants, I did try different KV cache quants as well.
roofkid@reddit
I love this! It's so cool to see everything so visually.
One thing I have been wondering: what would happen if you had a control/QA loop in place? I mean a prompt a little more elaborate than "look at this screenshot and fix any deviation from the original requirements". I would be very curious if there are quants that cannot arrive at the correct solution even with a feedback loop.
My thought is that one-shotting is awesome, but at the same time, with enough speed I would also be OK if it just takes a little longer, especially if you're VRAM constrained. Even on big-VRAM systems the lower quants are a lot faster, so I wonder whether the total time taken will actually be higher or lower in the end.
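A minimal sketch of such a repair loop (the prompt wording and loop logic here are illustrative, and it assumes the same local llama-server chat endpoint used in the post):

```python
import requests

URL = "http://localhost:8080/v1/chat/completions"  # local llama-server assumed

def chat(messages):
    r = requests.post(URL, json={"messages": messages, "temperature": 0.6})
    return r.json()["choices"][0]["message"]["content"]

task = (
    "Given this PGN string of a chess game:\n"
    "1. b3 e5 2. Nf3 h5 3. d4 exd4 4. Nxd4 Nf6 5. f4 Ke7 6. Qd3 d5 7. h4 *\n"
    "Figure out the current state of the chessboard, create an image in SVG "
    "code, also highlight the last move."
)
history = [{"role": "user", "content": task}]
svg = chat(history)

for _ in range(3):  # a few self-repair rounds instead of pure one-shotting
    history += [
        {"role": "assistant", "content": svg},
        {"role": "user", "content":
            "Re-derive the position from the PGN move by move, compare it to "
            "your SVG, and fix any deviation. Output the corrected SVG only."},
    ]
    svg = chat(history)

print(svg)
```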
bobaburger@reddit (OP)
yeah i’ve thought about this. for this particular example, maybe the models will eventually fix it correctly, depends on how you prompt it. maybe lower quants will take more turns, and it will break something along the way when fixing stuff.
but the quality difference will be reflected better if we use it for tasks like researching, planning. it’s likely they will miss out some important details or something similar.
MrPecunius@reddit
Very cool test and results presentation, thank you!
bobaburger@reddit (OP)
thank you!
Blues520@reddit
Great test to illustrate the accuracy visually
bobaburger@reddit (OP)
thanks!
jacek2023@reddit
Great work, congratulations on testing a real use case and various quants. I just hope you tested them multiple times.
bobaburger@reddit (OP)
thank you! yes. i did run multiple times for each test.
moahmo88@reddit
Good job! Thanks for sharing!
bobaburger@reddit (OP)
thanks!
rm_rf_all_files@reddit
Try bigcodebench. Wonder if you can go above 33.
bobaburger@reddit (OP)
33 tps? that’s the speed i got for iq3_xxs, but not for any quants above that.
rm_rf_all_files@reddit
no, the score
mr_Owner@reddit
Amazing test! Could you make a ranked list of your findings?
From what i understood, iq3_xxs is the best small one without breaking too much?
bobaburger@reddit (OP)
thanks. i tried to avoid drawing any conclusions because it's just one test.
iq3_xxs actually gets the board orientation wrong (the bottom-left square should be dark, not light), so i would not recommend it.
Raredisarray@reddit
Very interesting!! Thanks for sharing. I'll definitely stick with q8.
FoxiPanda@reddit
Full disclosure: I skimmed this because it's super long.
Did you run each test only once, or did you do multiple takes to get a sense of whether any one run was an outlier? I've found in general that one run is not enough to determine actual quality - you end up with statistical noise that can make you believe a result that is just not true (though I will say, looking through the images, there is the trend line in quality degradation that one would expect).
bobaburger@reddit (OP)
Yeah, I ran each test multiple times; that's why I noted the font choice in the post, because it varies, but things like piece positions and placement on the board are pretty much the same between runs.
FoxiPanda@reddit
Cool, just making sure. It's a neat experiment, thanks for running it (I am sure it took some time to do lol)
bobaburger@reddit (OP)
Thanks! took me about 3 days or so for this, but it was fun 😂