Difference between Qwen 3.6 27b quants for vLLM

Posted by Blues520@reddit | LocalLLaMA | 11 comments

Hi guys, I am trying to understand the difference between these quants for running on dual 3090s.

First there is the official FP8: https://huggingface.co/Qwen/Qwen3.6-27B-FP8

Then I see this 6-bit AWQ: https://huggingface.co/QuantTrio/Qwen3.6-27B-AWQ-6Bit

And I see cyankiwi also has a quant up: https://huggingface.co/cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4

They are all similar sizes, so I'm unsure which to select. What is BF16-INT4, and will it run faster on Ampere but be less accurate than FP8?

[-]

cheabred@reddit

Following, because I'm also running FP8 on dual 3090s at the moment, but I wanted to switch. How did it go?

[-]

Blues520@reddit (OP)

I ended up using a Q8 GGUF, which works decently. I would think the FP8 would be as good, if not better. What context size are you running?

[-]

Are you running FP8? I'm getting roughly 40 tokens/s, I think (if I'm reading the vLLM output right), but when it's reading code it does around 1400 tokens/s 🤷‍♂️ So Claude-level speed, though not quite Claude-level quality. For what I do, it's good enough.

[-]

Blues520@reddit (OP)

I'm running a Q8 with llama.cpp, which is good enough, so I haven't tried FP8.

[-]

DeltaSqueezer@reddit

Go for the cyankiwi one; he keeps the linear layers in BF16, which makes a huge difference in output quality.

[-]

Glittering-Call8746@reddit

How does the perplexity compare vs. the plain INT4?

[-]

Blues520@reddit (OP)

I was wondering why that model has both BF16 and INT4 in the name, but I think I understand now. Thanks!

[-]

Tormeister@reddit

Relevant: thread

[-]

pulse77@reddit

General rule: more bits (=bigger file size) is better.

For general tasks, the difference between 6-bit and 8-bit is very small, but for precise coding it matters.
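
To put rough numbers on the "more bits = bigger file" rule, here is a back-of-the-envelope sketch of weight size at different bit widths. It assumes about 27 billion parameters (read off the "27B" in the model name) and a single average bit width for every weight. Real checkpoints will differ: quantization scales take extra space, and mixed-precision quants keep some tensors at higher precision, which may be why the three repos above end up at similar sizes. The function name `weight_gib` is just an illustrative helper.

```python
# Rough weight-only estimate: ignores KV cache, activations, and runtime overhead.
PARAMS = 27e9  # assumption: parameter count taken from the "27B" in the model name


def weight_gib(bits_per_weight: float, params: float = PARAMS) -> float:
    """Approximate size of the weights in GiB at a given average bit width."""
    bytes_total = params * bits_per_weight / 8
    return bytes_total / 2**30


if __name__ == "__main__":
    for label, bits in [("BF16", 16), ("8-bit", 8), ("6-bit", 6), ("4-bit", 4)]:
        print(f"{label:>6}: ~{weight_gib(bits):.1f} GiB")
```

By this estimate, 8-bit weights come to about 25 GiB and 4-bit weights to about 12.6 GiB, so on two 24 GB cards the lower-bit quants mostly buy extra room for context rather than deciding whether the model fits at all.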