Difference between Qwen 3.6 27b quants for vLLM
Posted by Blues520@reddit | LocalLLaMA | 11 comments
Hi guys, I am trying to understand the difference between these quants for running on dual 3090s.
First there is the official FP8: https://huggingface.co/Qwen/Qwen3.6-27B-FP8
Then I see this 6-bit AWQ: https://huggingface.co/QuantTrio/Qwen3.6-27B-AWQ-6Bit
And I see cyankiwi also has a quant up: https://huggingface.co/cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4
They are all similar sizes, so I'm unsure which to select. What is BF16-INT4, and will it run faster on Ampere but be less accurate than FP8?
cheabred@reddit
Following, because I'm also running FP8 on dual 3090s at the moment... but I wanted to switch. How did it go?
Blues520@reddit (OP)
I ended up using a Q8 GGUF, which works decently. I would think the FP8 would be as good, if not better. What context size are you running?
cheabred@reddit
About 128k roughly
Blues520@reddit (OP)
Yeah, same on dual 3090s
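For anyone wanting to try that setup in vLLM, here is a minimal sketch of a `vllm serve` launch split across two GPUs with a 128k context window. The model ID is the official FP8 repo linked above. The helper function is hypothetical, the memory-utilization value is an assumption to tune, and whether this fits in 2×24 GB depends on KV-cache settings not discussed in this thread.

```python
import shlex


def build_vllm_command(model: str, num_gpus: int = 2, max_len: int = 131072,
                       gpu_mem_util: float = 0.90) -> list[str]:
    """Build a `vllm serve` command line (illustrative helper, not part of vLLM)."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(num_gpus),        # one shard per GPU
        "--max-model-len", str(max_len),                # 128k-token context
        "--gpu-memory-utilization", str(gpu_mem_util),  # assumed value, tune as needed
    ]


cmd = build_vllm_command("Qwen/Qwen3.6-27B-FP8")
print(shlex.join(cmd))
```

Swapping in one of the AWQ repos is just a matter of changing the model ID; vLLM normally reads the quantization method from the checkpoint's config.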
cheabred@reddit
Are you running FP8? I'm getting roughly 40 tokens/s, I think (if I'm reading the vLLM output right), but when it's reading code it does around 1400 tokens/s 🤷‍♂️. Seems like Claude-level speed, though not quite the quality. But for what I do it's good enough.
Blues520@reddit (OP)
I'm running a Q8 with llama.cpp, which is good enough, so I haven't tried FP8.
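The two speed figures quoted above measure different phases: vLLM logs prompt (prefill) throughput and generation (decode) throughput separately, so roughly 1400 tokens/s while reading code and roughly 40 tokens/s while writing output are not contradictory. A rough sketch of what that means for a single request, using those two numbers; the prompt and output lengths are made-up examples, and batching and prefix caching are ignored.

```python
def request_seconds(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float = 1400.0,
                    decode_tps: float = 40.0) -> tuple[float, float]:
    """Return (prefill_seconds, decode_seconds) for one request."""
    return prompt_tokens / prefill_tps, output_tokens / decode_tps


# Hypothetical request: 28k tokens of code in, 1k tokens out.
prefill_s, decode_s = request_seconds(prompt_tokens=28_000, output_tokens=1_000)
print(f"prefill: {prefill_s:.0f} s, decode: {decode_s:.0f} s")
# prints "prefill: 20 s, decode: 25 s"
```

So even with a long prompt, generation can dominate the wait, which is why the decode figure is usually the one people compare.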
DeltaSqueezer@reddit
Go for the cyankiwi one; he keeps the linear layers in BF16, which makes a huge difference in output quality.
Glittering-Call8746@reddit
Perplexity vs. the INT4?
Blues520@reddit (OP)
I was wondering why that model has both BF16 and INT4 in the name but I think I understand now. Thanks!
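One way to see why a checkpoint labeled BF16-INT4 can end up close in size to a 6-bit or 8-bit quant: if part of the weights stay in BF16 (16 bits) and the rest are INT4 (4 bits), the average bits per weight lands somewhere in between. A small sketch of that arithmetic; the 25% BF16 fraction is purely an illustrative assumption, not a measured property of the cyankiwi checkpoint, and quantization scales and zero-points are ignored.

```python
def effective_bits(bf16_fraction: float, low_bits: int = 4,
                   high_bits: int = 16) -> float:
    """Average bits per weight when a fraction of weights stays in BF16."""
    if not 0.0 <= bf16_fraction <= 1.0:
        raise ValueError("bf16_fraction must be between 0 and 1")
    return bf16_fraction * high_bits + (1.0 - bf16_fraction) * low_bits


avg = effective_bits(0.25)  # hypothetical: a quarter of the weights kept in BF16
print(f"~{avg:.1f} bits per weight on average")
# prints "~7.0 bits per weight on average"
```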
Tormeister@reddit
Relevant: thread
pulse77@reddit
General rule: more bits (=bigger file size) is better.
For general tasks the difference between 6-bit and 8-bit is very small, but for precise coding it matters.
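The "more bits = bigger file" rule can be put into rough numbers: weight size ≈ parameters × bits per weight ÷ 8. A back-of-the-envelope sketch, assuming a flat 27 billion parameters and ignoring quantization scales, any layers left unquantized, and the KV cache, so real file sizes will differ.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


for name, bits in [("BF16", 16), ("FP8", 8), ("6-bit", 6), ("INT4", 4)]:
    print(f"{name:>5}: ~{weight_gb(27, bits):.1f} GB")
```

By this estimate, 8-bit weights alone take up more than half of the 48 GB on dual 3090s, before any context is allocated.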