What's the best Qwen 27B Q8 quant?
Posted by EggDroppedSoup@reddit | LocalLLaMA | 13 comments
Everyone is talking about Q4, Q5, and Q6, but I've got some coding tasks that I feel like the lower quants kept getting wrong. I can run the Q8 from unsloth, but it feels a bit slow even with MTP on. Should I just resort to Q8 35B-A3B at this point?
CodeDominator@reddit
The best is the one that fits in your VRAM, meaning you need at least 32GB of VRAM to be comfortable. This is coming from someone with 24GB, unfortunately.
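A rough back-of-the-envelope sketch of why 32GB is the comfortable mark for 27B at Q8. The bits-per-weight figures are assumptions: Q8_0 and Q6_K follow from the llama.cpp block layouts, while the K_M values are approximate averages because those files mix quant types. The function names are made up for illustration, and KV cache and runtime overhead are not counted, so the real headroom is smaller than what this prints.

```python
# Approximate bits per weight for some llama.cpp GGUF quant types.
# Q8_0 and Q6_K come from their block layouts; the *_K_M numbers are rough
# averages (assumption), since those files mix several quant types.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.5625,
    "Q8_0": 8.5,
}


def weight_gb(params_billion: float, quant: str) -> float:
    """Estimated size of the weights alone, in GB (1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def headroom_gb(vram_gb: float, params_billion: float, quant: str) -> float:
    """VRAM left for KV cache and overhead after loading the weights."""
    return vram_gb - weight_gb(params_billion, quant)


if __name__ == "__main__":
    for vram in (24, 32):
        for quant in BITS_PER_WEIGHT:
            left = headroom_gb(vram, 27, quant)
            status = "fits" if left > 0 else "does not fit"
            print(f"{vram}GB VRAM, 27B {quant}: {left:+.1f} GB headroom ({status})")
```

A negative headroom means the weights alone overflow the card, before any context is allocated.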
tmvr@reddit
What hardware are you using? Do you have enough VRAM to fit in the Q8 version and the context?
xrmich@reddit
I'm running unsloth's Qwen 3.6 35B A3B Q8_0 MTP version, but there is something wrong with it. Before the MTP I was running HauhauCS's uncensored Q8 K_P, which is nearly impossible to break: I ran long agentic code flows and never managed to get it looping or broken. Now unsloth's MTP Q8_0 ends up looping and gets derailed easily. For example, I had it making a podman container setup and it completely lost it, decided that it wasn't working like it should, and started chasing other possibilities that make absolutely no sense. I'm going to run the same task with hauhau's later to validate my thoughts. I don't know whether it's unsloth's quantization or the MTP that's the problem, but there's a very noticeable degradation in the model's capability.
Oh, and to your original question: I tried 27B Q8 on my Strix Halo, but it's just too slow, and I didn't really notice any better quality compared to the 35B (hauhau's), which I run day and night.
mediaogre@reddit
Do you mean who makes the best Q8 version? The Q8 just stands for 8-bit quantization.
JournalistLucky5124@reddit
Can I run q2 on 4gb vram and 16gb ram :)?
Additional-Ordinary2@reddit
Q2 is dumb as fuck
JournalistLucky5124@reddit
So what should I use 🥺🥺
ortegaalfredo@reddit
Qwen published its own 8-bit (FP8) quant; I doubt you can do better than them.
https://huggingface.co/Qwen/Qwen3.6-27B-FP8
taking_bullet@reddit
Not by choice. We are VRAM poor. Most folks have up to 32GB VRAM, so it's impossible to run Qwen 27B Q8 with decent context.
looselyhuman@reddit
Yep, 32GB club here. I'm looking at building around this one plus a 128k context window:
https://huggingface.co/rdtand/Qwen3.6-27B-PrismaQuant-5.5bit-vllm
soyalemujica@reddit
I personally stick to Q5_K_M. There's no point in going higher for a mere 2% difference that you really won't run into, and if you do, just prompt a second time to fix it.
FoxiPanda@reddit
You are maybe conflating/mixing up two things - speed and accuracy.
Different quantizations will differ somewhat in speed, but not drastically from one to the next, so 27B-q4 and 27B-q5 will have similar speed. Accuracy is the thing that changes most between quantizations.
But switching from 27B --> 35B-A3B, you will notice a significant difference in speed: token generation will go up ~8-9x (simple math here: activated parameters go from 27B to 3B --> 27/3 = 9x faster). You will also likely see a significant difference in overall capability, but that depends on your use case.
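The 27/3 arithmetic can be sketched as a memory-bandwidth bound: if token generation is limited by how fast the active weights can be read, the ceiling is roughly bandwidth divided by bytes read per token. This is only a sketch under that assumption. The 256 GB/s bandwidth is an illustrative number, 8.5 bits per weight assumes Q8_0, the function name is made up, and KV cache reads, routing overhead, and compute limits are ignored, so real speedups will usually land below the ideal ratio.

```python
def decode_tps_upper_bound(active_params_billion: float,
                           bits_per_weight: float,
                           bandwidth_gb_s: float) -> float:
    """Rough ceiling on tokens/s if every active weight is read once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    BANDWIDTH_GB_S = 256.0  # illustrative assumption, not a measured figure
    Q8_0_BITS = 8.5         # Q8_0 block layout: 34 bytes per 32 weights

    dense = decode_tps_upper_bound(27, Q8_0_BITS, BANDWIDTH_GB_S)  # 27B dense
    moe = decode_tps_upper_bound(3, Q8_0_BITS, BANDWIDTH_GB_S)     # 35B-A3B, ~3B active

    print(f"27B dense ceiling: {dense:.1f} tok/s")
    print(f"35B-A3B ceiling:   {moe:.1f} tok/s")
    print(f"ideal speedup:     {moe / dense:.1f}x")
```

The ratio does not depend on the bandwidth or the quant, only on the active parameter counts, which is why the simple 27/3 estimate holds as an upper bound.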
You tell us nothing about your hardware, launch parameters, actual performance you're experiencing, or what the lower quants "kept getting wrong", or really anything to actually help diagnose your issue, so I'm not sure there's any useful answer to give.
Snoo_27681@reddit
Try this:
https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates