Is a heavily quantised Q235b any better than Q32b?
Posted by Secure_Reflection409@reddit | LocalLLaMA | 42 comments
I've come to the conclusion that Qwen's 235b at ~Q2K, perhaps unsurprisingly, is not better than Qwen3 32b Q4KL, but I still wonder about the Q3. Gemma2 27b Q3KS used to be awesome, for example. Perhaps Qwen's 235b at Q3 will be amazing? Amazing enough to warrant 10 t/s?
I'm in the process of cobbling together a mishmash of RAM I have in the cupboard to go from 96GB to 128GB, which should allow me to test Q3... if it'll POST.
Is anyone already running the Q3? Is it better for code / design work than the current 32b GOAT?
Sabin_Stargem@reddit
A thing to keep an eye on is Cognitive Computations' enlarged versions of Qwen3 32b that include a distillation of Qwen3 235b. Right now they have a checkpoint of Qwen3 58b, Stage 2. Hopefully the final versions of these 58b and 72b models will be worth using.
https://huggingface.co/cognitivecomputations/Qwen3-58B-Distill-Stage2
perelmanych@reddit
Unfortunately cognitivecomputations appears to be gone. Their HF and GH pages have been removed. 😒
Sabin_Stargem@reddit
They had a rebranding. They are now QuixiAI.
https://huggingface.co/QuixiAI/Qwen3-58B-Distill-Stage3
perelmanych@reddit
Good to know, thanx!
silenceimpaired@reddit
I hope they stick to Apache licensing
MaxKruse96@reddit
Qwen3 is extremely sensitive to quant for some reason, so the higher you go, the disproportionately better it gets. Testing specific quants of different sizes against each other is so insanely compute heavy that I don't think anyone does it.
Secure_Reflection409@reddit (OP)
We've had a few people do it here in the past with MMLU-Pro but I do wonder if there's a less compute intensive way to do it.
MMLU-Pro is arguably not a good enough proxy for codegen / design, either.
There's perhaps no way around burning millions of tokens, and if you're doing it yourself at home on your own kit, tens of hours of your time on top.
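One less compute-intensive angle might be subsampling the benchmark rather than running it in full. A rough sketch with lm-evaluation-harness; the gguf backend name, server URL and "mmlu_pro" task id below are assumptions from memory, so check your installed version:

```python
# Hedged sketch: score a llama.cpp-served quant on a slice of MMLU-Pro
# instead of the full benchmark. Backend name, URL and task id are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="gguf",                                  # backend that talks to a llama.cpp server
    model_args="base_url=http://127.0.0.1:8080",   # wherever your quant is being served
    tasks=["mmlu_pro"],
    limit=200,                                     # cap the number of questions per task
)
print(results["results"])
```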
DepthHour1669@reddit
Just use PPL, KLD and delta probs. Good ole barty made a good post on this a while back:
https://www.reddit.com/r/LocalLLaMA/comments/1jvlf6m/llama_4_scout_sub_50gb_gguf_quantization_showdown/
Don't trust PPL numbers, those are often weird, esp with gemma quants. MMLU and GPQA are the easiest full e2e benchmarks. Very compute heavy though.
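For reference, the math behind those numbers is cheap once you have per-token logits dumped from a full-precision run and a quantized run over the same text; llama.cpp ships a perplexity tool that does the heavy lifting. A minimal PyTorch sketch of what PPL and KLD actually measure here:

```python
# Minimal sketch: perplexity of the quantized model plus mean KL divergence
# from the full-precision reference, given per-token logits over the same text.
import torch
import torch.nn.functional as F

def ppl_and_kld(ref_logits, q_logits, target_ids):
    # ref_logits, q_logits: [n_tokens, vocab]; target_ids: [n_tokens]
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    q_logp = F.log_softmax(q_logits, dim=-1)

    # Perplexity of the quant on the true next tokens
    ppl = torch.exp(F.nll_loss(q_logp, target_ids))

    # Mean per-token KL(reference || quant): how far the distribution drifts from quantizing
    kld = F.kl_div(q_logp, ref_logp, log_target=True, reduction="batchmean")
    return ppl.item(), kld.item()
```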
Secure_Reflection409@reddit (OP)
Thanks for this.
Caffdy@reddit
Just leaving this here. I know it's not Qwen3, but I think it's relevant here as well.
TL;DR: Dynamic Quants can perform as well as Q4 with generous savings in memory.
a_beautiful_rhind@reddit
That's strange because they posted graphs on how it wasn't affected so much, down to even low quants.
DepthHour1669@reddit
That's weird. Link?
Secure_Reflection409@reddit (OP)
Quick update: only 5 t/s with my mismatched memory (2 x 48GB, 2 x 16GB), which is too slow, and it wasn't Prime95 stable. I get 11.7 t/s on the Q2.
My options appear to be:
Baldur-Norddahl@reddit
I am running Qwen3 235b at q3 on my 128 GB M4 Max MacBook Pro. It is the best model and the last resort before going cloud. But I would not call it amazing. It is no DeepSeek R1.
redoubt515@reddit
What speeds are you seeing with that setup?
Baldur-Norddahl@reddit
It is surprisingly fast at 20 tps.
Secure_Reflection409@reddit (OP)
Nice.
Which / whose quant are you using, exactly?
mxforest@reddit
Try DWQ. It is dynamic 3-6 bits. I run it on 40k context without a problem on my m4 max.
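If anyone wants to try it, loading a DWQ quant through mlx-lm looks roughly like this. The repo id is illustrative only, so browse mlx-community on HF for the actual Qwen3 235B DWQ upload that fits your RAM:

```python
# Hedged sketch for Apple silicon via mlx-lm; the repo id is a placeholder.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-235B-A22B-3bit-DWQ")  # assumed name, check HF
print(generate(model, tokenizer,
               prompt="Write a binary search in Python.",
               max_tokens=512, verbose=True))
```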
Baldur-Norddahl@reddit
I need 128k context for my use case. I am using RooCode and the standard system prompt eats up a lot of space, so models with 40k context feel too limited.
LA_rent_Aficionado@reddit
For real, the Cursor system prompt is 15k, I imagine Roo is similar
mxforest@reddit
Makes sense. I was mostly commenting on the most capable model that can be run on it with a usable context. 40k is plenty for a lot of purposes because prompt processing is ass anyway.
Baldur-Norddahl@reddit
The large system prompt gets cached and reused, so it is not so bad with regards to prompt processing speed.
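Roughly speaking, as long as the system prompt prefix stays byte-identical between requests, the server can reuse the KV cache it already built for it. A hedged sketch against a llama.cpp-style server; the endpoint and field names assume llama-server's /completion API, and other backends have their own equivalents:

```python
# Hedged sketch: keep the big system prompt identical so the server can
# reuse its cached KV for that prefix. Endpoint/fields assume llama-server.
import requests

SYSTEM = open("system_prompt.txt").read()  # hypothetical file, identical on every call

def ask(question: str) -> str:
    r = requests.post("http://127.0.0.1:8080/completion", json={
        "prompt": SYSTEM + "\n\nUser: " + question + "\nAssistant:",
        "cache_prompt": True,   # reuse cached KV for the unchanged prefix
        "n_predict": 512,
    })
    return r.json()["content"]
```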
EmergencyLetter135@reddit
Thank you for the kind advice. Do you know of a comparison between the GGUF and MLX versions of Qwen 235B? The background is that I had the subjective impression that none of the MLX models I tried could keep up with the output quality of the Unsloth quants. I even found Unsloth's Q2 better than a 3-bit MLX.
Caffdy@reddit
Can confirm, best model that fits in 128GB; R1 in dynamic quant needs over 140GB.
ZBoblq@reddit
What are the weak points?
djdeniro@reddit
You can check oobabooga.github.io/benchmark.html and get real data on this.
Secure_Reflection409@reddit (OP)
What's the benchmark on?
Healthy-Nebula-3603@reddit
No
Few-Yam9901@reddit
I think Unsloth Q6_K_XL 32b is better than 235b Q5?
tempetemplar@reddit
Q3 is still good for my use cases! I've even tried the insane exercise of using IQ2_XXS of Qwen3 32b. To keep the insanity down you have to use a lot of tools. Otherwise, well, you get totally insane results 😂
__JockY__@reddit
I run the official Qwen3 235B A22B INT4 GPTQ quant in vLLM using Qwen’s recommended settings.
It’s fabulous for coding and technical work. I love it. Destroys Qwen2.5 72B 8bpw exl2 in all my use cases.
However, it drops off quickly at larger contexts. Once you get past ~16k it gets significantly dumber, makes syntax mistakes, etc. Close to 32k tokens it’s pretty bad.
But working inside that first 16k feels like I have a SOTA model right next to me. Fantastic.
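For anyone wanting to reproduce the setup, it's roughly this via vLLM's offline Python API. The repo id and parallelism are assumptions for illustration, and the sampling values are the commonly recommended Qwen3 thinking-mode settings; check the model card for your exact quant:

```python
# Rough sketch, not the exact config from the comment above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-235B-A22B-GPTQ-Int4",  # assumed repo id, check HF
    tensor_parallel_size=4,                   # adjust to your GPUs
    max_model_len=32768,
)
params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=2048)
out = llm.generate(["Refactor this function to be iterative: ..."], params)
print(out[0].outputs[0].text)
```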
No_Shape_3423@reddit
I've tested a number of local LLMs and quants for document work with long, detailed prompts that generate or grade long documents placed in the ctx (4x3090). My observation generally is that the impact of quantization is undersold. It may be fine for your use case, but not for mine. The first thing to go is IF (instruction following). BF16 is better than Q8, which may or may not get it done. By the time I get to Q4, IF becomes useless for my workflow. Qwen3 235b Q3KL could not IF well enough to be useful. FWIW the consistent winners on my rig with sufficiently long ctx were Qwen3 32b BF16/Q8, QwQ BF16/Q8, and Qwen2.5 72b (Athene v2) Q8. Llama 3.3 70b Q8 would IF, but even at Q8 it didn't have enough smarts to be useful. Qwen3 30b A3B BF16 at 128k ctx is my daily driver.
Secure_Reflection409@reddit (OP)
That's great info, thanks.
DemonsHW-@reddit
I used different Q5 and Q4 quants and they were extremely bad for code generation. They would produce a lot of syntax errors in the generated code and go into an infinite loop generating random tokens.
Even DeepSeek-R1 with the TQ1_0 quant performed better.
Lissanro@reddit
Qwen3 is an MoE trained at 16-bit precision, which makes it quite sensitive to quantization - more so than DeepSeek R1, which, even though it is also an MoE, was trained at FP8 precision. (MoE models are more sensitive to quantization in general because they only use part of their parameters at a time, unlike dense models.)
I cannot recommend going below IQ4 even with R1 because I notice quality degradation beyond that point (I downloaded the original FP8 version of R1 and tested a few quants: IQ3, IQ4 and Q8), and for Qwen3 I would recommend at least Q6 or Q8. This is actually the main reason I ended up not using it much beyond some testing... At Q8 it is still behind R1 IQ4_K_M in many areas, including general coding, creative writing and agentic workflows, while not being much faster. So I just use R1 0528 as my daily driver.
If you are memory limited but still have enough to run Qwen3 235b at Q2K, then Qwen3 32b at Q8 may be a good alternative, especially if you do programming and need the best accuracy. The new Mistral Devstral 2507 24B may be another alternative to try if you are looking for a lightweight model.
smflx@reddit
I agree. R1 is better and not much slower in token generation, but prompt processing of Qwen3 235B Q4 is quite a bit faster.
In my testing, Qwen3 235B Q4 was also better than Qwen3 32B Q8.
Karim_acing_it@reddit
I run the IQ4_XS on 128GB DDR5 at a mere 3 t/s in LM Studio; just made a post about it. Do you have any specific questions? I personally couldn't see a big increase in quality from Qwen 32B to Qwen 235B in my very initial testing, but those were mostly generic prompts too.
Secure_Reflection409@reddit (OP)
Awesome!
The more people we get testing this the better. Slow internet connection here, so bear with me :)
a_beautiful_rhind@reddit
I can use it at Q4. It seems samey between that and exl3 3.0bpw, where I can fully offload it. The API on OpenRouter gets a couple of things slightly more right, but that's about all.
You're running into the MoE problem of low active params. For me, around ~30b is the cutoff where models start getting decent. 235b is not quite there but almost; the other params don't necessarily make up for it. Training data between them all was similar, so you get cases where the 32b does comparably. You wonder what all that extra memory is for.
Even DeepSeek makes ~30b-style mistakes. It truly has a lot of knowledge in those params, so it's less likely. These smaller MoEs don't have that luxury, or as good training data. 235b has all the STEM stuff but not all the intelligence. Code and conversations need the latter; they force the model to generalize.
That we're even having this conversation shows it's not a free lunch of "model go fast".
Red_Redditor_Reddit@reddit
I use the Q2 from Unsloth. It's better than the 32b, but it also uses like 6x the memory the 32b @ Q4 does. The main advantage is speed. If it weren't for that, there's more bang for your RAM with other dense models.
Zestyclose_Yak_3174@reddit
It feels like even Unsloth's Q2_K is quite decent. I do think it has more to pull from and is rooted in more real-world knowledge vs Qwen3 32B; however, the difference might become really noticeable when accuracy is a must: coding, classification, etc.
uti24@reddit
I tried Qwen 235B at Q2_K; it's definitely not lobotomized at that point and performed quite well in my test.
I guess it might even be better than the 32B at Q8, but that would require a deeper comparison, which I haven’t done.