Heavily quantized Q2 GLM5 vs less quantized Q8 minimax 2.5/Q4 Qwen3.5 397b?
Posted by ImpressiveNet5886@reddit | LocalLLaMA | 9 comments
How would say the quality compares between heavily quantized versions of higher parameter giant models like GLM-5-UD-IQ2_XXS (241GB size) vs similarly sized but less quantized and fewer parameter models like MiniMax-M2.5-UD-Q8_0 (243GB) or Qwen3.5-397B-A17B-MXFP4_MOE (237GB)?
__JockY__@reddit
Never use Q2 for anything you care about; Q2 models start out bad and become utterly dreadful at long context.
If you have sufficient VRAM for a Q8 of MiniMax then you also have sufficient RAM for the full FP8 model.
You’d have to be smoking crack to use a GGUF when the native format of the model is FP8 and is well supported in vLLM.
ImpressiveNet5886@reddit (OP)
Sadly I don’t have enough VRAM to run things with vLLM :/ 40 GB of VRAM (3 GPUs) with 256 GB of DDR4 RAM, so I’ve been using llama.cpp
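For a setup like this, the informal fit check people do is simple arithmetic: whatever doesn't fit in VRAM spills to system RAM via llama.cpp's layer offload. A minimal sketch of that back-of-the-envelope math, assuming a hypothetical flat allowance for KV cache and compute buffers (the real overhead depends on context length and model architecture):

```python
def fits_hybrid(model_gb: float, vram_gb: float = 40.0, ram_gb: float = 256.0,
                kv_overhead_gb: float = 8.0) -> tuple[bool, float]:
    """Rough check: does a GGUF fit when llama.cpp offloads the portion
    that VRAM can't hold into system RAM?

    kv_overhead_gb is a guessed allowance for KV cache + compute buffers,
    not a measured value.
    """
    # GB that must live in system RAM after filling VRAM
    spill = max(0.0, model_gb + kv_overhead_gb - vram_gb)
    return spill <= ram_gb, spill

# The three ~240 GB quants from the post, against OP's 40 GB VRAM / 256 GB RAM:
ok, spill = fits_hybrid(241.0)  # GLM-5 UD-IQ2_XXS
print(ok, spill)                # fits, but ~209 GB runs from DDR4
```

All three quants in the question land in roughly the same place: they fit, but the bulk of the weights stream from DDR4, so token throughput is bound by RAM bandwidth rather than the GPUs.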
AlexGSquadron@reddit
But I wonder if turbo quant would do anything in this case.
AlexGSquadron@reddit
That's a crazy high amount, I am shocked 🤯. Anyway, I just rent GLM 5.1 for $270 yearly.
__JockY__@reddit
Yeah it’s a pretty sad state of affairs that the tech to run this stuff locally is out of reach for most people.
qubridInc@reddit
In most cases heavy quantization (Q2) hurts quality quite a bit. So a Q8 MiniMax or Q4 Qwen usually gives more reliable results than a huge model compressed to Q2, even if the original model is larger.
EffectiveCeilingFan@reddit
I feel like this is a pretty classic question: high parameter count with a small quant vs low parameter count with a big quant. Here are my initial guesses: I think Qwen3.5 MXFP4 would do the best. Q4 is a very good quantization level, although I think you should use UD-Q4_K_XL or IQ4_XL/NL instead; I've heard of people having issues with MXFP4 Qwen3.5. I think MiniMax would come in second, with GLM in third. I just don't think a Q2 can hold up in this arena. If you do any testing I'd be super interested in finding out, though!
LagOps91@reddit
I wouldn't be so sure. Huge models like that tend to quantize gracefully. I think Qwen3.5 might be marginally better, but I wouldn't be surprised if Q2 GLM 5 beats Q8 MiniMax M2.5 for most tasks.
Hanthunius@reddit
How about you do the testing and let us know?