Heavily quantized Q2 GLM5 vs less quantized Q8 minimax 2.5/Q4 Qwen3.5 397b?
Posted by ImpressiveNet5886@reddit | LocalLLaMA | 9 comments
How would say the quality compares between heavily quantized versions of higher parameter giant models like GLM-5-UD-IQ2_XXS (241GB size) vs similarly sized but less quantized and fewer parameter models like MiniMax-M2.5-UD-Q8_0 (243GB) or Qwen3.5-397B-A17B-MXFP4_MOE (237GB)?
__JockY__@reddit
Never use Q2 for anything you care about; Q2 models start out bad and become utterly dreadful at long context.
If you have sufficient VRAM for a Q8 of MiniMax then you also have sufficient RAM for the full FP8 model.
You’d have to be smoking crack to use a GGUF when the native format of the model is FP8 and is well supported in vLLM.
ImpressiveNet5886@reddit (OP)
Sadly I don’t have enough VRAM to run things with vLLM :/ 40 GB of VRAM (3 GPUs) with 256 GB of DDR4 RAM, so I’ve been using llama.cpp
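For a setup like this, the informal fit check people do is simple arithmetic: whatever doesn't fit in VRAM spills to system RAM via llama.cpp's layer offload. A minimal sketch of that back-of-the-envelope math, assuming a hypothetical flat allowance for KV cache and compute buffers (the real overhead depends on context length and model architecture):

```python
def fits_hybrid(model_gb: float, vram_gb: float = 40.0, ram_gb: float = 256.0,
                kv_overhead_gb: float = 8.0) -> tuple[bool, float]:
    """Rough check: does a GGUF fit when llama.cpp offloads the portion
    that VRAM can't hold into system RAM?

    kv_overhead_gb is a guessed allowance for KV cache + compute buffers,
    not a measured value.
    """
    # GB that must live in system RAM after filling VRAM
    spill = max(0.0, model_gb + kv_overhead_gb - vram_gb)
    return spill <= ram_gb, spill

# The three ~240 GB quants from the post, against OP's 40 GB VRAM / 256 GB RAM:
ok, spill = fits_hybrid(241.0)  # GLM-5 UD-IQ2_XXS
print(ok, spill)                # fits, but ~209 GB runs from DDR4
```

All three quants in the question land in roughly the same place: they fit, but the bulk of the weights stream from DDR4, so token throughput is bound by RAM bandwidth rather than the GPUs.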
AlexGSquadron@reddit
But I wonder if turbo quant would do anything in this case.
AlexGSquadron@reddit
That's a crazy high amount, I am shocked 🤯. Anyway, I just rent GLM 5.1 for $270 yearly.
__JockY__@reddit
Yeah it’s a pretty sad state of affairs that the tech to run this stuff locally is out of reach for most people.
qubridInc@reddit
In most cases heavy quantization (Q2) hurts quality quite a bit. So a Q8 MiniMax or Q4 Qwen usually gives more reliable results than a huge model compressed to Q2, even if the original model is larger.
EffectiveCeilingFan@reddit
I feel like this is a pretty classic question: high parameter count with a small quant vs low parameter count with a big quant. Here are my initial guesses: I think Qwen3.5 MXFP4 would do the best. Q4 is a very good quantization level, although I think you should use UD-Q4_K_XL or IQ4_XL/NL instead; I've heard of people having issues with MXFP4 Qwen3.5. I think MiniMax would come in second, with GLM in third. I just don't think a Q2 can hold up in this arena. If you do any testing I'd be super interested in finding out, though!
LagOps91@reddit
I wouldn't be so sure. Huge models like that tend to quantize gracefully. I think Qwen3.5 might be marginally better, but I wouldn't be surprised if Q2 GLM 5 beats Q8 MiniMax M2.5 for most tasks.
Hanthunius@reddit
How about you do the testing and let us know?