KLD comparison of oQ, Q, MXFP and UD MLX quantizations
Posted by dpswt@reddit | LocalLLaMA | View on Reddit | 5 comments

See detailed results.
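For context: the KLD numbers here presumably mean the token-level KL divergence between the full-precision model's next-token distribution and the quantized model's, averaged over a test corpus. A minimal sketch of that computation in MLX (the function and evaluation setup are an illustration, not OP's actual script):

```python
import mlx.core as mx

def kld_per_token(logits_ref: mx.array, logits_q: mx.array) -> mx.array:
    """KL(P_ref || P_q) at each token position, from raw logits.

    logits_ref / logits_q: [seq_len, vocab] outputs of the
    full-precision and quantized models on the same input tokens.
    """
    # Log-softmax both distributions for numerical stability.
    logp_ref = logits_ref - mx.logsumexp(logits_ref, axis=-1, keepdims=True)
    logp_q = logits_q - mx.logsumexp(logits_q, axis=-1, keepdims=True)
    # KL = sum_v P_ref(v) * (log P_ref(v) - log P_q(v))
    return mx.sum(mx.exp(logp_ref) * (logp_ref - logp_q), axis=-1)
```

The mean of this over a corpus is the usual summary statistic.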
Opening-Broccoli9190@reddit
Thank you for that! What do you think is the reason behind such a high KL for MXFP8? It's on the level of UD3, which makes me wonder.
dpswt@reddit (OP)
Unsloth had to retire MXFP4 for a good reason: it just blindly applies low BPW to tensors that are far too important.
But the MXFP8 results still surprise me. It has group-size 32, yet due to the blockwise-scale nature of the MX formats it still noticeably hurts overall quality.
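To illustrate the blockwise-scale point: every 32-element MX block shares one power-of-two (E8M0) scale derived from the block maximum, so a single outlier costs precision for the other 31 values. A toy round-trip of one MXFP8 (E4M3) block; the rounding model is simplified and illustrative, not bit-exact:

```python
import numpy as np

def mxfp8_roundtrip_block(x: np.ndarray) -> np.ndarray:
    """Toy quantize/dequantize of one 32-element MXFP8 (E4M3) block."""
    E4M3_EMAX = 8                        # exponent of the largest E4M3 binade
    amax = np.max(np.abs(x))
    if amax == 0:
        return np.zeros_like(x)
    # Shared block scale: a power of two chosen from the block max.
    scale = 2.0 ** (np.floor(np.log2(amax)) - E4M3_EMAX)
    y = x / scale
    # E4M3 keeps 3 mantissa bits: spacing is 2^(e-3) within the binade
    # of exponent e, with subnormal spacing below 2^-6.
    e = np.floor(np.log2(np.maximum(np.abs(y), 2.0 ** -6)))
    step = 2.0 ** (e - 3)
    y_q = np.clip(np.round(y / step) * step, -448.0, 448.0)  # 448 = E4M3 max
    return y_q * scale
```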
Mart-McUH@reddit
KLD does not measure everything. Still, if I had to speculate, I suspect MXFP8 is just a naive quant that hits everything the same way. UD3 probably keeps the routers/most important parts of the MoE in higher precision, which may be critical for the MoE to work correctly. E.g., it does not matter how lightly your experts are quantized (or not at all) when the wrong experts are chosen by the router.
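A sketch of what such a recipe could look like as a per-tensor bit-width policy (the rules below are illustrative, not UD3's actual recipe; tensor names follow common Qwen-style conventions):

```python
def bits_for(name: str, default_bits: int = 3) -> int:
    """Illustrative mixed-precision policy for an MoE model.

    The idea: a wrong top-k routing decision cannot be fixed by
    well-preserved experts, so the tiny router tensors keep high
    precision while the bulky expert weights absorb the low BPW.
    """
    # Router ("mlp.gate" in Qwen-style naming, distinct from the
    # experts' gate_proj), embeddings, norms, output head: keep high.
    if name.endswith("mlp.gate") or "router" in name:
        return 8
    if any(k in name for k in ("embed", "norm", "lm_head")):
        return 8
    if ".experts." in name:
        return default_bits   # the vast majority of parameters live here
    return 4                  # attention and other shared layers
```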
Beamsters@reddit
UD3/4's quality/size ratio is pretty wild. Here's a speed comparison against oQ4 on an M1 Max. End-to-end results stay about the same, but oQ4 has a clear advantage in token generation. (The derived columns can be recomputed from TTFT/TPOT; see the sketch after the tables.)
---
Benchmark Model: Qwen3.6-35B-A3B-UD-MLX-4bit

| Test | TTFT (ms) | TPOT (ms) | pp TPS (tok/s) | tg TPS (tok/s) | E2E (s) | Throughput (tok/s) | Peak Mem (GB) |
|---|---|---|---|---|---|---|---|
| pp1024/tg128 | 2062.7 | 17.37 | 496.4 | 58.0 | 4.269 | 269.8 | 20.42 |
| pp4096/tg128 | 6817.8 | 18.61 | 600.8 | 54.2 | 9.181 | 460.1 | 21.20 |
| pp8192/tg128 | 13608.3 | 19.91 | 602.0 | 50.6 | 16.137 | 515.6 | 21.54 |
| pp16384/tg128 | 28287.5 | 23.99 | 579.2 | 42.0 | 31.334 | 527.0 | 22.16 |
| pp32768/tg128 | 62226.1 | 33.13 | 526.6 | 30.4 | 66.434 | 495.2 | 23.51 |
---
Benchmark Model: Qwen3.6-35B-A3B-MLX-oQ4

| Test | TTFT (ms) | TPOT (ms) | pp TPS (tok/s) | tg TPS (tok/s) | E2E (s) | Throughput (tok/s) | Peak Mem (GB) |
|---|---|---|---|---|---|---|---|
| pp1024/tg128 | 2085.6 | 16.50 | 491.0 | 61.1 | 4.180 | 275.6 | 20.08 |
| pp4096/tg128 | 6941.7 | 17.53 | 590.1 | 57.5 | 9.168 | 460.7 | 20.85 |
| pp8192/tg128 | 13736.0 | 18.99 | 596.4 | 53.1 | 16.147 | 515.3 | 21.20 |
| pp16384/tg128 | 28517.9 | 22.66 | 574.5 | 44.5 | 31.396 | 525.9 | 21.82 |
| pp32768/tg128 | 62569.8 | 31.63 | 523.7 | 31.9 | 66.586 | 494.0 | 23.16 |
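For reference, the derived columns above can be recomputed from TTFT/TPOT alone. The formulas below are my reading of the table, not the benchmark tool's documented definitions, but they reproduce the rows up to rounding:

```python
def derived_metrics(pp: int, tg: int, ttft_ms: float, tpot_ms: float) -> dict:
    """Recompute the derived benchmark columns from the raw timings."""
    prefill_s = ttft_ms / 1000            # time to first token
    decode_s = (tg - 1) * tpot_ms / 1000  # remaining generated tokens
    e2e_s = prefill_s + decode_s
    return {
        "pp TPS": pp / prefill_s,         # prompt tokens / prefill time
        "tg TPS": tg / decode_s,          # generated tokens / decode time
        "E2E(s)": e2e_s,
        "Throughput": (pp + tg) / e2e_s,  # all tokens / total time
    }

# derived_metrics(1024, 128, 2062.7, 17.37)
# -> roughly the first UD row above: 496.4, 58.0, 4.269, 269.8
```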
Mart-McUH@reddit
Well, for MoE, if you need to go that low in BPW, you need to do it that way (picking which parts you quantize heavily and which you barely touch), I suppose. The standard recipes no longer work very well there, unlike with dense models.