Qwen3.6-35B-A3B KLDs - INTs and NVFPs
Posted by Phaelon74@reddit | LocalLLaMA | 21 comments

KLD for INTs and NVFP4s.
AS ALWAYS - Use Case is important. Accuracy versus speed versus native kernels on your GPUs.
Things to note again:
- This is done in vLLM, with REAL logits. My repo (https://github.com/phaelon74/vllm/tree/feature/score-mode-ppl-kld) makes changes in the vLLM "hot path", so it's real, it's on GPU, and it takes ~3-5 minutes on RTX 6000s
- KLD does not lie; it's just raw math against logits (see the sketch after this list)
- KLD tells a story of divergence.
- Evals are still important for use-case-specific quality
- A quant can have a worse KLD and still score better on a given eval than a quant with a better KLD. This is bench-maxing, and it's real. Choose the quant for your use case.
- FP8 has worse quality than INT8
  - This is expected, as FP8 here is W8A8, so activations are also quantized to 8-bit
  - FP8 (W8A8) should stay in 8-bit end to end, meaning it should be faster than INT8
- The NVFP4 cake, as always, is a lie.
  - But similar to FP8, NVFP4 (W4A4) should stay in FP4 and "should" be faster than an INT4
  - NVFP4A16 keeps activations at 16-bit and will generally have higher quality/accuracy than NVFP4A4, but remember, this may come at a speed cost.
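For anyone who wants to see what "raw math against logits" means in practice, here's a minimal sketch of the per-token KLD computation. The function name and tensor shapes are illustrative assumptions, not the repo's actual hot-path code:

```python
# Minimal sketch: KL(P_base || P_quant) from raw logits captured on the same prompts.
# Names and shapes are illustrative, not the linked repo's implementation.
import torch
import torch.nn.functional as F

def kld_from_logits(base_logits: torch.Tensor, quant_logits: torch.Tensor) -> float:
    """Mean per-token KL divergence; both tensors are [num_tokens, vocab_size]."""
    log_p = F.log_softmax(base_logits.float(), dim=-1)   # baseline (e.g. BF16) distribution
    log_q = F.log_softmax(quant_logits.float(), dim=-1)  # quantized model's distribution
    per_token = (log_p.exp() * (log_p - log_q)).sum(dim=-1)  # sum_v P(v) * log(P(v)/Q(v))
    return per_token.mean().item()
```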
FullOf_Bad_Ideas@reddit
Fantastic work, I've seen the PRs you made in vllm and discussions in llm-compressor just a few hours ago randomly lol.
What's your opinion on comparing REAP/REAM NVFP4 models to the original BF16 models using KLD? Is it heresy, or should it still be a meaningful quality metric?
Phaelon74@reddit (OP)
If it's a BF16 REAP/REAM against the base model, I think it's a fair pull. A BF16 model against itself should be 0.0 KLD, but REAP/REAM change a lot, so that KLD is going to be skewed. And since it's logits, I'm not sure how well it can be used to say "this model has diverged so far, it's unrecoverable", but I might try it and see what it looks like.
CockBrother@reddit
Would really appreciate seeing a comparison like this with an NVFP4 model quantized and released by Nvidia themselves.
The tools and the process seem to have a very large impact on quality. Is Nvidia even getting it right?
Unfortunately they have not released this model.
Phaelon74@reddit (OP)
Nvidia fixed their Qwen3.5-397B-A13B model after we pointed out how poor a model it was, but only in terms of mixed precision, etc.
Nvidia HAS released some QAD models, but they are rare, since the amount of resources needed is massive. I did this with Llama-3.1-8B-IT. I'll take a look and see if they have any models that are still relevant and have undergone QAD.
suicidaleggroll@reddit
What's the deal with NVFP4? It was supposed to be:
- Near lossless
- Super fast

From what I've seen in these and other results, it looks like it's neither of those things. Quality seems similar to, if not slightly worse than, traditional Q4, and it's the same speed or slower. Are all of these just bad quants?
Phaelon74@reddit (OP)
The answer lies in activations, and in things Nvidia does not tell you. We can have beers/whiskey and I can regale you with my war stories on NVFP4. Here's your TLDR: you have to do special stuff after you NVFP4-quantize a model to bring BACK its intelligence. Nvidia publishes this via papers but doesn't "mention it", because that ruins the guise, etc.
If you want the absolute best NVFP4, look for QAD. If it does not have QAD, look for NVFP4A16, but know it might be slower.
- NVFP4's base scheme is weights 4-bit, activations 4-bit. Quantizing activations to 4-bit makes every model dumber, and thus it requires additional work to make it "smart again"
- If you QAD (quantization-aware distillation) a model after the NVFP4 quant, you can bring it back to within ~1% of FP8 (see the sketch after this list). This requires IMMENSE amounts of GPUs, as you are basically training the model again. For Llama-3.1-8B-IT, it took ~190GB of VRAM. For a 120B model, you would need 32 A100 cards running for ~3 weeks.
- NVFP4 runs everything in 4-bit, so it should be "way faster" than something like INT4 (W4A16), which has weights in 4-bit and activations in 16-bit.
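To give a rough picture of what QAD is doing, here's a hedged sketch of a single distillation step: a frozen BF16 teacher, an NVFP4 (fake-quant) student, and a KL loss on the logits. The model and optimizer plumbing here is assumed, and Nvidia's published recipe has more moving parts than this:

```python
# Sketch of one QAD (quantization-aware distillation) step.
# Teacher/student loading and the quantized forward pass are assumptions.
import torch
import torch.nn.functional as F

def qad_step(teacher, student, input_ids, optimizer):
    with torch.no_grad():
        t_logits = teacher(input_ids).logits          # frozen BF16 teacher
    s_logits = student(input_ids).logits              # NVFP4 student (quantized forward)
    # KL(teacher || student): the same divergence the KLD benchmark measures
    loss = F.kl_div(
        F.log_softmax(s_logits, dim=-1),
        F.log_softmax(t_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```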
jinnyjuice@reddit
At this point, I've just given up on NVFP4.
I'll just continue to trust QuantTrio's AutoRound quants. I just wish they worked with SGLang though.
Phaelon74@reddit (OP)
As for Intel, I missed it, will add it tonight, apologies.
jinnyjuice@reddit
Now that I check it, never mind! It seems that they actually took down the 35B model.
Also, no need to apologise, you don't owe anyone anything! If anything, thanks for sharing.
Unknown_New_God@reddit
https://www.reddit.com/r/LocalLLaMA/s/CuK05tqQaV
r0kh0rd@reddit
Thanks for sharing this. More stuff to add to my list to learn about!
gh0stwriter1234@reddit
Eh, it's NV tech, of course there's hype...
DistanceAlert5706@reddit
Any chance to get the same for the 27B?
Unknown_New_God@reddit
https://www.reddit.com/r/LocalLLaMA/s/CuK05tqQaV
DistanceAlert5706@reddit
Cool, thanks!
It would be interesting to see the KLD of sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP, but idk how to calculate it.
dtdisapointingresult@reddit
Great post, keep these coming.
Can you do Gemma 4 31B? It's the best writing and language model (that doesn't require like 400GB of VRAM); it would be nice to know the best quant to get. Although based on all your posts, QuantTrio's AWQ 4-bit seems like a great pick no matter the model.
JC1DA@reddit
this is awesome, do you have results for Qwen3.6-27B models as well?
Unknown_New_God@reddit
https://www.reddit.com/r/LocalLLaMA/s/CuK05tqQaV
neochron@reddit
I've heard a little about the active parameters (A3B) models. Am I correct in assuming they don't work on Mac with unified RAM if the amount of RAM is less than the parameter count?
BitGreen1270@reddit
I'm a noob, so all of this went over my head. Ironically, I used Qwen3.6-35B-A3B to explain it in layman's terms, and it makes so much sense. But looking at the chart, the mmangkad NVFP4 seems not too bad since it's lower? Or is it still not that good?