If you're using Nvidia's NVFP4 of Qwen3.5-397B, try a different quant
Posted by Phaelon74@reddit | LocalLLaMA | View on Reddit | 63 comments
If the quant is working well for you, awesome. Its KLD is quite divergent, and that translates to real intelligence lost. The larger the model, the less visible this is, so if you don't see it, rocksauce. If you do, try Sehyo's NVFP4 or Quantrio's AWQ, which is very accurate.

FullOf_Bad_Ideas@reddit
I made a bunch of EXL3 quants for Qwen 3.5 397B - https://huggingface.co/cpral/Qwen3.5-397B-A17B-exl3
Not all of them are the best for the size, as the "optimized" quants didn't quite outperform the normal quants, but there's a size for everyone starting at just 104 GiB. I'll update them to make them more optimized later once I get the hang of it (messing with hand optimization on Hermes 4 405B now).
I think you're trying to make your measurements directly comparable to exllamav3: is it done with 100 rows, the same as the exllamav3 default? That would mean Nvidia's NVFP4 performs about as well as my 2.5bpw, which is just terrible.
ciprianveg@reddit
What about the GPTQ Qwen published, shouldn't that be better? Qwen/Qwen3.5-397B-A17B-GPTQ-Int4
Phaelon74@reddit (OP)
The LLM Compressor team (Red Hat) prefers GPTQ. From all of my research, AWQs are better than GPTQs, but there are a lot more knobs with GPTQs than AWQs, so yeah, maybe that one could be better. This took about 12 hours on four B200s to do, so once my wallet recovers, I can try that GPTQ as well. If Qwen published GPTQs for the smaller ones, I should be able to do those more easily, and that could give us a solid comparable.
Sorry I didn't get to this one. I grabbed two NVFP4s and two AWQs.
FitVariation5429@reddit
Very nice job! Looking forward to the benchmark of GPTQ!
ciprianveg@reddit
Thank you for these tests. Qwen also published GPTQ for the smaller 3.5 models.
einthecorgi2@reddit
Seems they updated the model, I wonder if it will do better.
PrysmX@reddit
Ohhh, I've been using Sehyo's NVFP4 the entire time for Qwen3 Coder Next and been scratching my head about how so many people are calling out issues with NVFP4 when it has been working great for me lol.
victoryposition@reddit
I've found that nvidia's NVFP4 quants haven't been s-tier. Quantrio is an expert at calibration, which makes all the difference in the KLD.
Phaelon74@reddit (OP)
Fully agree, and we have solid proof that datasets matter, coupled with how and where you quant. This is a solid showing by them!
victoryposition@reddit
Which is such a surprising bummer from Nvidia. You'd think they'd want their NVFP4 format to really shine. But I suppose if the NVFP4 runs as well as the full quant, that might reduce compute needs and sell less hardware? I don't know...just seems weird their quants seem low effort.
Icy_Concentrate9182@reddit
Nvidia is trying to do too many things at the same time. It's natural they're dropping the ball on NVFP4.
NVFP4 from the hardware perspective, however, in my limited experience with a 5070 Ti, is brilliant.
I'm hoping we get rock solid llama.cpp support and some good models to start using ASAP... as vLLM has issues of their own with their implementation...
but it's all been rather slow.. Blackwell has been out for more than a year!
AdamDhahabi@reddit
I just saw NVFP4 support was merged today https://github.com/ggml-org/llama.cpp/pull/19769
Icy_Concentrate9182@reddit
Not cuda yet. Hopefully soon
Phaelon74@reddit (OP)
It's their prime selling point for Blackwell, NVFP4, and yet the secret sauce for it appears to be QAD, and QAD is EXPENSIVE. To QAD a 120B model, you need 32 GPUs for ~6 days. So not sure they want to pivot to doing that for the OSS community. For enterprise customers, they probably help them grab a model and QAD it, etc.
Monad_Maya@reddit
Quantization Aware Distillation - https://research.nvidia.com/labs/nemotron/files/NVFP4-QAD-Report.pdf
Adding it here since I didn't know either.
SkyFeistyLlama8@reddit
32 GPUs should be pennies for Nvidia.
victoryposition@reddit
Looks like I'm gonna have to take one for the team and build a 32 GPU system.
__JockY__@reddit
🫡
Phaelon74@reddit (OP)
This is where I am. I have a 120B model I LOVE, and I want it to be NVFP4 and good. I priced it out: about ~$6,000 for 32 A100s to do this, and I just can't stomach that for the model I love :(
victoryposition@reddit
What's the model? Why does QAD need 2TB+ of vram to run that? Could a system with say 768GB vram and 2TB system ram do it.. albeit take longer?
Informal-Spinach-345@reddit
QuantTrio quants rule. Cyanwiki needs work. Thanks for the benchmarks.
jinnyjuice@reddit
Anyone else besides Quantrio you would recommend?
_cpatonn@reddit
Thanks for testing my quant, and for raising this problem with me! It is true that there is a quality issue with my Qwen 3.5 397B, as it was quantized from a different config than my other Qwen 3.5 quants.
It is being requantized at the moment :)
Phaelon74@reddit (OP)
Hey there:
1). Doing BF16 would require eight B200s, which no one could rent me at the time. Doing it against FP8 is fine, as the FP8 KLD divergence versus BF16 will literally be 0.001 or something to that effect. Doing it against BF16 would just show all models uniformly worse. At some point in the future I WILL do BF16, but it's not needed: BF16 won't show any model to be better, it would show all models to be worse across the board, equally.
2). I load the full vocab in VRAM. Look at it again. I do exactly what Turbo does for EXL3: full vocab loaded. This is why it takes so much VRAM. We do every position, every window, for the FULL VOCAB.
Trust me, I spent 3 months doing this. It's rock solid. I'm happy to take all feedback, and thank you for reviewing the code, but it does do full vocab.
To triple-check myself, I asked a colleague and all three frontier models:
I reviewed my own code, my friend reviewed it, and we asked Opus, Gemini, and GPT5.4, and all of them agree with certainty: I load the full vocab.
GPU model running:
prompt_hidden_states = hidden_states[offset : offset + num_logits]
logits = self.model.compute_logits(prompt_hidden_states)
Compute_Logits calls this in qwen3.5:
def compute_logits(
    self,
    hidden_states: torch.Tensor,
) -> torch.Tensor | None:
    return self.logits_processor(self.lm_head, hidden_states)
Logit processing is then done here:
def _get_logits(
    self,
    hidden_states: torch.Tensor,
    lm_head: VocabParallelEmbedding,
    embedding_bias: torch.Tensor | None,
) -> torch.Tensor | None:
    logits = lm_head.quant_method.apply(lm_head, hidden_states, bias=embedding_bias)
    logits = self._gather_logits(logits)
    if logits is not None:
        logits = logits[..., : self.org_vocab_size]
    return logits
In phase 2, here is where again, full vocab is hammered home:
if is_kld_mode:
    from safetensors.torch import safe_open
    with safe_open(
        request.reference_logits_path,
        framework="pt",
        device=str(self.device),
    ) as f:
        ref_logits_full = f.get_tensor(request.reference_logits_key).to(
            self.device
        )
    ref_logits = ref_logits_full[start_idx : start_idx + num_logits]
    vs = min(logits.shape[-1], ref_logits.shape[-1])
    log_probs_model = F.log_softmax(logits[..., :vs].float(), dim=-1)
    log_probs_ref = F.log_softmax(ref_logits[..., :vs].float(), dim=-1)
    kld_per_pos = F.kl_div(
        log_probs_model,
        log_probs_ref,
        reduction="none",
        log_target=True,
    ).sum(dim=-1)
    kld_sum = kld_per_pos.sum().item()
    kld_count = kld_per_pos.numel()
Every Position, Every Window, FULL VOCAB:
for start_idx in range(0, num_tokens - context_length + stride, stride):
    end_idx = start_idx + context_length
    if end_idx > num_tokens:
        break
    window_tokens = tokens[start_idx:end_idx]
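Each window's kld_per_pos then just gets rolled up into the overall statistics, roughly like this (illustrative sketch, not my exact code), which is where numbers like mean/median/P95/P99/max KLD come from:
import torch

def summarize_kld(per_window_klds):
    # per_window_klds: list of 1-D kld_per_pos tensors, one per window
    all_kld = torch.cat(per_window_klds)
    return {
        "mean": all_kld.mean().item(),
        "median": all_kld.median().item(),
        "p95": torch.quantile(all_kld, 0.95).item(),
        "p99": torch.quantile(all_kld, 0.99).item(),
        "max": all_kld.max().item(),
    }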
festr__@reddit
u/Phaelon74 hello, lukealonso did a new NVFP4 quant and we also experimented with a hybrid quant, where the shared expert is BF16. We have these results now (if you want to discuss with us on Discord: https://discord.gg/wrzpQ9aHP9 just poke us on the qwen35 channel):
Qwen3.5-397B-A17B NVFP4 - Full Comparison (all checkpoints)
Reference: Qwen/Qwen3.5-397B-A17B-FP8 (TP8) | Dataset: WikiText-2, 204,800 positions

Checkpoint                     | Mean KLD | Median KLD | P95 KLD | P99 KLD | Max KLD | NaN
hybrid/lukealonso-NVFP4        | 0.0352   | 0.0068     | 0.147   | 0.521   | 10.21   | 0
lukealonso/Qwen3.5-NVFP4       | 0.0590   | 0.0143     | 0.239   | 0.798   | 20.12   | 0
hybrid/nvidia-NVFP4            | 0.0845   | 0.0188     | 0.365   | 1.182   | 7.14    | 0
nvidia/Qwen3.5-397B-A17B-NVFP4 | 0.1085   | 0.0273     | 0.471   | 1.409   | 19.60   | 0

The hybrid/lukealonso checkpoint is 3x better than the original nvidia checkpoint. The nvidia original is the only one crossing the 0.1 "significant loss" threshold.
fiery_prometheus@reddit
It would be great to have Unsloth here as well, considering how much they write about quantization and datasets, but I guess they don't make these kinds of quants.
Phaelon74@reddit (OP)
They are slowly getting into these, but the AWQ/GPTQ/NVFP4 quant landscape is a bit different. We also need changes to how PPL and KLD are captured in llama.cpp, as it's an incomplete implementation relative to Turbo's robustness and my implementation in vLLM.
Take this with a grain of salt, as the scores for GGUFs are going to change, but here's a solid play in the KLD space, for some really good datapoints we have:
These are AES quants. When he finishes his work to bring llama.cpp up to snuff on PPL/KLD, we'll post a master graph for Llama3.1-8b-instruct that will have his, Ubers, Unsloths, and AWQ/NVFP4s from myself, mratsim, quantrio, sehyo, etc.
fiery_prometheus@reddit
What about adding agentic/long-term tasks as well? It doesn't have to be state of the art or anything, but damage to kld_95 etc. can cause issues in long-term agentic workflows where small errors propagate / cause drift. But sounds exciting, looking forward to the results! :-)
Phaelon74@reddit (OP)
That's not really how KLD works per se. I'll look for datasets that have some of that, but you don't do multi-turn to get KLD, as it's impossible to compare the first statement to the second, logit-wise: a quant will answer differently, and then the second turn would be 100% different and impossible to compare.
fiery_prometheus@reddit
Sorry for my bad notation :D So it's not possible to have KLD values over longer sequences? If you compare the logit distribution for the quantized model with the truthful distribution in the KLD function and then do that over sequences of tokens, i.e. \sum_{t=1}^{tok_num} kld_token(P || Q), where P is the probability distribution for the quantized model and Q is for the truthful model at a given token t, wouldn't you have a function which would 'grow' the more 'divergent' the longer token sequences become, as the model generates logit distributions for the same input and starts to drift more and more?
This way we could quantify drift over sequences instead of only a single token, and accumulated drift would become much more measurable.
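To make the sum concrete, here's a toy PyTorch sketch of what I mean (function name made up, both models scoring the same teacher-forced input):
import torch.nn.functional as F

def cumulative_kld(quant_logits, ref_logits):
    # quant_logits / ref_logits: [num_positions, vocab_size] logits for the same input tokens
    vs = min(quant_logits.shape[-1], ref_logits.shape[-1])
    log_p = F.log_softmax(quant_logits[..., :vs].float(), dim=-1)  # P, quantized model
    log_q = F.log_softmax(ref_logits[..., :vs].float(), dim=-1)    # Q, truthful model
    kld_per_pos = F.kl_div(log_p, log_q, reduction="none", log_target=True).sum(dim=-1)
    return kld_per_pos.cumsum(dim=0)  # running \sum_{t=1}^{T} kld_t over the sequence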
Phaelon74@reddit (OP)
This is why we have a fixed context of 2048 and a sliding stride of 512. We overlap to gauge it, but we don't balloon out past 2048.
The point of KLD in general is divergence. If a model was trained at high context per turn, yes, that could be valuable, but you would need to give the LLM some type of instruction to write a response of 12k tokens, or 32k tokens, etc., that is of value. You CANNOT do multi-turn, it's 1 turn only. Getting an LLM to write 32k tokens in a single-turn response is not very common anymore, with thinking and tool calls, etc.
That long a return would really only be valuable to KLD if the model was trained to give LONG responses.
jinnyjuice@reddit
What about
Qwen/Qwen3.5-122B-A10B-GPTQ-Int4, the original 4-bit from Qwen?
Phaelon74@reddit (OP)
I'll plan to do the rest of the Qwen family, when my 6000s free up again.
digitalfreshair@reddit
Super interesting, thanks for this
NNN_Throwaway2@reddit
Good info. I was just wondering if there were benches of these around.
Phaelon74@reddit (OP)
This one took all day yesterday and was expensive AF. I'll do the rest of the Qwens shortly as well, as those are smaller and can be done on smaller footprint, etc.
dtdisapointingresult@reddit
Why is it so expensive?
I'm not experienced with the Nvidia world, but for GGUFs, any home user can evaluate KLD/PPL stats in 5 minutes using llama.cpp's llama-perplexity command.
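From memory it's something like this (check llama-perplexity --help, the exact flag names may have shifted between versions):
./llama-perplexity -m model-bf16.gguf -f wiki.test.raw --kl-divergence-base base_logits.bin
./llama-perplexity -m model-q4.gguf -f wiki.test.raw --kl-divergence-base base_logits.bin --kl-divergence
The first run saves the base model's logits, the second scores the quant against them.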
Phaelon74@reddit (OP)
You have to load the whole model into VRAM and the whole vocab of said model, so for every window you then evaluate every position. The PPL/KLD implementation in llama.cpp is actually lacking a lot, so AES has a PR to upgrade llama.cpp's implementation to what Turbo did in EXL3, which I then replicated into vLLM.
So: load the full base model into VRAM, run logits, get reference logits. Then run each quant fully in VRAM and do on-GPU logit comparisons. It's WAY faster than going GPU to CPU, but it requires extreme amounts of VRAM.
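For a feel of the scale, a rough back-of-envelope (ballpark assumptions on my part, not exact numbers):
# assuming a ~150k vocab like the Qwen3 family, fp32 logits
context_length = 2048              # positions per window
vocab_size = 150_000               # assumption; Qwen3-family tokenizer is roughly 151k entries
bytes_per_logit = 4                # float32
gb_per_window = context_length * vocab_size * bytes_per_logit / 1e9
print(f"~{gb_per_window:.1f} GB of logits per window")  # ~1.2 GB
And that's just the logits for one window, before the full-precision reference weights and the quant weights themselves.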
dtdisapointingresult@reddit
Can you clarify what's lacking? I can't find the PR you reference. I'm a novice, but I was counting on doing analysis like this soon to determine once and for all who's the best GGUF quantizer.
My understanding of what it does is:
The whole process takes about 5 minutes on my DGX Spark. (obviously not loading 400Bs)
Phaelon74@reddit (OP)
Sure thing. Turbo uses a context window of 2048. llama.cpp today only uses 512, and it only does analysis on the last 256 of that 512 window. Turbo's implementation does ALL 2048: every position, every window, statistically analyzed. Then on the next window he strides 512 right, so window 0 is 0-2047, window 1 is 1535-3582, etc. So he has overlap. This is basically double counting, but the effect is that you are covering different context windows and, again, every position, every window.
It is incredibly accurate at seeing divergence. So super accurate PPL/KLD, because the logits are rock solid.
dtdisapointingresult@reddit
I see. If llama-perplexity sucks there, then I understand the appeal of a more accurate KLD measurement.
However, perhaps it was added recently, but one of the args you can give llama-perplexity is --ctx-size. Default is 512, but you could set it to 2048. Of course, internally it's possible it's not doing the overlapping windows like Turbo does.
Btw here is the command Unsloth used for their KLD experiments in Qwen 3.5:
CMD env LLAMA_SET_ROWS=1 ./llama.cpp/llama-perplexity --flash-attn on --fit off --batch-size 16384 --ubatch-size 16384 --parallel 1 --mlock --no-mmap --device CUDA0 --model /mnt/disks/unslothai/unslothai/Qwen3.5-35B-A3B-GGUF/BF16/Qwen3.5-35B-A3B-BF16-00001-of-00002.gguf --file /mnt/disks/unslothai/unslothai/wikitext-2-raw/wiki.test.raw --ctx-size 512 --save-all-logits /mnt/disks/unslothai/unslothai/Qwen3.5-35B-A3B-GGUF-NEW-UD/kld_logs/pipeline_base_logits.bin
__JockY__@reddit
If I understand correctly youβre saying that itβs expensive because you had to rent lots of GPUs for a long period of time?
Or are you saying that it cost a lot in electricity because you ran it locally?
Phaelon74@reddit (OP)
I can't run Qwen3.5-397B fully in VRAM, so I had to rent the GPUs. It's a large model, and thus takes a long time to load in and out of memory as it relates to vLLM, etc.
So it was expensive in $$USD, as it had to load and unload multiple models through its testing, as well as building CUDA graphs, etc.
__JockY__@reddit
Gotcha. Thanks for taking one for the team!
Amazing to see the KLD of Quanttrioβs AWQ vs the Nvidia NVFP4.
In your opinion is KLD telling enough of the story to be a reliable guide in picking a quant for agentic coding, or are there other factors at play that arenβt shown but have a non-negligible effect on quality?
Phaelon74@reddit (OP)
Great question. KLD is just math, but it's indisputable math. What it tells us is how large a divergence a model has from its base. This is a great general guide for "This model feels dumber, but why?". It is not the only metric we should use, and for real-world production use cases we should rely heavily on evals, which are purpose-built. So for coding, if the best coding eval says Quant A is the best but KLD says it's mid-pack, the eval reigns supreme. One interesting note though: even if a quant evals better at a discipline, it may be worse in practice if you are taking, say, real-world knowledge and incorporating it into a narrow discipline, etc.
TLDR; KLD is a great barometer of how much intelligence is lost during quantization, and should be coupled with evals for production environments. For us consumers, KLD is a solid place to start.
celsowm@reddit
Thanks for your work! I am very interested in those numbers because we want to replace our old 235B FP8.
grumd@reddit
Thank you for your hard work :) I won't be using the bigger model but benches for smaller model quants will be super helpful
Phaelon74@reddit (OP)
As soon as my work on this local training model is done, my 6000s will free up and I can hammer these out quickly! Will work to have that done by this weekend!
festr__@reddit
@Phaelon74 the baseline is FP8, but Nvidia quantized from BF16. Does it make any difference?
Phaelon74@reddit (OP)
Great call out! I could not find anyone to rent me eight B200s, as the BF16 model cannot fit on four B200s, so I worked with what I could. FP8 and BF16 are very, very close accuracy-wise. So the only difference would be that the quants could all be universally a little bit worse, if say FP8 had a KLD of 0.0001 from BF16, etc.
Professional-Bear857@reddit
Maybe a silly question, but do the AWQ quants work in LM Studio on macOS?
chisleu@reddit
You should be running mlx versions IMHO
VectorD@reddit
I am Sehyo, the creator of the quant mentioned above. Thanks for this graph / mention!
Phaelon74@reddit (OP)
For sure. Do you share what datasets you use for your quanting? If not, no worries. As I continue to build KLD for vLLM-based inferencing, one thing that another redditor pointed out was that datasets are incredibly important. It would be interesting to compare the datasets you used versus what Nvidia used.
Also, did you do any QAD or just PTQ? Are your recipes OSS or do you keep them private? Wondering if you are addressing different objectives in your quanting that have led to your NVFP4 being way better.
Anything you want to share, I would love to know! Your NVFP4 is closer to mine when I do first-epoch QAD, so I would be interested to learn more from ya!
VectorD@reddit
The calibration information is written on the model cards. I have a PR to LLM Compressor at the moment which adds support for Qwen3.5, mtp, multimodal, etc, which is the PR used for my quants. You can find the PR from Sehyo on github.
Phaelon74@reddit (OP)
Yeap, I also made a PR for Qwen3.5 and then canceled mine once I noticed you already had one. We have questions for you on that thread!
Thumbs up to the model cards, I'll take a look. So I'm assuming then that you did not do QAD?
VectorD@reddit
Ah I will get back to it soon, too much work making me tired these days. And yes, no QAD
Phaelon74@reddit (OP)
10-4, I feel ya. my QAD of Llama3.1-8B-Instruct will be finishing in the next 4-5 days and I then plan to make a larger post about NVFP4 in general.
TaiMaiShu-71@reddit
I've been using the Nvidia model, it performed decently, but looking at this I'm going to try out quanttrio model and see if it's better.
Phaelon74@reddit (OP)
And that's the key: if the model is doing what you need it to do, rocksauce. If it "feels" off, it does appear to be subpar per KLD, so try another one and see if it's better.
The more data/info we have, the better off we are!
sean_hash@reddit
KL divergence on a 397B MoE is tricky; per-expert error compounds through routing. The calibration dataset ends up mattering way more than the bit format at that scale.
Phaelon74@reddit (OP)
KK, I got you now. I use Wikitext ONLY for KLD. This is what EXL3 uses, so when I built KLD into vLLM, I did it with this purpose so we could do true apples-to-apples, etc.
I'll work in the future to utilize a dataset similar to what I use when I quant, but to be aligned: KLD is just KLD. Wikitext is plain enough that I find it hard to believe a robust quant would fail so hard on it, which could also suggest the quanter didn't use a robust dataset either!
Phaelon74@reddit (OP)
Thanks for the feedback! Since this is truly logit-derived, walk me through where this breaks down? KLD is how Nvidia does QAT/QAD on all models, including MoEs. Why would they depend on KLD as their main metric if it was this prone to issues? Also, no sass intended, just asking to learn, etc.