nvidia/Qwen3.6-35B-A3B-NVFP4 · Hugging Face

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 46 comments

The NVIDIA Qwen3.6-35B-A3B-NVFP4 model is the quantized version of Alibaba's Qwen3.6-35B-A3B model, which is an auto-regressive language model that uses an optimized transformer architecture. For more information, please check here. The NVIDIA Qwen3.6-35B-A3B-NVFP4 model is quantized with Model Optimizer.

Post Training Quantization

This model was obtained by quantizing the weights of Qwen3.6-35B-A3B to NVFP4 data type, ready for inference with vLLM. Only the weights and activations of the linear operators within transformer blocks in MoE are quantized. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 3.06x.
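As a rough sanity check on the 3.06x figure, here is a back-of-the-envelope sketch. It assumes the NVFP4 layout of 4-bit values plus one 8-bit scale per block of 16 weights (about 4.5 effective bits, ignoring per-tensor scales) and that every tensor left unquantized stays at 16 bits. The 35e9 parameter count is only the nominal figure from the model name, not a number from the model card.

```python
# Back-of-the-envelope check of the ~3.06x size reduction.
# Assumptions (not from the model card): NVFP4 stores 4-bit values plus one
# 8-bit scale per block of 16 weights; unquantized tensors stay at 16 bits.

BF16_BITS = 16.0
NVFP4_BITS = 4.0 + 8.0 / 16.0  # 4.5 effective bits per quantized weight


def quantized_fraction(compression_ratio: float) -> float:
    """Fraction of parameters that must be NVFP4 to hit the given ratio."""
    avg_bits = BF16_BITS / compression_ratio
    # avg_bits = f * NVFP4_BITS + (1 - f) * BF16_BITS, solved for f
    return (BF16_BITS - avg_bits) / (BF16_BITS - NVFP4_BITS)


def size_gb(n_params: float, avg_bits: float) -> float:
    """Size in decimal gigabytes for n_params at avg_bits per parameter."""
    return n_params * avg_bits / 8 / 1e9


ratio = 3.06
avg_bits = BF16_BITS / ratio
frac = quantized_fraction(ratio)
n_params = 35e9  # nominal, taken from the "35B" in the model name

print(f"average bits per parameter: {avg_bits:.2f}")
print(f"fraction of parameters in NVFP4: {frac:.1%}")
print(f"BF16 size:  {size_gb(n_params, BF16_BITS):.1f} GB")
print(f"NVFP4 size: {size_gb(n_params, avg_bits):.1f} GB")
```

Under these assumptions, a 3.06x reduction corresponds to roughly 94% of the parameters being stored in NVFP4 and a checkpoint of about 23 GB, which is consistent with only part of the network being quantized.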

Evaluation

The accuracy benchmark results are presented in the table below:

Precision	MMLU Pro	GPQA Diamond	τ²-Bench Telecom	SciCode	AIME 2025	AA-LCR	IFBench	MMMU PRO
BF16	85.6	84.9	95.5	40.8	89.2	62.0	62.3	74.1
NVFP4	85.0	84.8	94.7	40.6	88.8	62.0	62.8	74.5

panamory@reddit

Can someone explain to me what is the difference with this NVFP4 version and the other versions, like:

https://huggingface.co/RedHatAI/Qwen3.6-35B-A3B-NVFP4 (2.7M downloads)
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-NVFP4 (170k downloads)

Is this supposed to be better? Why and how? The RedHatAI version also seems to have the model_visual file, which seems to be missing from the nvidia version. Also, the MTP file seems to be separate in the RedHatAI version, which I think means a smaller size if you don't want MTP?

computehungry@reddit

This one probably has QAD, i.e. it was trained more after quantizing: the Hugging Face page says the training data is undisclosed, whereas NVIDIA usually says 'no training' for a lot of models.

JustASheepInTheFlock@reddit

The packaging and recipe differ:

RedHatAI uses LLM Compressor; weights ship as separate model.safetensors + model_mtp.safetensors (MTP) + model_visual.safetensors (vision).
nvidia uses TensorRT Model Optimizer (served with --quantization modelopt); ModelOpt by default keeps self-attention at higher precision, which can preserve accuracy on attention-sensitive tasks. Weights ship as 3 fused shards.

Either recipe typically lands within ~1% of the BF16 baseline on language tasks; this benchmark measures the practical, deployment-relevant gap on the workloads you care about.

Quick run on DGX Spark

Coding/Vision/Tool Calling -> Both are at the same level.

Rule-Following -> nvidia is better

Summarisation -> redhatAI is better

jonydevidson@reddit

This, along with the post training, was done by the company powering the whole circus, so if anyone out there can make the best quant for speed and quality, they're probably among the top 5 candidates.

annodomini@reddit

When you make an NVFP4 (or most other forms of quantization these days), you choose certain parameters to leave unquantized, or quantize less aggressively, or use certain scaling methods that scale them in a way that quantization error is minimized. You do this by measuring how much those parameters affect next token predictions on a calibration dataset, and adjusting them until you get the best performance possible.
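To make the scaling idea concrete, here is a toy sketch of block-scaled 4-bit float quantization in the NVFP4 style. It is an illustration under stated assumptions, not ModelOpt's or LLM Compressor's actual recipe: it uses the E2M1 value grid and blocks of 16 weights, picks each block scale from the block's largest magnitude, and skips the FP8 encoding of the scale and the per-tensor scale that real NVFP4 adds. All function names are made up for the example.

```python
# Toy sketch of block-scaled 4-bit float quantization in the NVFP4 style.
# Assumptions: E2M1 value grid, blocks of 16 weights, one scale per block.
# Real NVFP4 also stores the block scale in FP8 and adds a per-tensor scale;
# both are skipped here to keep the example short.

E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16


def quantize_block(block):
    """Return (scale, codes) where each code is a signed E2M1 grid value."""
    amax = max(abs(x) for x in block)
    scale = amax / E2M1_MAGNITUDES[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Map grid values back to approximate weights."""
    return [scale * c for c in codes]


def quantize(weights):
    """Quantize and dequantize a flat list of weights block by block."""
    out = []
    for start in range(0, len(weights), BLOCK_SIZE):
        scale, codes = quantize_block(weights[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# One outlier in the second block only inflates that block's scale.
weights = [0.01 * i for i in range(16)] + [0.01 * i for i in range(15)] + [5.0]
restored = quantize(weights)
print("MSE:", mean_squared_error(weights, restored))
```

In the second block the single outlier forces a large scale, so the small weights next to it all round to zero. That kind of loss is what calibration-driven choices (which tensors to skip, how to scale) are trying to limit.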

So each of these different quantizations may use different calibration datasets. And the calibration dataset can affect how well the quant works for different tasks; for example, if you use an English-only calibration dataset, you may lose more performance on other languages than you lose on English. Or if you only use a QA-based dataset, you may lose more agentic performance than with a dataset that includes agentic traces with a lot of tool calling.

Then you can test afterwards to see how much the quantization actually affected performance on various task evaluations, or you can compare on metrics like KL-divergence (which are also measured against a particular dataset) from the base model that you quantized from.
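The KL-divergence comparison can be sketched as follows. This is a minimal illustration with made-up logits over a four-token vocabulary; in practice the logits would come from running the base and quantized models over the same evaluation text, and the function names here are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(base_logits, quant_logits):
    """KL(base || quant) in nats for one next-token distribution."""
    p = softmax(base_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_kl(base_batch, quant_batch):
    """Average KL over token positions of an evaluation set."""
    pairs = list(zip(base_batch, quant_batch))
    return sum(kl_divergence(b, q) for b, q in pairs) / len(pairs)


# Made-up logits over a 4-token vocabulary at two positions.
base = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.5, 3.0, 0.0]]
quant = [[1.9, 1.1, 0.0, -1.2], [0.4, 0.7, 2.8, 0.1]]
print("mean KL (nats):", mean_kl(base, quant))
```

A value near zero means the quantized model's next-token distribution tracks the base model closely on that dataset; as noted above, the number only says something about the text it was measured on.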

This particular NVIDIA quant shows a very minimal performance decrease on a number of common evaluations. The others cited (RedHat and Unsloth) don't show the same kind of benchmarks, so it's hard to compare.

Unfortunately, there aren't very good independent benchmarks comparing different quants of the same models. You kind of have to just go with the best info you've got, your gut feel, or do your own evals to compare different quants.

SheikhYarbuti@reddit

TIL, Thanks!

6efeet@reddit

And how might this compare to PrismaQuant?

Xamanthas@reddit

Reminder that no consumer nvidia card, including the 6000, has support for NVFP4; it's all just fallback shit.

autisticit@reddit

You mean no support at software level, but at hardware level it's supported right? Right?

LostDrengr@reddit

Yeah, I've seen a few people parrot this recently, mainly on reddit. So I took a quick look into it, and it's misleading: hardware-wise it is absolutely wrong, and the detail seems to get lost in the software stack, where early on it was badly implemented or lacked support.

What seems to be the case right now is that the software is moving on, so these "fallback" slurs are outdated, or people are just not explaining the point well.

see_spot_ruminate@reddit

I think it is people who are salty about missing out on the new hardware. Lots of people are invested in their aging 3090s (which are good, but missing new features).

Where do you source this information?

silenceimpaired@reddit

Oh really? I thought the 5090 did.

brown2green@reddit

They never quantize the input/output layers and the attention, so their "4-bit" quantizations are always too big in practice for 24GB GPUs.

I loaded it, around 47GB used, without mtp. Wow

jadbox@reddit

46gb of vram?!

Client_Hello@reddit

The model weights are 23.5GB, and it's MoE so it doesn't all have to be in VRAM. Any Blackwell card will run this, even with only 8GB VRAM.

QuestionMarker@reddit

Yeah but isn't this quant backwards for wanting to do that? Ordinarily you'd want the MoE layers in RAM/CPU, but those are exactly the ones they've dropped to NVFP4 here.

Yeah, you are right, makes sense. You could squeeze this into 32gb vram with limited context, otherwise need more.

CheatCodesOfLife@reddit

Yep, this was clearly designed to be run entirely on Blackwell.

44.5GB actually

Yikes... I'll stick to unsloth GGUF, thank you.

Iwaku_Real@reddit

Including KV cache

lucidml_lover@reddit

Don't ffn mlp weights make up most of the model anyways? And that is quantised

ThePixelHunter@reddit

Right but if they did, wouldn't quality go to shit? Most quants leave those layers in fp8/fp16 for a reason.

Most GGUF quantizations in practice actually don't and use 6-bit or less for input/output and attention especially, as far as I've seen.

In any case, performance would definitely decrease by quantizing those layers.

rainbyte@reddit

I noticed GGUF uses 5-bit or 6-bit layers, but are those optimized for hardware in some way or emulated?

vLLM compatible quants have a mix of 4+16, 8+8 or 8+16 bit layers, which map perfectly to real hardware.

I think llama.cpp has optimizations (packing, etc.) for mapping those formats efficiently to hardware-native precision, but I don't know the details.

That's good to make it work instead of just failing, but it will decrease performance if there is no accelerated hardware.

Example: on 3090 the fp8 quants get handled by fp16 cores, which are much slower than int8 cores.

I guess 5bit and 6bit are handled by int8 cores, so in some cases q6 will not be much faster than q8, right?

Intelligent-Form6624@reddit

gguf wen?

appakaradi@reddit

why does it take so long for nvidia to produce a quantized version?

chocofoxy@reddit

ikr i have been using the redhat nvfp4 for a month already

siegevjorn@reddit

Any post training done after quantization? Then the benchmark numbers are meaningless.

G G U F W H E N

LinkSea8324@reddit

Managed to get it (another nvfp4, not modelopt) running on RTX 5090 and vllm, working much faster than AWQ

Good luck with sglang lol

crossoverXYZ@reddit

interesting but I wish they'd compare against standard Q4 quants. their method skips attention layers so the actual compression ratio isn't as impressive as it looks

pmttyji@reddit (OP)

I think so. Just searched model card for MTP & found 1 result with vllm serve command. So it's there.

swagonflyyyy@reddit

Benchmarks are impressive, nearly identical performance to bf16.

uti24@reddit

It would be nice to also see a comparison with Q4 without post training