nvidia/Qwen3.6-35B-A3B-NVFP4 · Hugging Face
Posted by pmttyji@reddit | LocalLLaMA | 46 comments
The NVIDIA Qwen3.6-35B-A3B-NVFP4 model is the quantized version of Alibaba's Qwen3.6-35B-A3B model, which is an auto-regressive language model that uses an optimized transformer architecture. For more information, please check here. The NVIDIA Qwen3.6-35B-A3B-NVFP4 model is quantized with Model Optimizer.
Post Training Quantization
This model was obtained by quantizing the weights of Qwen3.6-35B-A3B to NVFP4 data type, ready for inference with vLLM. Only the weights and activations of the linear operators within transformer blocks in MoE are quantized. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 3.06x.
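The 3.06x figure is lower than the naive 16/4 = 4x because only part of the model is quantized and NVFP4 stores scale factors alongside the 4-bit values. Below is a rough back-of-the-envelope sketch. It assumes one 8-bit scale per 16-element block (about 4.5 bits per quantized weight), assumes every unquantized tensor stays at 16 bits, and ignores per-tensor scales and file metadata. The function names are illustrative and are not part of Model Optimizer.

```python
NVFP4_BITS = 4 + 8 / 16   # 4-bit value plus one 8-bit scale shared by 16 values
BF16_BITS = 16


def compression_ratio(quantized_fraction: float) -> float:
    """Overall size reduction when only a fraction of the weights is NVFP4."""
    avg_bits = quantized_fraction * NVFP4_BITS + (1 - quantized_fraction) * BF16_BITS
    return BF16_BITS / avg_bits


def quantized_fraction_for(ratio: float) -> float:
    """Invert compression_ratio: fraction of weights that must be NVFP4."""
    return (BF16_BITS - BF16_BITS / ratio) / (BF16_BITS - NVFP4_BITS)


if __name__ == "__main__":
    print(f"all weights quantized: {compression_ratio(1.0):.2f}x")
    print(f"fraction implied by 3.06x: {quantized_fraction_for(3.06):.3f}")
```

Under these assumptions, a 3.06x reduction corresponds to roughly 94% of the parameters being stored in NVFP4, with the remainder left at 16 bits.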
Evaluation
The accuracy benchmark results are presented in the table below:
| Precision | MMLU Pro | GPQA Diamond | τ²-Bench Telecom | SciCode | AIME 2025 | AA-LCR | IFBench | MMMU PRO |
|---|---|---|---|---|---|---|---|---|
| BF16 | 85.6 | 84.9 | 95.5 | 40.8 | 89.2 | 62.0 | 62.3 | 74.1 |
| NVFP4 | 85.0 | 84.8 | 94.7 | 40.6 | 88.8 | 62.0 | 62.8 | 74.5 |
panamory@reddit
Can someone explain to me what the difference is between this NVFP4 version and the other versions, like:
https://huggingface.co/RedHatAI/Qwen3.6-35B-A3B-NVFP4 (2.7M downloads)
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-NVFP4 (170k downloads)
Is this supposed to be better? Why and how? The RedHatAI version seems to also have the model_visual file, which seems to be missing from the nvidia version. Also the MTP file seems to be separated in the RedHatAI version, which I think means a smaller size if you don't want MTP?
computehungry@reddit
This one probably has QAD. Trained more after quantizing. Because in the huggingface page, it says training data undisclosed. Usually, nvidia says 'no training' for a lot of models.
JustASheepInTheFlock@reddit
The packaging and recipe differ:
model.safetensors + model_mtp.safetensors (MTP) + model_visual.safetensors (vision). --quantization modelopt); ModelOpt by default keeps self-attention at higher precision, which can preserve accuracy on attention-sensitive tasks. Weights ship as 3 fused shards. Either recipe typically lands within ~1% of the BF16 baseline on language tasks; this benchmark measures the practical, deployment-relevant gap on the workloads you care about.
Quick run on DGX Spark
Coding/Vision/Tool Calling -> Both are same level.
Rule-Following -> nvidia is better
Summarisation -> redhatAI is better
jonydevidson@reddit
This, along with the post training, was done by the company powering the whole circus, so if anyone out there can make the best quant for speed and quality, they're probably among the top 5 candidates.
annodomini@reddit
When you make an NVFP4 (or most other forms of quantization these days), you choose certain parameters to leave unquantized, or quantize less aggressively, or use certain scaling methods that scale them in a way that quantization error is minimized. You do this by measuring how much those parameters affect next token predictions on a calibration dataset, and adjusting them until you get the best performance possible.
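To make the scaling idea concrete, here is a minimal sketch of block-scaled 4-bit float rounding in NumPy. It is a simplified illustration, not ModelOpt's actual recipe: it keeps the per-block scale in full precision (real NVFP4 stores it in FP8), does no calibration, and `fake_quantize_fp4` is a hypothetical helper name.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quantize_fp4(weights: np.ndarray, block_size: int = 16) -> np.ndarray:
    """Round each block to the FP4 grid after scaling its max magnitude to 6."""
    blocks = weights.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / FP4_LEVELS[-1]
    scale[scale == 0] = 1.0  # avoid dividing by zero for all-zero blocks
    scaled = blocks / scale
    # Snap every magnitude to the nearest representable level.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_LEVELS).argmin(axis=-1)
    dequantized = np.sign(scaled) * FP4_LEVELS[idx] * scale
    return dequantized.reshape(weights.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 64)).astype(np.float32)
    w_q = fake_quantize_fp4(w)
    print("mean squared error:", float(np.mean((w - w_q) ** 2)))
```

Because every block gets its own scale, one outlier only coarsens the grid for its 16 neighbours rather than for the whole tensor. Calibration-based recipes go further by choosing scales, or which tensors to skip, based on measured output error.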
So each of these different quantizations may use different calibration datasets. And the calibration dataset can affect how well the quant works for different tasks; for example, if you use an English-only calibration dataset, you may lose more performance on other languages than you lose on English. Or if you only use a QA based dataset, you may lose more agentic performance than a dataset that includes agentic traces with a lot of tool calling.
Then you can test afterwards to see how much the quantization actually affected performance on various task evaluations, or you can compare on metrics like KL-divergence (which are also measured against a particular dataset) from the base model that you quantized from.
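As a rough sketch of the KL-divergence comparison, assuming you have already collected next-token logits from the base and quantized models on the same evaluation text (random arrays stand in for them here, and the function names are illustrative):

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def mean_kl_divergence(base_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """Mean KL(base || quant) over token positions, in nats."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(quant_logits)
    kl_per_token = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_token.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(8, 100))  # 8 positions, 100-token vocabulary
    quant = base + 0.05 * rng.normal(size=base.shape)  # stand-in for quantization noise
    print("KL vs. itself:", mean_kl_divergence(base, base))
    print("KL vs. noisy copy:", mean_kl_divergence(base, quant))
```

A value near zero means the quantized model's next-token distribution closely tracks the base model on that particular text; as noted above, the number depends on which dataset you measure it on.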
This particular NVIDIA quant shows very minimal performance decrease on a number of common evaluations. The others cited (RedHat and Unsloth), don't show the same kind of benchmarks, so it's hard to compare.
Unfortunately, there aren't very good independent benchmarks comparing different quants of the same models. You kind of have to just go with the best info you've got, your gut feel, or do your own evals to compare different quants.
SheikhYarbuti@reddit
TIL, Thanks!
6efeet@reddit
And how might this compare to PrismaQuant?
Xamanthas@reddit
Reminder that no consumer nvidia card, including the 6000, has support for NVFP4; it's all just fallback shit.
autisticit@reddit
You mean no support at software level, but at hardware level it's supported right? Right?
LostDrengr@reddit
Yeah, I've seen a few people parrot this recently, mainly on reddit. So I did a quick dig into it, and it's misleading: hardware-wise it is absolutely wrong, and the detail seems to be lost in the software stack, where early on it was badly implemented or lacked support.
What seems to be the case right now is that the software is moving on, so these "fallback" slurs are outdated, or people are just not explaining the point well.
see_spot_ruminate@reddit
I think it is people who are salty and missing out on the new hardware. Lots of people are invested in their aging 3090s (which are good, but missing new features).
LostDrengr@reddit
Where do you source this information?
silenceimpaired@reddit
Oh really? I thought the 5090 did.
brown2green@reddit
They never quantize the input/output layers and the attention, so their "4-bit" quantizations are always too big in practice for 24GB GPUs.
autisticit@reddit
I loaded it, around 47GB used, without mtp. Wow
jadbox@reddit
46gb of vram?!
Client_Hello@reddit
The model weights are 23.5gb, and it's MoE, so it doesn't all have to be in vram. Any Blackwell card will run this, even with only 8gb vram.
QuestionMarker@reddit
Yeah but isn't this quant backwards for wanting to do that? Ordinarily you'd want the MoE layers in RAM/CPU, but those are exactly the ones they've dropped to NVFP4 here.
Client_Hello@reddit
Yeah, you are right, makes sense. You could squeeze this into 32gb vram with limited context, otherwise need more.
CheatCodesOfLife@reddit
Yep, this was clearly designed to be run entirely on Blackwell.
autisticit@reddit
44.5GB actually
jadbox@reddit
Yikes... I'll stick to unsloth GGUF, thank you.
Iwaku_Real@reddit
Including KV cache
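The gap between the 23.5gb of weights and the reported usage is plausible because serving engines hold a KV cache on top of the weights; vLLM in particular reserves a configurable share of GPU memory for it up front. A minimal sketch of the usual per-sequence estimate follows. The architecture numbers in the example call are placeholders, not values from the Qwen3.6 config, and `kv_cache_gib` is a hypothetical helper.

```python
def kv_cache_gib(
    context_tokens: int,
    attention_layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes for an FP16/BF16 cache
) -> float:
    """Approximate KV cache size in GiB for one sequence."""
    # Factor of 2: one key and one value vector per token, head and layer.
    total_bytes = (
        2 * context_tokens * attention_layers * kv_heads * head_dim * bytes_per_value
    )
    return total_bytes / 2**30


if __name__ == "__main__":
    # Placeholder architecture values, NOT taken from the Qwen3.6 config.
    size = kv_cache_gib(
        context_tokens=131_072, attention_layers=10, kv_heads=2, head_dim=256
    )
    print(f"KV cache: {size:.1f} GiB")
```

Plugging in the real layer count, KV head count and head size from the model's config, multiplied by the number of concurrent sequences, gives a more honest memory budget than the weight size alone.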
lucidml_lover@reddit
Don't ffn mlp weights make up most of the model anyways? And that is quantised
ThePixelHunter@reddit
Right but if they did, wouldn't quality go to shit? Most quants leave those layers in fp8/fp16 for a reason.
brown2green@reddit
Most GGUF quantizations in practice actually don't and use 6-bit or less for input/output and attention especially, as far as I've seen.
In any case, performance would definitely decrease by quantizing those layers.
rainbyte@reddit
I noticed gguf uses 5bit or 6bit layer, but are those optimized for hardware in some way or emulated?
vLLM compatible quants have a mix of 4+16, 8+8 or 8+16 bit layers, which map perfectly to real hardware.
brown2green@reddit
I think llama.cpp has optimizations (packing, etc.) for mapping those formats efficiently to hardware-native precision, but I don't know the details.
rainbyte@reddit
That's good to make it work instead of just failing, but it will decrease performance if there is no accelerated hardware.
Example: on 3090 the fp8 quants get handled by fp16 cores, which are much slower than int8 cores.
I guess 5bit and 6bit are handled by int8 cores, so in some cases q6 will not be much faster than q8, right?
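The native-versus-fallback question can be framed as a lookup on CUDA compute capability. The table below is a simplified summary from public architecture descriptions, written for illustration only; it is not an exhaustive hardware matrix, and whether a given inference engine actually ships kernels for a format is a separate software question.

```python
# Tensor-core formats by CUDA compute capability (major, minor).
# Simplified for illustration; check vendor documentation for specifics.
NATIVE_FORMATS = {
    (8, 6): {"fp16", "bf16", "int8", "int4"},         # Ampere, e.g. RTX 3090
    (8, 9): {"fp16", "bf16", "int8", "int4", "fp8"},  # Ada, e.g. RTX 4090
    (9, 0): {"fp16", "bf16", "int8", "fp8"},          # Hopper, e.g. H100
    (12, 0): {"fp16", "bf16", "int8", "fp8", "fp4"},  # Blackwell, e.g. RTX 5090
}


def runs_natively(fmt: str, capability: tuple[int, int]) -> bool:
    """True if the format has tensor-core support on that architecture."""
    return fmt in NATIVE_FORMATS.get(capability, set())


if __name__ == "__main__":
    for name, cap in [("RTX 3090", (8, 6)), ("RTX 5090", (12, 0))]:
        for fmt in ("fp8", "fp4"):
            status = "native" if runs_natively(fmt, cap) else "fallback"
            print(f"{name}: {fmt} -> {status}")
```

On a real machine, the capability pair can be read with `torch.cuda.get_device_capability()`. Formats missing from a card's set have to be unpacked to a supported type inside the kernel, which is the fallback path discussed above.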
Intelligent-Form6624@reddit
gguf wen?
appakaradi@reddit
why does it take so long for nvidia to produce a quantized version?
chocofoxy@reddit
ikr i have been using the redhat nvfp4 for a month already
siegevjorn@reddit
Any post training done after quantization? Then the benchmark numbers are meaningless.
jonydevidson@reddit
G G U F W H E N
LinkSea8324@reddit
Managed to get it (another nvfp4, not modelopt) running on RTX 5090 and vllm, working much faster than AWQ
Good luck with sglang lol
crossoverXYZ@reddit
interesting but I wish they'd compare against standard Q4 quants. their method skips attention layers so the actual compression ratio isn't as impressive as it looks
oxygen_addiction@reddit
Use the appropriate flair please.
piscoster@reddit
On which hardware are you running this?
ortegaalfredo@reddit
NVFP4 is great, but Intel AutoRound INT4 is also great and faster. Some AutoRound INT8 quants even have higher quality than FP8.
Hefty_Suggestion6608@reddit
How different is this to unsloth/Qwen3.6-35B-A3B in Q8_0 ?
HavenTerminal_com@reddit
3.06x is on the MoE weights only. Attention layers stay fp16, so the actual VRAM footprint is messier than the headline.
xjE4644Eyc@reddit
Is MTP still active on these?
pmttyji@reddit (OP)
I think so. Just searched model card for MTP & found 1 result with vllm serve command. So it's there.
swagonflyyyy@reddit
Benchmarks are impressive, nearly identical performance to bf16.
uti24@reddit
It would be nice to also see a comparison with Q4 without post training