nvidia/Qwen3.6-35B-A3B-NVFP4 · Hugging Face

Posted by pmttyji@reddit | LocalLLaMA

The NVIDIA Qwen3.6-35B-A3B-NVFP4 model is a quantized version of Alibaba's Qwen3.6-35B-A3B, an auto-regressive language model that uses an optimized transformer architecture. It was quantized with Model Optimizer.

Post Training Quantization

This model was obtained by quantizing the weights and activations of Qwen3.6-35B-A3B to the NVFP4 data type, and it is ready for inference with vLLM. Only the linear operators within transformer blocks in MoE are quantized; everything else keeps its original precision. For the quantized parameters, this lowers the bit width from 16 to 4, reducing disk size and GPU memory requirements by approximately 3.06x overall.
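
The 3.06x figure is lower than the 4x that a pure 16-to-4-bit conversion would suggest. The sketch below works through one plausible explanation. It assumes that NVFP4 stores one 8-bit scale per block of 16 values (about 4.5 bits per quantized parameter) and that every unquantized parameter stays at 16 bits. The function names and the derived fraction are illustrative and do not come from the model card.

```python
BF16_BITS = 16.0
# Assumption: NVFP4 = 4-bit values plus one 8-bit scale shared by 16 values.
NVFP4_BITS = 4.0 + 8.0 / 16.0  # 4.5 bits per quantized parameter


def effective_bits(compression_ratio: float) -> float:
    """Average bits per parameter implied by an overall compression ratio."""
    return BF16_BITS / compression_ratio


def quantized_fraction(compression_ratio: float) -> float:
    """Fraction of parameters that would need to be NVFP4 to reach the ratio,
    if all remaining parameters stay in BF16 (an assumption)."""
    avg = effective_bits(compression_ratio)
    return (BF16_BITS - avg) / (BF16_BITS - NVFP4_BITS)


ratio = 3.06  # reduction reported in the model card
avg_bits = effective_bits(ratio)
frac = quantized_fraction(ratio)

print(f"average bits per parameter: {avg_bits:.2f}")
print(f"implied quantized fraction: {frac:.1%}")
```

Under these assumptions the model averages a little over 5 bits per parameter, which is consistent with most, but not all, of the parameters sitting in the quantized linear layers.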

Evaluation

The accuracy benchmark results are presented in the table below:

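| Precision | MMLU Pro | GPQA Diamond | τ²-Bench Telecom | SciCode | AIME 2025 | AA-LCR | IFBench | MMMU PRO |
|-----------|----------|--------------|------------------|---------|-----------|--------|---------|----------|
| BF16      | 85.6     | 84.9         | 95.5             | 40.8    | 89.2      | 62.0   | 62.3    | 74.1     |
| NVFP4     | 85.0     | 84.8         | 94.7             | 40.6    | 88.8      | 62.0   | 62.8    | 74.5     |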
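
To make the accuracy cost of quantization easier to read, the scores above can be turned into per-benchmark differences. This is a small sketch that uses only the numbers from the table; the variable names are illustrative.

```python
benchmarks = ["MMLU Pro", "GPQA Diamond", "τ²-Bench Telecom", "SciCode",
              "AIME 2025", "AA-LCR", "IFBench", "MMMU PRO"]
bf16 = [85.6, 84.9, 95.5, 40.8, 89.2, 62.0, 62.3, 74.1]
nvfp4 = [85.0, 84.8, 94.7, 40.6, 88.8, 62.0, 62.8, 74.5]

# NVFP4 minus BF16, rounded to the one decimal place the table reports.
deltas = {name: round(q - b, 1) for name, b, q in zip(benchmarks, bf16, nvfp4)}

for name, delta in deltas.items():
    print(f"{name:<18} {delta:+.1f}")

worst = min(deltas, key=deltas.get)          # largest drop
mean_delta = sum(deltas.values()) / len(deltas)
print(f"largest drop: {worst} ({deltas[worst]:+.1f})")
print(f"mean change:  {mean_delta:+.2f}")
```

On these numbers the largest drop is 0.8 points (τ²-Bench Telecom), AA-LCR is unchanged, and IFBench and MMMU PRO score slightly higher under NVFP4. Differences this small may be within run-to-run variation.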