nvidia/Gemma-4-26B-A4B-NVFP4
Posted by reto-wyss@reddit | LocalLLaMA | View on Reddit | 30 comments
- Can confirm it works on a 5090: with 80% memory allocation (of 32 GB) I got around 50k context (launch sketch below the table).
- It's 18.8 GB
| Benchmark | Baseline (Full Precision) | NVFP4 |
|---|---|---|
| GPQA Diamond | 80.30% | 79.90% |
| AIME 2025 | 88.95% | 90.00% |
| MMLU Pro | 85.00% | 84.80% |
| LiveCodeBench (pass@1) | 80.50% | 79.80% |
| IFBench | 77.77% | 78.10% |
| IFEval | 96.60% | 96.40% |
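For anyone who wants to reproduce this, a minimal sketch with vLLM's Python API; the memory fraction and context length mirror the numbers above, everything else is assumed defaults:

```python
# Minimal sketch (untested) of the setup described above: NVFP4 checkpoint on
# one 5090 with an 80% memory cap. Parameter names follow vLLM's LLM() API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Gemma-4-26B-A4B-NVFP4",
    gpu_memory_utilization=0.8,  # 80% of the card's 32 GB
    max_model_len=50_000,        # roughly the context that fit in practice
)
out = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```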
tylerrobb@reddit
Are you running the model with the native FP16 KV cache? That might be why you're only getting 50k context on a 5090.
Use the NVFP4 KV cache instead to hit a much larger context in vLLM:
`--kv-cache-dtype fp4`
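Something like this through the Python API, assuming "fp4" is accepted the same way vLLM's documented kv_cache_dtype values are:

```python
# Same load as the sketch in the OP, plus a quantized KV cache; "fp4" as a
# kv_cache_dtype value is taken from this comment, not from vLLM's docs.
from vllm import LLM

llm = LLM(
    model="nvidia/Gemma-4-26B-A4B-NVFP4",
    kv_cache_dtype="fp4",        # 4-bit KV entries -> ~4x the context of FP16
    gpu_memory_utilization=0.8,
    max_model_len=100_000,       # illustrative; actual fit depends on overhead
)
```

MaruluVR@reddit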
It could be because I work a lot with Japanese, but I had horrible results every time I didn't use a full-precision cache. I run Gemma 4 26B at Q8_K_XL.
SexyAlienHotTubWater@reddit
Quantizing the cache straight destroys LLM performance. You can't quantize it like you can quantize weights.
tylerrobb@reddit
Are you on Blackwell by chance?
The 5090 hardware can do the math directly in FP4 instead of falling back to FP16. It handles some of the activation quirks much better, so stability is better than on 40/30-series cards.
SexyAlienHotTubWater@reddit
I don't mean efficiency, I mean intelligence. KV doesn't truncate well because the information is sparse - doesn't matter what compute unit you're using. (This is the intuition behind why TurboQuant works)
tylerrobb@reddit
Yeah, Japanese, Korean, and Arabic are harder because each token has a lot more semantic weight. English is pretty basic in comparison.
Code is even more structured and less susceptible to cache quantization.
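Easy to see with a tokenizer; the repo id here is an assumption on my part:

```python
# Compare token counts for the same sentence in English and Japanese
# (tokenizer repo id is assumed).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-4-26b-a4b-it")
en = "The quick brown fox jumps over the lazy dog."
ja = "素早い茶色の狐がのろまな犬を飛び越える。"
print("EN tokens:", len(tok.encode(en)))
print("JA tokens:", len(tok.encode(ja)))  # often noticeably higher per sentence
```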
InterestTracker9000@reddit
I found out about the Japanese issue the hard way. I happen to need a bunch of random languages to test my software with, and Japanese was one I picked because I wanted to test characters versus letters. With 16k context I could handle 45 pages of English in our test case without any issues, but even 12 pages of Japanese would overflow because of how our OCR pipeline converted everything into tokens.
However, this worked out great for us: we caught some errors on our end that would never have surfaced in normal use, and we learned quite a bit about how to handle this better. Gotta learn the hard way in some instances.
reto-wyss@reddit (OP)
Why stop there? I prefer a 0-bit KV cache, it fits infinite context :)
Glittering-Call8746@reddit
Do you have the Docker image you're using? I have problems with vLLM v0.
PWani_22@reddit
Seen any GGUF versions of this?
qfox337@reddit
I don't get why this is interesting ... Is it faster? (I didn't see such benchmarks.) Or simply better quality than most 4-bit quantization?
FullOf_Bad_Ideas@reddit
If you use NVFP4 compute for weights and activations, and therefore Blackwell's hardware-specific acceleration paths, your ceiling is no longer BF16 TFLOPS but the FP4 ceiling. A B200 has 2,200 BF16 TFLOPS but 9,000 FP4 TFLOPS.
So you can serve roughly 4x the throughput, assuming you were bottlenecked by compute.
Hardware providers like DeepInfra can pick up that quant and serve it cheaply.
The community running single-batch inference on a single RTX 6000 Pro isn't the main target of NVFP4 quants.
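The back-of-envelope version of that claim:

```python
# Compute-bound speedup estimate from the B200 numbers quoted above.
bf16_tflops = 2200
fp4_tflops = 9000
print(f"~{fp4_tflops / bf16_tflops:.1f}x")  # ~4.1x, hence "4x throughput"
```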
qfox337@reddit
Ah cool, thanks for the clarification!! It's a bit odd that the key motivation doesn't have benchmarks clearly listed on the model page ...
TBH I could find some batch inference uses myself if that were cheap/fast. I assume the RAM use might be a bit of a squeeze on consumer GPUs though.
SomeoneSimple@reddit
I can't speak to actual speed gains since I don't have Blackwell, but running a standard Q4_K GGUF (e.g. from unsloth) in llama.cpp dequantizes the Q4_K weights back to FP16, so the only performance gain there comes from the smaller memory footprint, while the NVFP4 model runs in mixed precision with FP8 and INT8 tensors.
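You can verify the mixed-precision layout yourself by inspecting a downloaded shard; the filename below is a placeholder:

```python
# List a few tensors and their dtypes from a checkpoint shard (filename is
# hypothetical; use whatever shard you actually downloaded).
from safetensors import safe_open

with safe_open("model-00001-of-00004.safetensors", framework="pt") as f:
    for name in list(f.keys())[:8]:
        t = f.get_tensor(name)
        print(name, tuple(t.shape), t.dtype)
```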
Locke_Kincaid@reddit
Google updated the chat template on the main repo just 3 days ago and NVIDIA's repo is still using the old one, so grab the new one from Google!
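If you don't want to re-download the whole model, you can probably just copy the template over; the Google repo id here is my assumption:

```python
# Hypothetical sketch: copy the refreshed chat template from Google's repo
# (repo id assumed) onto the NVIDIA quant's tokenizer, then save locally.
from transformers import AutoTokenizer

src = AutoTokenizer.from_pretrained("google/gemma-4-26b-a4b-it")
dst = AutoTokenizer.from_pretrained("nvidia/Gemma-4-26B-A4B-NVFP4")
dst.chat_template = src.chat_template
dst.save_pretrained("./gemma-4-nvfp4-new-template")
```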
beijinghouse@reddit
"The NVIDIA Gemma 4 26B IT NVFP4 model is quantized with NVIDIA Model Optimizer."
So there's QAT happening, not just blind PTQ. That would explain the performance not dropping.
FullOf_Bad_Ideas@reddit
They also say there's no QAT here.
computehungry@reddit
They're never really clear about what they actually do with that package. I wish they'd give us a hint; otherwise there's no reason to believe it's better than Q4_K quants.
rerri@reddit
I don't think this quant went through any QAT, just PTQ with the calibration datasets mentioned in the model card.
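For context, calibration-only PTQ with Model Optimizer roughly follows this shape; the config name and quantize() signature are assumptions based on modelopt's documented pattern, not anything from this model card:

```python
# Rough shape of calibration-based PTQ (no QAT): run calibration batches
# through the model so quantizer ranges are observed, then quantize to NVFP4.
import modelopt.torch.quantization as mtq

def quantize_nvfp4(model, calib_batches):
    def forward_loop(m):
        for batch in calib_batches:  # a few hundred batches is typical
            m(**batch)
    return mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```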
Its-all-redditive@reddit
The evaluation results seem odd. NVFP4 outscoring full precision? These must not be average scores over lots of runs.
Noxusequal@reddit
This also comes back to a problem with how experiments are done in the LLM space.
If you don't make damn sure that you control every aspect of the pipeline (in this case seed, temperature, and batching), you get fluctuations. If you don't set model seeds, run at anything but temp 0, and only do one pass over a benchmark, you have an unknown level of uncertainty about how much the score would differ between runs.
Even if you run with a controlled seed, if the batch size is not 1 (or with some specific vLLM kernels) the results still fluctuate by an unknown amount (from personal testing 2-3%, but it depends on the exact benchmark), because entries in a batch do in fact affect each other. Which also means API models are never fully controllable, btw.
So when we see benchmarks like this, if they aren't run X times with errors calculated, you never know whether you're looking at outliers. But benchmarks take really long, or with proprietary models get really expensive, so it's not always feasible.
If the scores across multiple benchmarks mostly fluctuate around each other, it just means these quants seem to perform pretty similarly, and no big interpretation should be given to ±2%.
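Concretely, the fix is cheap to state even if it's expensive to run: repeat the benchmark and report an error bar, e.g.:

```python
# Report mean ± std over repeated benchmark runs instead of a single score.
import statistics

def summarize(scores):
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return f"{mean:.2f} ± {std:.2f} (n={len(scores)})"

print(summarize([79.9, 81.2, 78.8, 80.5, 79.5]))  # made-up runs -> 79.98 ± 0.92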
lit1337@reddit
Totally expected with MoE architectures, actually. I ran 90 individual ablation experiments on this model for my Cerebellum quant series: imatrix-guided lower precision acts as regularization on the expert routing. With 128 experts per layer, the router over-commits to dominant pathways at full precision; slightly noisier weights from quantization help distribute tokens more evenly across experts, the same principle as dropout preventing co-adaptation during training. 3 of 5 tensor groups (attn_k, ffn_up, expert_gate_up) actively improve perplexity at Q2_K vs their BF16 values. My mixed-precision quant at 11 GB hits 19,826 WikiText PPL vs ~27,000 for BF16. Not measurement noise: a consistent, reproducible regularization effect specific to sparse MoE models. Dense models don't show this.
szansky@reddit
How about vs Qwen 3.6 27B?
djm07231@reddit
NVFP4 support seems very weird for non-datacenter GPUs.
I've heard that GB200 and consumer products like the A6000 Blackwell and RTX 50 series have slightly different compute capabilities and aren't entirely compatible with one another.
reto-wyss@reddit (OP)
It's not the same. RTX Pro 6000 is sm_120 and Spark is sm_121; neither is the same as the DC products. They can do NVFP4 but need adjustments in the implementation. VLLM_CUTLASS has gotten a lot better over the last three or so months, and it works fine with the Pro 6000 in many cases.
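Quick way to check which variant you're on:

```python
# Print the current GPU's SM version: (12, 0) -> sm_120 (RTX Pro 6000 /
# 50 series), (12, 1) -> sm_121 (Spark), per the naming above.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"sm_{major}{minor}")
```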
annodomini@reddit
Anyone tried the petit kernels to run NVFP4 on ROCm?
These NVFP4 results look really good; wondering how well they'll run on AMD without native support.
ubrtnk@reddit
lit1337@reddit
https://huggingface.co/deucebucket/Gemma-4-26B-A4B-it-Cerebellum-v3-GGUF
Former-Tangerine-723@reddit
Sorry mate, but this has nothing in common with NVFP4.
ubrtnk@reddit
https://i.imgflip.com/aqmdie.jpg