nvidia/Gemma-4-26B-A4B-NVFP4
Posted by reto-wyss@reddit | LocalLLaMA | View on Reddit | 30 comments
- Can confirm it works on a 5090: with 80% memory allocation (of 32 GB) I got around 50k context (launch sketch below the table).
- It's 18.8 GB
| Benchmark | Baseline (Full Precision) | NVFP4 |
|---|---|---|
| GPQA Diamond | 80.30% | 79.90% |
| AIME 2025 | 88.95% | 90.00% |
| MMLU Pro | 85.00% | 84.80% |
| LiveCodeBench (pass@1) | 80.50% | 79.80% |
| IFBench | 77.77% | 78.10% |
| IFEval | 96.60% | 96.40% |
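For anyone who wants to reproduce this, a minimal sketch with vLLM's Python API; the memory fraction and context length mirror the numbers above, everything else is assumed defaults:

```python
# Minimal sketch (untested) of the setup described above: NVFP4 checkpoint on
# one 5090 with an 80% memory cap. Parameter names follow vLLM's LLM() API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Gemma-4-26B-A4B-NVFP4",
    gpu_memory_utilization=0.8,  # 80% of the card's 32 GB
    max_model_len=50_000,        # roughly the context that fit in practice
)
out = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```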
tylerrobb@reddit
Are you running the model with the native FP16 KV cache? That might be why you're only getting 50k context on a 5090.
Use the NVFP4 KV cache instead to hit a much larger context in vLLM:
`--kv-cache-dtype fp4`
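Something like this through the Python API, assuming "fp4" is accepted the same way vLLM's documented kv_cache_dtype values are:

```python
# Same load as the sketch in the OP, plus a quantized KV cache; "fp4" as a
# kv_cache_dtype value is taken from this comment, not from vLLM's docs.
from vllm import LLM

llm = LLM(
    model="nvidia/Gemma-4-26B-A4B-NVFP4",
    kv_cache_dtype="fp4",        # 4-bit KV entries -> ~4x the context of FP16
    gpu_memory_utilization=0.8,
    max_model_len=100_000,       # illustrative; actual fit depends on overhead
)
```

MaruluVR@reddit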
It could be because I work a lot with Japanese, but I had horrible results every time I didn't use a full-precision cache. I run Gemma 4 26B at Q8_K_XL.
SexyAlienHotTubWater@reddit
Quantizing the cache straight destroys LLM performance. You can't quantize it like you can quantize weights.
tylerrobb@reddit
Are you on Blackwell by chance?
The 5090 hardware can do the math directly in FP4 instead of falling back to FP16. It handles some of the activation quirks much better, so stability is better than on 40/30-series cards.
SexyAlienHotTubWater@reddit
I don't mean efficiency, I mean intelligence. KV doesn't truncate well because the information is sparse - doesn't matter what compute unit you're using. (This is the intuition behind why TurboQuant works)
tylerrobb@reddit
Yeah, Japanese, Korean, and Arabic are harder because each token has a lot more semantic weight. English is pretty basic in comparison.
Code is even more structured and less susceptible to cache quantization.
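Easy to see with a tokenizer; the repo id here is an assumption on my part:

```python
# Compare token counts for the same sentence in English and Japanese
# (tokenizer repo id is assumed).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-4-26b-a4b-it")
en = "The quick brown fox jumps over the lazy dog."
ja = "素早い茶色の狐がのろまな犬を飛び越える。"
print("EN tokens:", len(tok.encode(en)))
print("JA tokens:", len(tok.encode(ja)))  # often noticeably higher per sentence
```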
InterestTracker9000@reddit
I found out about the Japanese issue the hard way. I happen to need a bunch of random languages to test my software with, and Japanese was one I picked because I wanted to test characters versus letters. With 16k context I could handle 45 pages of English in our test case without any issues, but even 12 pages of Japanese would overflow because of how our OCR pipeline converted everything into tokens.
However, this worked out great for us: we caught some errors on our end that would never have surfaced in normal use, and we learned quite a bit about how to handle this better. Gotta learn the hard way in some instances.
reto-wyss@reddit (OP)
Why stop there? I prefer a 0-bit KV cache, it fits infinite context :)
Glittering-Call8746@reddit
Do you have the Docker image you're using? I have problems with vLLM v0.
PWani_22@reddit
Seen any GGUF versions of this?
qfox337@reddit
I don't get why this is interesting ... Is it faster? (I didn't see such benchmarks.) Or simply better quality than most 4-bit quantization?
FullOf_Bad_Ideas@reddit
If you use NVFP4 compute for weights and activations, and therefore Blackwell's hardware-specific acceleration paths, your ceiling is no longer BF16 TFLOPS but the FP4 ceiling. A B200 has 2,200 BF16 TFLOPS but 9,000 FP4 TFLOPS.
So you can serve roughly 4x the throughput, assuming you were bottlenecked by compute.
Hardware providers like DeepInfra can pick up that quant and serve it cheaply.
The community running single-batch inference on a single RTX 6000 Pro isn't the main target of NVFP4 quants.
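The back-of-envelope version of that claim:

```python
# Compute-bound speedup estimate from the B200 numbers quoted above.
bf16_tflops = 2200
fp4_tflops = 9000
print(f"~{fp4_tflops / bf16_tflops:.1f}x")  # ~4.1x, hence "4x throughput"
```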
qfox337@reddit
Ah cool, thanks for the clarification!! It's a bit odd that the key motivation doesn't have benchmarks clearly listed on the model page ...
TBH I could find some batch inference uses myself if that were cheap/fast. I assume the RAM use might be a bit of a squeeze on consumer GPUs though.
SomeoneSimple@reddit
I can't speak to actual speed gains since I don't have Blackwell, but running a standard Q4_K GGUF (e.g. from unsloth) in llama.cpp dequantizes the Q4_K weights back to FP16, so the only performance gain there comes from the smaller memory footprint, while the NVFP4 model runs in mixed precision with FP8 and INT8 tensors.
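You can verify the mixed-precision layout yourself by inspecting a downloaded shard; the filename below is a placeholder:

```python
# List a few tensors and their dtypes from a checkpoint shard (filename is
# hypothetical; use whatever shard you actually downloaded).
from safetensors import safe_open

with safe_open("model-00001-of-00004.safetensors", framework="pt") as f:
    for name in list(f.keys())[:8]:
        t = f.get_tensor(name)
        print(name, tuple(t.shape), t.dtype)
```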
Locke_Kincaid@reddit
Google updated the chat template on the main repo just 3 days ago and NVIDIA's repo is still using the old one, so grab the new one from Google!
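If you don't want to re-download the whole model, you can probably just copy the template over; the Google repo id here is my assumption:

```python
# Hypothetical sketch: copy the refreshed chat template from Google's repo
# (repo id assumed) onto the NVIDIA quant's tokenizer, then save locally.
from transformers import AutoTokenizer

src = AutoTokenizer.from_pretrained("google/gemma-4-26b-a4b-it")
dst = AutoTokenizer.from_pretrained("nvidia/Gemma-4-26B-A4B-NVFP4")
dst.chat_template = src.chat_template
dst.save_pretrained("./gemma-4-nvfp4-new-template")
```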
beijinghouse@reddit
"The NVIDIA Gemma 4 26B IT NVFP4 model is quantized with NVIDIA Model Optimizer."
So there's QAT happening, not just blind PTQ. That would explain the performance not dropping.
FullOf_Bad_Ideas@reddit
They also say there's no QAT here.
computehungry@reddit
They're never really clear about what they actually do with that package. I wish they'd give us a hint; otherwise there's no reason to believe it's better than Q4_K quants.
rerri@reddit
I don't think this quant went through any QAT, just PTQ with the calibration datasets mentioned in the model card.
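For context, calibration-only PTQ with Model Optimizer roughly follows this shape; the config name and quantize() signature are assumptions based on modelopt's documented pattern, not anything from this model card:

```python
# Rough shape of calibration-based PTQ (no QAT): run calibration batches
# through the model so quantizer ranges are observed, then quantize to NVFP4.
import modelopt.torch.quantization as mtq

def quantize_nvfp4(model, calib_batches):
    def forward_loop(m):
        for batch in calib_batches:  # a few hundred batches is typical
            m(**batch)
    return mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```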
Its-all-redditive@reddit
The evaluation results seem odd. NVFP4 outscoring full precision? These must not be average scores over lots of runs.
Noxusequal@reddit
This also comes back to a problem with how experiments are done in the LLM space.
If you don't make damn sure that you control every aspect of the pipeline (in this case seed, temperature, and batching), you get fluctuations. If you don't set model seeds, run at anything but temp 0, and only do one pass over a benchmark, you have an unknown level of uncertainty about how much the score would differ between runs.
Even if you run with a controlled seed, if the batch size is not 1 (or with some specific vLLM kernels) the results still fluctuate by an unknown amount (from personal testing 2-3%, but it depends on the exact benchmark), because entries in a batch do in fact affect each other. Which also means API models are never fully controllable, btw.
So when we see benchmarks like this, if they aren't run X times with errors calculated, you never know whether you're looking at outliers. But benchmarks take really long, or with proprietary models get really expensive, so it's not always feasible.
If the scores across multiple benchmarks mostly fluctuate around each other, it just means these quants seem to perform pretty similarly, and no big interpretation should be given to ±2%.
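Concretely, the fix is cheap to state even if it's expensive to run: repeat the benchmark and report an error bar, e.g.:

```python
# Report mean ± std over repeated benchmark runs instead of a single score.
import statistics

def summarize(scores):
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return f"{mean:.2f} ± {std:.2f} (n={len(scores)})"

print(summarize([79.9, 81.2, 78.8, 80.5, 79.5]))  # made-up runs -> 79.98 ± 0.92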
lit1337@reddit
Totally expected with MoE architectures, actually. I ran 90 individual ablation experiments on this model for my Cerebellum quant series: imatrix-guided lower precision acts as regularization on the expert routing. With 128 experts per layer, the router over-commits to dominant pathways at full precision; slightly noisier weights from quantization help distribute tokens more evenly across experts, the same principle as dropout preventing co-adaptation during training. 3 of 5 tensor groups (attn_k, ffn_up, expert_gate_up) actively improve perplexity at Q2_K vs their BF16 values. My mixed-precision quant at 11 GB hits 19,826 WikiText PPL vs ~27,000 for BF16. Not measurement noise: a consistent, reproducible regularization effect specific to sparse MoE models. Dense models don't show this.
szansky@reddit
How about vs Qwen 3.6 27B?
djm07231@reddit
NVFP4 support seems very weird for non-datacenter GPUs.
I've heard that GB200 and consumer products like the A6000 Blackwell and RTX 50 series have slightly different compute capabilities and aren't entirely compatible with one another.
reto-wyss@reddit (OP)
It's not the same. RTX Pro 6000 is sm_120 and Spark is sm_121; neither is the same as the DC products. They can do NVFP4 but need adjustments in the implementation. VLLM_CUTLASS has gotten a lot better over the last three or so months, and it works fine with the Pro 6000 in many cases.
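Quick way to check which variant you're on:

```python
# Print the current GPU's SM version: (12, 0) -> sm_120 (RTX Pro 6000 /
# 50 series), (12, 1) -> sm_121 (Spark), per the naming above.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"sm_{major}{minor}")
```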
annodomini@reddit
Anyone tried the petit kernels to run NVFP4 on ROCm?
These NVFP4 results look really good; wondering how well they'll run on AMD without native support.
ubrtnk@reddit
lit1337@reddit
https://huggingface.co/deucebucket/Gemma-4-26B-A4B-it-Cerebellum-v3-GGUF
Former-Tangerine-723@reddit
Sorry mate, but this has nothing in common with NVFP4.
ubrtnk@reddit
https://i.imgflip.com/aqmdie.jpg