llama.cpp's Preliminary SM120 Native NVFP4 MMQ Is Merged
Posted by ggonavyy@reddit | LocalLLaMA | 38 comments
https://github.com/ggml-org/llama.cpp/pull/22196
And somehow we already got some GGUFs for it!
https://huggingface.co/CISCai/gemma-4-31B-it-NVFP4-turbo-GGUF
https://huggingface.co/stevelikesrhino/gemma-4-31B-it-nvfp4-GGUF
(the one below is from the PR author himself)
https://huggingface.co/michaelw9999/Nemotron-Cascade-2-30B-A3B-NVFP4-GGUF
AlwaysLateToThaParty@reddit
Does anyone know if the mxfp4 quantization would be affected by this?
Glittering-Call8746@reddit
Can this work with MoE with CPU offloading? (Not much info on NVFP4 inference, so...)
ggonavyy@reddit (OP)
Yes, this PR just makes prefill a LOT faster for 50-series users running NVFP4 models
CalligrapherFar7833@reddit
Not only the 50 series but also RTX 6000 / DGX Spark etc.
ggonavyy@reddit (OP)
Huh, didn't know the DGX is sm120 NVFP4 instead of sm100... You'd think if they nerfed the performance that much they would give it proper NVFP4 support
PhilippeEiffel@reddit
In fact GB10 (DGX Spark) is sm121.
CalligrapherFar7833@reddit
Fake Blackwell is sm120, which is anything consumer-facing: RTX 5xxx, RTX 6/5/4/2xxx Pro, DGX Spark
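If you're not sure which bucket a card falls into, here's a quick check (assuming a PyTorch build with CUDA is installed):

```python
# Print the compute capability the driver reports, e.g. (12, 0) = sm120, (12, 1) = sm121.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: sm{major}{minor}")
```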
Glittering-Call8746@reddit
Gimped Blackwell rofl, fake Blackwell is those GPUs with no chips lol..
CalligrapherFar7833@reddit
No gpu chips are the scam blackwell :D
Xp_12@reddit
I've been calling it brownwell.
Dany0@reddit
On a 5090 with a q3.6 27B quant in llama.cpp I'll get like 2500-5000-8000 tok/s prefill depending on context.
In vLLM with NVFP4 I'd get 9-11k, with the median closer to 11k, and it doesn't slow down much with context.
iMrParker@reddit
Curious about mixed-arch setups. Let's say a Blackwell and an Ampere card are used together, do both cards fall back to casting?
BigPoppaK78@reddit
It works, but there's zero benefit at the moment. With my 5070 Ti, I get the same speed for prompt processing at 100k: 2400 tk/s. But, token generation takes a huge hit from 65 to 30 tk/s. llama.cpp:b8967 on Fedora 43. https://i.imgur.com/VRFbPLo.png
StorageHungry8380@reddit
I'm not seeing that on my 5090, Windows 11 CUDA 13.1. However the model in both variants is larger than 16GB, so presumably you're using CPU for a few layers and that could explain it? I didn't bother downloading the Q4 variant, as I had the Q5, but here are my numbers:
BigPoppaK78@reddit
Well, yeah. He was asking about CPU offloading with the MoE model, so that's exactly what I tested.
StorageHungry8380@reddit
Alrighty, definitely time for me to go to bed. I assume NVFP4 is less CPU friendly than the standard Qn quantizations, so makes sense it's slower.
BigPoppaK78@reddit
Yeah, it was always going to have an overhead penalty for the switch. But it'd be more tolerable if it were something like a 6 or 7% hit to gain 30% on prompt processing. I'm sure things will improve over the next few weeks as it all gets optimized. Looking at the PR comments, they already have the next few steps in mind.
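Back-of-the-envelope with my numbers above, just to show why the generation hit dominates (the 2,000-token reply length is a made-up example):

```python
# Rough end-to-end latency with the 5070 Ti numbers above (100k-token prompt),
# assuming a hypothetical 2,000-token reply.
prompt_tokens, reply_tokens = 100_000, 2_000
pp_rate = 2_400                                    # tok/s prompt processing, same in both runs
for name, tg_rate in [("Q4", 65), ("NVFP4", 30)]:  # tok/s token generation
    total = prompt_tokens / pp_rate + reply_tokens / tg_rate
    print(f"{name}: ~{total:.0f} s total")         # Q4 ~72 s, NVFP4 ~108 s
```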
nufeen@reddit
Nice. Time to convert https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 into gguf
grumd@reddit
Like this? https://huggingface.co/jdziat/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4-GGUF
nufeen@reddit
"Requires Salamander (llama.cpp fork with NVFP4 and NemotronH LatentMoE support)." I don't know if this is compatible with the main llama.cpp
grumd@reddit
Yep you're right, the tensor shapes are not what's expected, we need new GGUFs
grumd@reddit
Downloading, we'll see
Bulky-Priority6824@reddit
NVFP4 speaks the GPU's native language. The Blackwell tensor cores have FP4 math built directly into the silicon, so the model weights go in as-is and the multiplication happens without any translation step. Less overhead, faster math, same bit width (rough sketch of the block layout at the end of this comment).
That being said, I benched it vs 35B-UD_Q4_XL using dual 5060 Ti 16GBs split by layer in llama.cpp. Results were identical.
Hopefully llama.cpp catches up to this and unlocks some deeper potential, as well as an mmproj, soon.
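Rough sketch of what I mean by "native language", as I understand the format (toy code, not the exact llama.cpp memory layout): each block of 16 weights is stored as 4-bit E2M1 values plus one small per-block scale, which is the shape the Blackwell FP4 tensor cores consume directly.

```python
# Toy sketch of NVFP4-style block quantization, as I understand it; not the
# exact llama.cpp layout. 16 weights per block, each a 4-bit E2M1 value,
# plus one per-block scale (stored as FP8 in the real format).
import numpy as np

E2M1_MAGS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable FP4 magnitudes

def quantize_block(w):                                # w: 16 float weights
    scale = float(np.abs(w).max()) / 6.0 or 1.0       # map the largest weight onto +/-6
    idx = np.abs(np.abs(w) / scale - E2M1_MAGS[:, None]).argmin(axis=0)
    return scale, np.sign(w), E2M1_MAGS[idx]          # 1 scale + 16 four-bit codes

def dequantize_block(scale, signs, mags):
    return scale * signs * mags

w = np.random.randn(16).astype(np.float32)
print(np.abs(w - dequantize_block(*quantize_block(w))).max())  # worst-case rounding error
```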
OkAwareness8446@reddit
Benched what vs 35B-UD_Q4_XL? I converted the RedHatAI and some other NVFP4 models to GGUF and ran them, and I get a very noticeable difference from the Q4 models
ProfessionalSpend589@reddit
nvfp4 is what the GPUs crave
Long_comment_san@reddit
thirsty gpu
ggonavyy@reddit (OP)
You have to use it with NVFP4 GGUFs; Q4 is still going through the regular MMQ and MMA paths
Bulky-Priority6824@reddit
Yep
georgeApuiu@reddit
sm121 when?
RedAdo2020@reddit
Okay, I'm not super technical with this, but wouldn't Q8 still be better than NVFP4? Serious question.
Baldur-Norddahl@reddit
NVFP4 is 4-bit. Yes, it is worse quality than Q8. That is not the point of this format. The idea is that it is faster, much faster, than other 4-bit formats. But only if you have a GPU with support, which is currently only the Nvidia 50xx and 6000 Pro.
MXFP4 is the same, just slightly worse quality.
But a few models were trained directly in these formats using QAT (quantization-aware training). An example is OpenAI GPT-OSS 20B/120B. In that case MXFP4 is the original quality and much better than Q8.
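Rough size math, if that helps; the block sizes are from memory, so treat the figures as ballpark rather than exact llama.cpp layouts. The finer-grained per-16 scales are also where NVFP4's small quality edge over MXFP4 comes from, at a tiny size cost.

```python
# Approximate bits per weight once the per-block scale overhead is counted in.
# (value bits, block size, scale bits) -- ballpark figures, not exact layouts.
formats = {
    "Q8_0":  (8, 32, 16),  # 8-bit values, fp16 scale per 32 weights
    "NVFP4": (4, 16, 8),   # 4-bit values, fp8 scale per 16 weights
    "MXFP4": (4, 32, 8),   # 4-bit values, shared 8-bit exponent per 32 weights
}
for name, (bits, block, scale_bits) in formats.items():
    print(f"{name}: {bits + scale_bits / block:.2f} bits/weight")
# Q8_0: 8.50, NVFP4: 4.50, MXFP4: 4.25
```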
RedAdo2020@reddit
I think I get it. So if I was going to run 4-bit anyway, with my Blackwell cards, then NVFP4 would be better quality and speed than Q4
ggonavyy@reddit (OP)
If you've got a 5090 and are running one of those GGUFs above, your prefill for those dense 30B-class models should now be in the 4000-6000 range. Generation is still capped by other overheads like VRAM bandwidth, but going from a 5000-token system prompt for a coding agent to actual generation should feel instant now. Cache invalidation should feel like less of a problem as well; prefilling from 50k to 70k is now, like, less than 10 seconds?
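Quick sanity check on that last number, using the prefill range quoted above:

```python
# Rough math behind "50k to 70k in under 10 seconds": 20k new tokens to prefill.
new_tokens = 70_000 - 50_000
for rate in (4_000, 6_000):               # tok/s prefill range quoted above
    print(f"{rate} tok/s -> {new_tokens / rate:.1f} s")
# 4000 tok/s -> 5.0 s, 6000 tok/s -> 3.3 s
```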
quantier@reddit
The reason we have GGUFs for it is because of LM Studio… we should now get a lot more 🎉
trueimage@reddit
Does the strix halo benefit from this?
Mister__Mediocre@reddit
Could someone explain what this does and how I can use it? I have a 5060ti and use MoE models with only attention on the GPU, all experts on the CPU.
ggonavyy@reddit (OP)
Honestly I haven't tested MoE with lots of CPU offload, but first you need an NVFP4 GGUF
ortegaalfredo@reddit
You have to be insane to make the investment in a Blackwell GPU and then use llama.cpp, which is basically single-thread inference. But I guess this will help those Nvidia Spark chips a lot.