llama.cpp's Preliminary SM120 Native NVFP4 MMQ Is Merged
Posted by ggonavyy@reddit | LocalLLaMA | 38 comments
https://github.com/ggml-org/llama.cpp/pull/22196
And somehow we already got some GGUFs for it!
https://huggingface.co/CISCai/gemma-4-31B-it-NVFP4-turbo-GGUF
https://huggingface.co/stevelikesrhino/gemma-4-31B-it-nvfp4-GGUF
(the one below is from the PR author himself)
https://huggingface.co/michaelw9999/Nemotron-Cascade-2-30B-A3B-NVFP4-GGUF
AlwaysLateToThaParty@reddit
Does anyone know if the mxfp4 quantization would be affected by this?
Glittering-Call8746@reddit
Can this work with MoE with CPU offloading? (Not much info on NVFP4 inference, so...)
ggonavyy@reddit (OP)
Yes, this PR just makes prefill a LOT faster for 50-series users running NVFP4 models
CalligrapherFar7833@reddit
Not only the 50 series but also RTX 6000 / DGX Spark etc.
ggonavyy@reddit (OP)
Huh, didn't know the DGX is sm120 NVFP4 instead of sm100... You'd think if they nerfed the performance that much they would give it proper NVFP4 support
PhilippeEiffel@reddit
In fact GB10 (DGX Spark) is sm121.
CalligrapherFar7833@reddit
Fake Blackwell is sm120, which is anything consumer-facing: RTX 5xxx, RTX 6/5/4/2xxx Pro, DGX Spark
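If you're not sure which bucket a card falls into, here's a quick check (assuming a PyTorch build with CUDA is installed):

```python
# Print the compute capability the driver reports, e.g. (12, 0) = sm120, (12, 1) = sm121.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: sm{major}{minor}")
```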
Glittering-Call8746@reddit
Gimped Blackwell rofl, fake Blackwell is those GPUs with no chips lol..
CalligrapherFar7833@reddit
No gpu chips are the scam blackwell :D
Xp_12@reddit
I've been calling it brownwell.
Dany0@reddit
On a 5090 with a q3.6 27B quant in llama.cpp I'll get like 2500-5000-8000 tok/s prefill depending on context.
In vLLM with NVFP4 I'd get 9-11k, with the median closer to 11k, and it doesn't slow down much with context.
iMrParker@reddit
Curious about mixed-arch setups. Let's say a Blackwell and an Ampere card are used together, do both cards fall back to casting?
BigPoppaK78@reddit
It works, but there's zero benefit at the moment. With my 5070 Ti, I get the same speed for prompt processing at 100k: 2400 tk/s. But, token generation takes a huge hit from 65 to 30 tk/s. llama.cpp:b8967 on Fedora 43. https://i.imgur.com/VRFbPLo.png
StorageHungry8380@reddit
I'm not seeing that on my 5090, Windows 11 CUDA 13.1. However the model in both variants is larger than 16GB, so presumably you're using CPU for a few layers and that could explain it? I didn't bother downloading the Q4 variant, as I had the Q5, but here are my numbers:
BigPoppaK78@reddit
Well, yeah. He was asking about CPU offloading with the MoE model, so that's exactly what I tested.
StorageHungry8380@reddit
Alrighty, definitely time for me to go to bed. I assume NVFP4 is less CPU friendly than the standard Qn quantizations, so makes sense it's slower.
BigPoppaK78@reddit
Yeah, it was always going to have an overhead penalty for the switch. But it'd be more tolerable if it were something like a 6 or 7% hit to gain 30% on prompt processing. I'm sure things will improve over the next few weeks as it all gets optimized. Looking at the PR comments, they already have the next few steps in mind.
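Back-of-the-envelope with my numbers above, just to show why the generation hit dominates (the 2,000-token reply length is a made-up example):

```python
# Rough end-to-end latency with the 5070 Ti numbers above (100k-token prompt),
# assuming a hypothetical 2,000-token reply.
prompt_tokens, reply_tokens = 100_000, 2_000
pp_rate = 2_400                                    # tok/s prompt processing, same in both runs
for name, tg_rate in [("Q4", 65), ("NVFP4", 30)]:  # tok/s token generation
    total = prompt_tokens / pp_rate + reply_tokens / tg_rate
    print(f"{name}: ~{total:.0f} s total")         # Q4 ~72 s, NVFP4 ~108 s
```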
nufeen@reddit
Nice. Time to convert https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 into gguf
grumd@reddit
Like this? https://huggingface.co/jdziat/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4-GGUF
nufeen@reddit
"Requires Salamander (llama.cpp fork with NVFP4 and NemotronH LatentMoE support)." I don't know if this is compatible with the main llama.cpp
grumd@reddit
Yep you're right, the tensor shapes are not what's expected, we need new GGUFs
grumd@reddit
Downloading, we'll see
Bulky-Priority6824@reddit
NVFP4 speaks the GPU's native language. The Blackwell tensor cores have FP4 math built directly into the silicon, so the model weights go in as-is and the multiplication happens without any translation step. Less overhead, faster math, same bit width (rough sketch of the block layout at the end of this comment).
That being said, I benched it vs 35B-UD_Q4_XL using dual 5060 Ti 16GBs split by layer in llama.cpp. Results were identical.
Hopefully llama.cpp catches up to this and unlocks some deeper potential, as well as an mmproj, soon.
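Rough sketch of what I mean by "native language", as I understand the format (toy code, not the exact llama.cpp memory layout): each block of 16 weights is stored as 4-bit E2M1 values plus one small per-block scale, which is the shape the Blackwell FP4 tensor cores consume directly.

```python
# Toy sketch of NVFP4-style block quantization, as I understand it; not the
# exact llama.cpp layout. 16 weights per block, each a 4-bit E2M1 value,
# plus one per-block scale (stored as FP8 in the real format).
import numpy as np

E2M1_MAGS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable FP4 magnitudes

def quantize_block(w):                                # w: 16 float weights
    scale = float(np.abs(w).max()) / 6.0 or 1.0       # map the largest weight onto +/-6
    idx = np.abs(np.abs(w) / scale - E2M1_MAGS[:, None]).argmin(axis=0)
    return scale, np.sign(w), E2M1_MAGS[idx]          # 1 scale + 16 four-bit codes

def dequantize_block(scale, signs, mags):
    return scale * signs * mags

w = np.random.randn(16).astype(np.float32)
print(np.abs(w - dequantize_block(*quantize_block(w))).max())  # worst-case rounding error
```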
OkAwareness8446@reddit
Benched what vs 35B-UD_Q4_XL? I converted the RedHatAI and some other NVFP4 models to GGUF and ran them, and I get a very noticeable difference from the Q4 models
ProfessionalSpend589@reddit
nvfp4 is what the GPUs crave
Long_comment_san@reddit
thirsty gpu
ggonavyy@reddit (OP)
You have to use it with NVFP4 GGUFs; Q4 is still going through the regular MMQ and MMA paths
Bulky-Priority6824@reddit
Yep
georgeApuiu@reddit
sm121 when?
RedAdo2020@reddit
Okay, I'm not super technical with this, but wouldn't Q8 still be better than NVFP4? Serious question.
Baldur-Norddahl@reddit
NVFP4 is 4-bit. Yes, it is worse quality than Q8. That is not the point of this format. The idea is that it is faster, much faster, than other 4-bit formats. But only if you have a GPU with support, which is currently only the Nvidia 50xx and 6000 Pro.
MXFP4 is the same, just slightly worse quality.
But a few models were trained directly in these formats using QAT (quantization-aware training). An example is OpenAI GPT-OSS 20B/120B. In that case MXFP4 is the original quality and much better than Q8.
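Rough size math, if that helps; the block sizes are from memory, so treat the figures as ballpark rather than exact llama.cpp layouts. The finer-grained per-16 scales are also where NVFP4's small quality edge over MXFP4 comes from, at a tiny size cost.

```python
# Approximate bits per weight once the per-block scale overhead is counted in.
# (value bits, block size, scale bits) -- ballpark figures, not exact layouts.
formats = {
    "Q8_0":  (8, 32, 16),  # 8-bit values, fp16 scale per 32 weights
    "NVFP4": (4, 16, 8),   # 4-bit values, fp8 scale per 16 weights
    "MXFP4": (4, 32, 8),   # 4-bit values, shared 8-bit exponent per 32 weights
}
for name, (bits, block, scale_bits) in formats.items():
    print(f"{name}: {bits + scale_bits / block:.2f} bits/weight")
# Q8_0: 8.50, NVFP4: 4.50, MXFP4: 4.25
```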
RedAdo2020@reddit
I think I get it. So if I was going to run 4-bit anyway, with my Blackwell cards, then NVFP4 would be better quality and speed than Q4
ggonavyy@reddit (OP)
If you've got a 5090 and are running one of those GGUFs above, your prefill for those dense 30B-class models should now be in the 4000-6000 range. Generation is still capped by other overheads like VRAM bandwidth, but going from a 5000-token system prompt for a coding agent to actual generation should feel instant now. Cache invalidation should feel like less of a problem as well; prefilling from 50k to 70k is now, like, less than 10 seconds?
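Quick sanity check on that last number, using the prefill range quoted above:

```python
# Rough math behind "50k to 70k in under 10 seconds": 20k new tokens to prefill.
new_tokens = 70_000 - 50_000
for rate in (4_000, 6_000):               # tok/s prefill range quoted above
    print(f"{rate} tok/s -> {new_tokens / rate:.1f} s")
# 4000 tok/s -> 5.0 s, 6000 tok/s -> 3.3 s
```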
quantier@reddit
The reason we have GGUFs for it is because of LM Studio… we should now get a lot more 🎉
trueimage@reddit
Does the strix halo benefit from this?
Mister__Mediocre@reddit
Could someone explain what this does and how I can use it? I have a 5060ti and use MoE models with only attention on the GPU, all experts on the CPU.
ggonavyy@reddit (OP)
Honestly I haven't tested MoE with lots of CPU offload, but first you need an NVFP4 GGUF
ortegaalfredo@reddit
You have to be insane to make the investment in a Blackwell GPU and then use llama.cpp, which is basically single-thread inference. But I guess this will help those Nvidia Spark chips a lot.