llama.cpp - NVFP4 native support on Blackwell from now - b8967
Posted by mossy_troll_84@reddit | LocalLLaMA | View on Reddit | 34 comments
It looks like we finally have it! Time to test!!!
https://github.com/ggml-org/llama.cpp/releases/tag/b8967
ResponsibleTruck4717@reddit
Where can I find NVFP4 models, or does any MXFP4-MoE work?
LegacyRemaster@reddit
Random test: 61.2 tokens/sec on a Blackwell 96 GB, very good. About 45 before with Q4.
rerri@reddit
TG is not improved by this PR though, only PP.
LegacyRemaster@reddit
I'm adding Vulkan support to llama.cpp (DeepSeek 4 Flash) using Qwen35 27B NVFP4 with "fast" PP...
mossy_troll_84@reddit (OP)
That is correct - I have tested this and here are results: https://www.reddit.com/r/LocalLLaMA/comments/1syxckc/llamacpp_benchmark_native_vs_non_native_nvfp4_on/
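If anyone wants to reproduce the PP vs. TG split themselves, llama-bench reports them separately - a minimal run looks roughly like this (the model path is just a placeholder):

# pp512 = prompt processing speed, tg128 = token generation speed (t/s)
llama-bench -m ./Abiray-Qwen3.6-27B-NVFP4.gguf -p 512 -n 128

The pp512 number is where this PR helps; tg128 should look roughly the same as before.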
mossy_troll_84@reddit (OP)
But that is on Windows, right? That would explain the low numbers.
Mount_Gamer@reddit
I've been looking forward to this... Not sure if it's me, but I used a RedHat NVFP4 of the Qwen3.6 35B and converted it to GGUF. It was slow for token gen on an RTX 5060 Ti 16 GB, as I can't fit all the MoE on the GPU. With a 12800 context, ~9 tg/s.
mossy_troll_84@reddit (OP)
This is typical for all models that are bigger than VRAM - offload to RAM hurts NVFP4 more. I see the same now with Qwen3.5-122B-NVFP4 - approx. 10 tok/s (in native NVFP4 llama.cpp) vs. Qwen3.5-122B-Q4_K_M, where I had over 20 tok/s with offload to RAM (if I remember correctly - sorry, too many models and variants tested at once).
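If the experts don't fit, one thing worth trying (just a sketch - the path is a placeholder and the tensor regex may need tweaking per model) is keeping attention and shared weights on the GPU and pushing only the MoE expert tensors to system RAM with --override-tensor:

# offload everything to GPU except the MoE expert tensors, which go to CPU RAM
llama-server \
  -m ./Qwen3.5-122B-NVFP4.gguf \
  -ngl 99 \
  -ot "ffn_.*_exps=CPU" \
  -c 16384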
Mount_Gamer@reddit
Yup, I thought it might. I have been thinking of getting another RTX 5060 Ti, but not just now; I might try the smaller Qwen3.5 9B.
Formal-Exam-8767@reddit
What about Qwen3.6-35B-A3B-NVFP4?
mossy_troll_84@reddit (OP)
Formal-Exam-8767@reddit
Awesome numbers, thank you!
mossy_troll_84@reddit (OP)
No problem 🙂
mossy_troll_84@reddit (OP)
Maybe later (sorry, but I am at work and already spent my lunch break on it).
cheapyx@reddit
Please do it, PLEASEEEE!!! Tip: install Tailscale on your GPU box so you can do it over SSH from your phone xD.
SnooDoggos9325@reddit
Benchmarks?
mossy_troll_84@reddit (OP)
The stage is yours... I will do it after work.
patricious@reddit
I am currently building a new server around the newly added NVFP4 support. Planning to use this model: Freenixi/Abiray-Qwen3.6-27B-NVFP4-GGUF
patricious@reddit
Here is my brief test in a fresh context session. The data is taken directly from the raw cmd logs of the servers.
Same prompt for both runs:
Analyze the project folder structure in C:\Users\Administrator\XYZ\llama.cpp\turboquant-llamacpp-nvfp4
**Don't deploy agents, do the work yourself**
Setup: 5090, OpenCode
Models:
Freenixi/Abiray-Qwen3.6-27B-NVFP4-GGUF (via the latest llama.cpp NVFP4 support)
unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf
mossy_troll_84@reddit (OP)
Oh wow! TG is rather slow... I have approx. 70 tok/s, and with -fitc 16384 (the option with a defined minimum context; if more than this is available, it fills the context up to what fits in VRAM) I have 207616 context available. Maybe you set it, e.g., with -c 262144, and then part of it landed in regular RAM and slowed things down... Another thing: do you see this in the log (BLACKWELL_NATIVE_FP4 = 1)?
patricious@reddit
Yes, I have BLACKWELL_NATIVE_FP4 = 1 in the log once I start it with the -fitc 16384 flag enabled. But I get weird behavior where OpenCode keeps compacting the conversation endlessly; I prompted a simple message in both an empty project and a big one. Any suggestions why that might be?
Despite not getting any tangible output from the model, I could see it generating tokens at a higher pace.
mossy_troll_84@reddit (OP)
This sounds more like an OpenCode/context/template issue than an NVFP4 or Blackwell issue. If BLACKWELL_NATIVE_FP4 = 1 appears in the llama.cpp log and tokens are generating faster, then the native FP4 path is probably working. I'd test it in this order:

Try the model directly with llama.cpp, without OpenCode:
curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Say hello in one short sentence."}
],
"max_tokens": 64,
"temperature": 0.2
}'
If this works normally, the model/llama.cpp/NVFP4 side is likely fine.

Then start llama-server with a larger context and the built-in chat template.
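Something like this (the model path, port, and layer count are just placeholders, adjust to your setup):

# --jinja uses the chat template embedded in the GGUF
llama-server \
  -m ./Abiray-Qwen3.6-27B-NVFP4.gguf \
  -ngl 99 \
  -c 32768 \
  --jinja \
  --host 127.0.0.1 \
  --port 8080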
In OpenCode, make sure the configured context limit is not higher than the llama-server context. For example, if llama-server uses --ctx-size 32768, set the OpenCode context to 32768 or lower.
Try disabling auto-compaction temporarily:

"compaction": {
"auto": false,
"prune": true,
"reserved": 12000
}
Also try clearing the OpenCode session/cache, because it may keep reusing a broken/compacted conversation state.

So my guess is OpenCode is getting confused by context size, compaction, or the chat/tool template. Direct llama.cpp testing should confirm that quickly.
Pablo_the_brave@reddit
VRAM is looking strange... What do you think about it?
mossy_troll_84@reddit (OP)
Yes, a bit, because mine looks like this (FYI - the RTX 5060 Ti was not used) with context 207616.
mossy_troll_84@reddit (OP)
Those are the results at the top from that one 🙂
Unlucky-Message8866@reddit
Great! Getting about +5 tok/s compared to before!
LegacyRemaster@reddit
OK, I can test. Let me build it.
mossy_troll_84@reddit (OP)
Already added the results at the top.
LegacyRemaster@reddit
My Blackwell is at 300 W, so yeah... a little less. I'd have to push it to 600 W if I want more, but it's good. Testing with VSCode + Kilo Code now.
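(For reference, on Linux that's just this - needs root, and only works up to whatever max the board allows:)

# check the current / max allowed power limits
nvidia-smi -q -d POWER
# raise the limit (resets on reboot)
sudo nvidia-smi -pl 600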
mossy_troll_84@reddit (OP)
Ach, ok, now I get it! I am curious about your results then
WhiskyAKM@reddit
Now give me a GGUF of Gemma 4 / Qwen 3.6 in NVFP4.
mossy_troll_84@reddit (OP)
https://huggingface.co/Freenixi/Abiray-Qwen3.6-27B-NVFP4-GGUF
For Gemma 4 there are some, but I have not tested them yet. Actually, there are a couple of models in NVFP4 and GGUF at the same time:
https://huggingface.co/models?library=gguf&sort=downloads&search=nvfp4
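If you want to grab one from the CLI, something like this should do it (the exact filenames in the repo may differ):

huggingface-cli download Freenixi/Abiray-Qwen3.6-27B-NVFP4-GGUF \
  --include "*.gguf" \
  --local-dir ./models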
WhiskyAKM@reddit
Thank you very much, I'll go test it now.