llama.cpp - NVFP4 native support on Blackwell from now - b8967
Posted by mossy_troll_84@reddit | LocalLLaMA | View on Reddit | 34 comments
It looks like we finally have it! Time to test!!!
https://github.com/ggml-org/llama.cpp/releases/tag/b8967
ResponsibleTruck4717@reddit
Where can I find NVFP4 models, or does any MXFP4-MoE work?
LegacyRemaster@reddit
Random test: 61.2 tokens/sec on a Blackwell 96 GB, very good. About 45 before with Q4.
rerri@reddit
TG is not improved by this PR though, only PP.
LegacyRemaster@reddit
I'm adding Vulkan support to llama.cpp (DeepSeek 4 Flash) using Qwen35 27B NVFP4 with "fast" PP...
mossy_troll_84@reddit (OP)
That is correct - I have tested this and here are results: https://www.reddit.com/r/LocalLLaMA/comments/1syxckc/llamacpp_benchmark_native_vs_non_native_nvfp4_on/
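If anyone wants to reproduce the PP vs. TG split themselves, llama-bench reports them separately - a minimal run looks roughly like this (the model path is just a placeholder):

# pp512 = prompt processing speed, tg128 = token generation speed (t/s)
llama-bench -m ./Abiray-Qwen3.6-27B-NVFP4.gguf -p 512 -n 128

The pp512 number is where this PR helps; tg128 should look roughly the same as before.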
mossy_troll_84@reddit (OP)
But that is on Windows, right? That would explain the low numbers.
Mount_Gamer@reddit
I've been looking forward to this... Not sure if it's me, but I used a RedHat NVFP4 of the Qwen3.6 35B and converted it to GGUF. It was slow for token gen on an RTX 5060 Ti 16 GB, as I can't fit all the MoE on the GPU. With a 12800 context, ~9 tg/s.
mossy_troll_84@reddit (OP)
This is typical for all models that are bigger than VRAM - offload to RAM hurts NVFP4 more. I see the same now with Qwen3.5-122B-NVFP4 - approx. 10 tok/s (in native NVFP4 llama.cpp) vs. Qwen3.5-122B-Q4_K_M, where I had over 20 tok/s with offload to RAM (if I remember correctly - sorry, too many models and variants tested at once).
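If the experts don't fit, one thing worth trying (just a sketch - the path is a placeholder and the tensor regex may need tweaking per model) is keeping attention and shared weights on the GPU and pushing only the MoE expert tensors to system RAM with --override-tensor:

# offload everything to GPU except the MoE expert tensors, which go to CPU RAM
llama-server \
  -m ./Qwen3.5-122B-NVFP4.gguf \
  -ngl 99 \
  -ot "ffn_.*_exps=CPU" \
  -c 16384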
Mount_Gamer@reddit
Yup, I thought it might. I have been thinking of getting another RTX 5060 Ti, but not just now; I might try the smaller Qwen3.5 9B.
Formal-Exam-8767@reddit
What about Qwen3.6-35B-A3B-NVFP4?
mossy_troll_84@reddit (OP)
Formal-Exam-8767@reddit
Awesome numbers, thank you!
mossy_troll_84@reddit (OP)
No problem 🙂
mossy_troll_84@reddit (OP)
Maybe later (sorry, but I am at work and already spent my lunch break on it).
cheapyx@reddit
Please do it, PLEASEEEE!!! Tip: install Tailscale on your GPU box so you can do it over SSH from your phone xD.
SnooDoggos9325@reddit
Benchmarks?
mossy_troll_84@reddit (OP)
The stage is yours... I will do it after work.
patricious@reddit
I am currently building a new server around the newly added NVFP4 support. Planning to use this model: Freenixi/Abiray-Qwen3.6-27B-NVFP4-GGUF
patricious@reddit
Here is my brief test in a fresh context session. The data is taken directly from the raw cmd logs of the servers.
Same prompt for both runs:
Analyze the project folder structure in C:\Users\Administrator\XYZ\llama.cpp\turboquant-llamacpp-nvfp4
**Don't deploy agents, do the work yourself**
Setup: 5090, OpenCode
Models:
Freenixi/Abiray-Qwen3.6-27B-NVFP4-GGUF (via the latest llama.cpp NVFP4 support)
unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf
mossy_troll_84@reddit (OP)
Oh wow! TG is rather slow... I have approx. 70 tok/s, and with -fitc 16384 (the option with a defined minimum context; if more than this is available, it fills the context up to what fits in VRAM) I have 207616 context available. Maybe you set it, e.g., with -c 262144, and then part of it landed in regular RAM and slowed things down... Another thing: do you see this in the log (BLACKWELL_NATIVE_FP4 = 1)?
patricious@reddit
Yes, I have BLACKWELL_NATIVE_FP4 = 1 in the log once I start it with the -fitc 16384 flag enabled. But I get weird behavior where OpenCode keeps compacting the conversation endlessly; I prompted a simple message in both an empty project and a big one. Any suggestions why that might be?
Despite not getting any tangible output from the model, I could see it generating tokens at a higher pace.
mossy_troll_84@reddit (OP)
This sounds more like an OpenCode/context/template issue than an NVFP4 or Blackwell issue. If BLACKWELL_NATIVE_FP4 = 1 appears in the llama.cpp log and tokens are generating faster, then the native FP4 path is probably working. I'd test it in this order:

Try the model directly with llama.cpp, without OpenCode:
curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Say hello in one short sentence."}
],
"max_tokens": 64,
"temperature": 0.2
}'
If this works normally, the model/llama.cpp/NVFP4 side is likely fine.

Then start llama-server with a larger context and the built-in chat template.
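Something like this (the model path, port, and layer count are just placeholders, adjust to your setup):

# --jinja uses the chat template embedded in the GGUF
llama-server \
  -m ./Abiray-Qwen3.6-27B-NVFP4.gguf \
  -ngl 99 \
  -c 32768 \
  --jinja \
  --host 127.0.0.1 \
  --port 8080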
In OpenCode, make sure the configured context limit is not higher than the llama-server context. For example, if llama-server uses --ctx-size 32768, set the OpenCode context to 32768 or lower.
Try disabling auto-compaction temporarily:

"compaction": {
"auto": false,
"prune": true,
"reserved": 12000
}
Also try clearing the OpenCode session/cache, because it may keep reusing a broken/compacted conversation state.

So my guess is OpenCode is getting confused by context size, compaction, or the chat/tool template. Direct llama.cpp testing should confirm that quickly.
Pablo_the_brave@reddit
VRAM is looking strange... What do you think about it?
mossy_troll_84@reddit (OP)
Yes, a bit, because mine looks like this (FYI - the RTX 5060 Ti was not used) with context 207616.
mossy_troll_84@reddit (OP)
Those are the results at the top from that one 🙂
Unlucky-Message8866@reddit
Great! Getting about +5 tok/s compared to before!
LegacyRemaster@reddit
OK, I can test. Let me build it.
mossy_troll_84@reddit (OP)
Already added the results at the top.
LegacyRemaster@reddit
My Blackwell is at 300 W, so yeah... a little less. I'd have to push it to 600 W if I want more, but it's good. Testing with VSCode + Kilo Code now.
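(For reference, on Linux that's just this - needs root, and only works up to whatever max the board allows:)

# check the current / max allowed power limits
nvidia-smi -q -d POWER
# raise the limit (resets on reboot)
sudo nvidia-smi -pl 600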
mossy_troll_84@reddit (OP)
Ach, ok, now I get it! I am curious about your results then
WhiskyAKM@reddit
Now give me a GGUF of Gemma 4 / Qwen 3.6 in NVFP4.
mossy_troll_84@reddit (OP)
https://huggingface.co/Freenixi/Abiray-Qwen3.6-27B-NVFP4-GGUF
For Gemma 4 there are some, but I have not tested them yet. Actually, there are a couple of models in NVFP4 and GGUF at the same time:
https://huggingface.co/models?library=gguf&sort=downloads&search=nvfp4
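If you want to grab one from the CLI, something like this should do it (the exact filenames in the repo may differ):

huggingface-cli download Freenixi/Abiray-Qwen3.6-27B-NVFP4-GGUF \
  --include "*.gguf" \
  --local-dir ./models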
WhiskyAKM@reddit
Thank you very much, I'll go test it now.