Hey, has anyone here used Qwen3.5-27B-NVFP4-GGUF with llama.cpp yet?
Posted by mossy_troll_84@reddit | LocalLLaMA | 26 comments
Hey!
I was wondering if any of you have used Qwen3.5-27B-NVFP4-GGUF on an RTX 5090 with llama.cpp? I downloaded and tested Freenixi/AxionML-Qwen3.5-27B-NVFP4-GGUF today and it's quite impressive (quality of answers, and definitely better in non-English languages). Also, what speed did you get with llama.cpp? Just asking out of curiosity. Please share your experience. Thanks!
paq85@reddit
AFAIK you need to use vLLM to really use NVFP4. I've spent quite some time making models like that run... and didn't notice any better performance... but maybe I misconfigured it or something...
mossy_troll_84@reddit (OP)
With llama.cpp, Q4_K_M vs. NVFP4 on my side was 74 tok/sec vs. 72 tok/sec - but the quality of the answers on complicated questions - maths, coding, logic, the Polish language, etc. - is where NVFP4 amazes me. Of course I mainly did it to check whether llama.cpp supports NVFP4 at all on my Blackwell card, but I am positively surprised.
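Speed comparisons like this are easiest to reproduce with llama.cpp's own llama-bench tool. A sketch, assuming a standard build and a placeholder model path:

```shell
# Measure prompt processing (-p) and token generation (-n) throughput.
# The model path is a placeholder; -ngl 99 offloads all layers to the GPU.
./build/bin/llama-bench \
  -m models/qwen3.5-27b-nvfp4.gguf \
  -p 512 -n 128 -ngl 99
```

Running it once per quant (NVFP4, Q4_K_M) with the same -p/-n settings gives directly comparable tok/s numbers.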
andy2na@reddit
llama.cpp only supports nvfp4 on cpu, not cuda, yet
paq85@reddit
Yeah... So the Blackwell (rtx 5090) is not used at all...
BobbyL2k@reddit
Yes, but TG is bottlenecked by memory bandwidth anyway, so OP is getting the 4-bit size with the added accuracy of NVFP4.
PP would be faster with proper native support.
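That memory-bandwidth argument can be put into numbers. A back-of-envelope sketch; the figures are assumptions for illustration (RTX 5090 at roughly 1792 GB/s peak bandwidth, a 27B model at ~4.5 bits/weight ≈ 15 GB), not measurements from this thread:

```shell
# Rough TG ceiling for a memory-bandwidth-bound model:
# tokens/sec ≈ memory bandwidth / bytes read per token (≈ model size).
awk 'BEGIN {
  bandwidth_gb_s = 1792   # RTX 5090 peak memory bandwidth (assumed)
  model_gb      = 15      # ~27B params at ~4.5 bits/weight (assumed)
  printf "theoretical TG ceiling: ~%.0f tok/s\n", bandwidth_gb_s / model_gb
}'
```

Real TG lands well below that ceiling, but the point stands: at equal bits per weight, two 4-bit formats read roughly the same number of bytes per token, so TG speed should be similar regardless of kernel support.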
mossy_troll_84@reddit (OP)
just a screenshot from NVtop
mossy_troll_84@reddit (OP)
really?
andy2na@reddit
What is this showing? The only NVFP4 support merged into the main branch so far is CPU-side.
mossy_troll_84@reddit (OP)
GPU usage
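For anyone wanting to check the same thing without a screenshot: nvidia-smi can poll utilization from the terminal while llama.cpp generates. If only VRAM is allocated but the compute runs on the CPU, GPU utilization stays near 0% during generation.

```shell
# Poll GPU compute utilization and VRAM use once per second.
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```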
andy2na@reddit
where did you get a gguf of a nvfp4 quant for gemma-4-27B? I will test it also
mossy_troll_84@reddit (OP)
It's not Gemma, it's Qwen3.5-27B. Here is the link:
https://huggingface.co/Freenixi/AxionML-Qwen3.5-27B-NVFP4-GGUF
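For reference, one way to pull the whole repo down is the huggingface_hub CLI (a sketch; the local directory is arbitrary):

```shell
# Download the GGUF repo linked above with the Hugging Face CLI.
pip install -U huggingface_hub
huggingface-cli download Freenixi/AxionML-Qwen3.5-27B-NVFP4-GGUF \
  --local-dir ./models/qwen3.5-27b-nvfp4
```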
andy2na@reddit
tested it out, around 34t/s generation on my 3090 vs IQ4 quant at 39t/s
I know the 3090 doesn't have NVFP4 compatibility (and my 5060 Ti 16GB won't fit that model), but if the quality of NVFP4 is truly near-lossless, then even a 5 t/s speed drop is worth it. Some NVFP4 benchmarks, though, show that it's not better than Q4_K_M: https://huggingface.co/krampenschiesser/Qwen3.5-35B-A3B-NVFP4.gguf
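For a more controlled quality comparison than eyeballing answers, llama.cpp ships llama-perplexity; lower perplexity on the same text file means less quantization damage. A sketch with placeholder paths (wiki.test.raw is the commonly used test set):

```shell
# Run the same text through both quants and compare the final PPL numbers.
./build/bin/llama-perplexity -m models/qwen3.5-27b-nvfp4.gguf \
  -f wiki.test.raw -ngl 99
./build/bin/llama-perplexity -m models/qwen3.5-27b-q4_k_m.gguf \
  -f wiki.test.raw -ngl 99
```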
mossy_troll_84@reddit (OP)
Thank you for your feedback! Need to read that. :)
paq85@reddit
It is using GPU, but it most likely can't use the Blackwell hardware support for nvfp4 to make it run even faster.
"llama.cpp does not yet provide mature, native NVFP4 execution; community PRs and issues show work is underway to add NVFP4 dequantization and kernels, but it’s not yet mainstream. If you want to run NVFP4 models today with hardware acceleration, llama.cpp is not the recommended path."
When will we support NVFP4? · ggml-org/llama.cpp · Discussion #16668 · GitHub
Pozdrawiam ;)
emprahsFury@reddit
That link is from half a year ago. If you look at the HF model, it's clearly in NVFP4, and llama.cpp is happy to put it on a CUDA device.
paq85@reddit
Could you show an example? All the NVFP4 models I have seen say to use vLLM.
mossy_troll_84@reddit (OP)
here is a small (but real) list of NVFP4 models in GGUF format for llama.cpp:
https://huggingface.co/models?library=gguf&apps=llama.cpp&sort=trending&search=nvfp4
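The same search can be scripted against the Hugging Face model API (a sketch; assumes curl and jq are installed):

```shell
# List trending GGUF repos matching "nvfp4" via the HF model API.
curl -s 'https://huggingface.co/api/models?search=nvfp4&library=gguf&sort=trending' \
  | jq -r '.[].id'
```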
mossy_troll_84@reddit (OP)
With that I can agree - this is not a full implementation yet, with the full possible speed etc. But the simple fact that it works, at almost the same speed, is a big thing!
Easy_Apricot_46@reddit
I'm using this llama.cpp branch (Blackwell native NVFP4 support), not pushed to main yet: https://github.com/ggml-org/llama.cpp/pull/21896
It's a pretty meaningful speedup in prompt processing.
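For anyone wanting to try that branch before it's merged, the PR head can be fetched directly (PR number taken from the link above):

```shell
# Fetch and check out the Blackwell NVFP4 PR branch, then build as usual.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/21896/head:nvfp4-blackwell
git checkout nvfp4-blackwell
```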
Dany0@reddit
I tried it with vLLM + DFlash and, despite patching it to work, it OOM'd. NVFP4 != Q4; it actually keeps some layers at full precision, IIRC. I guess I could give it another try without DFlash.
StardockEngineer@reddit
Are you running vulkan? I see this b8785 │ 2026-04-14 │ Vulkan: Support for GGML_TYPE_NVFP4 (nvfp4 quantization)
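For reference, building the Vulkan backend to test that is just a different CMake flag (a sketch, analogous to the CUDA build OP posted):

```shell
# Build llama.cpp with the Vulkan backend instead of CUDA.
cmake -S . -B build-vulkan -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-vulkan -j"$(nproc)"
```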
mossy_troll_84@reddit (OP)
It's CUDA; I have never used it with Vulkan. Normally I compile with only these commands:
export CUDACXX=/opt/cuda/bin/nvcc
export CMAKE_CUDA_COMPILER=/opt/cuda/bin/nvcc
export CUDAToolkit_ROOT=/opt/cuda
export CUDA_HOME=/opt/cuda
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_NATIVE=ON \
-DGGML_OPENMP=ON \
-DGGML_LTO=ON \
-DGGML_CUDA=ON \
-DGGML_CUDA_GRAPHS=ON \
-DGGML_CUDA_FA=ON \
-DGGML_CUDA_FA_ALL_QUANTS=ON \
-DCMAKE_CUDA_ARCHITECTURES="120a" \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_STATIC=ON
cmake --build build -j"$(nproc)"
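Once built, a typical way to serve the model with full GPU offload (a sketch; model path, context size, and port are placeholders):

```shell
# Serve the model over HTTP with all layers offloaded to the GPU.
./build/bin/llama-server \
  -m models/qwen3.5-27b-nvfp4.gguf \
  -ngl 99 -c 8192 --port 8080
```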
HopePupal@reddit
I have numbers for that model in NVFP4 in vLLM on an RTX PRO 4500 (not a 5090, but basically a 5080 with double the VRAM): 4.6k t/s PP, 8.4 t/s TG at 8k context, degrading to 3.5k t/s PP at 64k context, same TG. So that's your floor; a 5090 should be able to improve on that.
does llama.cpp even support NVFP4, though?
mossy_troll_84@reddit (OP)
Yes, it works; I was surprised. There are only a couple of models on Hugging Face that are GGUF and NVFP4 at the same time. I still need to check PP later, but context handling and TG look good to me.