FP4 inference in llama.cpp (NVFP4) and ik_llama.cpp (MXFP4) landed - Finally
Posted by Usual-Carrot6352@reddit | LocalLLaMA | View on Reddit | 53 comments
Both llama.cpp and ik_llama.cpp now have FP4 support — but with different flavors worth knowing about.
llama.cpp recently merged NVFP4 (Nvidia's block-scaled FP4, `GGML_TYPE_NVFP4 = 40`), with CUDA kernels landing in `mmq.cuh`, `mmvq.cu`, `convert.cu` and others.
ik_llama.cpp has had MXFP4 (`GGML_TYPE_MXFP4 = 39`) since PR #682 — the MX-standard FP4 used in OpenAI's gpt-oss models. Coverage is actually broader: CPU (AVX2, NEON, Zen4) and CUDA are both implemented.
They're not the same wire format — NVFP4 is Nvidia's flavor (E2M1 values with an FP8 E4M3 scale per 16-element block), while MXFP4 follows the MX consortium standard (E2M1 values with a shared power-of-two E8M0 scale per 32-element block) — but both land in the 4-bit float regime and should bring meaningful VRAM savings once model support catches up.
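For intuition, here's a minimal Python sketch of the shared idea: 4-bit E2M1 values decoded against one shared block scale. The block sizes and scale formats in the comments come from the public specs; the rest is illustrative and not the actual GGUF byte layout of either type.

```python
# The 8 representable E2M1 magnitudes; the 4th bit of each code is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_fp4(nibble: int) -> float:
    """Decode one 4-bit E2M1 code (1 sign bit + 3 magnitude bits)."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]

def decode_block(nibbles: list[int], scale: float) -> list[float]:
    """Decode a block of FP4 codes against its shared scale.
    MXFP4: 32 values per block, power-of-two (E8M0) scale.
    NVFP4: 16 values per block, FP8 (E4M3) scale."""
    return [decode_fp4(n) * scale for n in nibbles]

# Toy 4-value block with a scale of 2**-2:
print(decode_block([0x1, 0x9, 0x7, 0xF], scale=0.25))
# [0.125, -0.125, 1.5, -1.5]
```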
Verified by grepping both repos locally today.
My specs: 5090 (24GB VRAM)
Go grab and play with models:
https://huggingface.co/models?num_parameters=min:0,max:64B&sort=modified&search=NVFP4
Personal favorite ones:
- Abiray-Qwen3.6-27B-NVFP4
- Qwen3-1.7B-NVFP4A16
- Qwen3.5-2B-NVFP4
- gemma-4-31B-it-NVFP4-turbo-GGUF
- Qwen3-0.6B-FP4
Exciting times for quantization.
x8code@reddit
This is HUGE! I've been waiting for this. I'm running an RTX 5080 + 5060 Ti 16 GB, for a total of 32 GB (minus Win11 overhead) VRAM.
I really want to try out NVFP4 models, because they're supposed to perform ridiculously well on Blackwell GPUs.
Bubbly-Staff-9452@reddit
What do you think about your GPU setup? I’ve thought about going the same route. I bought a 5080 for gaming before I got into LLM hosting and now I don’t know if I should get a 5060 Ti to go along with it or just sell my 5080 and buy the cheapest 5090 I can. I know which would be better/more convenient but it’s so much money for a 5090.
x8code@reddit
And yeah, don't waste your money paying double on an RTX 5090. Just get dual RTX 5080 or something like that.
oxygen_addiction@reddit
You can literally get a 5090 for the price of 2x 5080 (at least in Europe) and the memory bandwidth of the 5090 is twice that of the 5080, so token generation will be way faster across the board.
If you can get a 5090, do that, as you will be able to get more for cheaper in the future or add RTX6000, etc.
It's way more future-proof due to the high bandwidth + high VRAM combo.
x8code@reddit
2x RTX 5080 = $2400 vs. 1x RTX 5090 = $3800 if you're lucky
Explain how that is the "same" price?
oxygen_addiction@reddit
Ah, my bad. You are 100% right.
For some reason I thought the RTX 5080 was way more expensive now. Found them for around €1189 on MindFactory, so your numbers are accurate. Cheers.
x8code@reddit
Respect. Cheers mate
oxygen_addiction@reddit
Much love.
x8code@reddit
It's freaking awesome. I was actually thinking about buying another RTX 5060 Ti 16 GB or RTX 5080, so I could go a bit higher. I would have to move my 5060 Ti 16 GB outside the case and have two connected via Oculink though. Only the primary 5080 would fit in the case / PCIe slots. I would have to replace the secondary GPU with an Oculink PCIe card.
No_Conversation9561@reddit
Did you get any speed up? I have the same setup as yours and don’t see much or any improvement.
x8code@reddit
I haven't tried it yet, unfortunately. Just haven't had time.
HopePupal@reddit
you can do that today with vLLM, which already supports NVFP4
Usual-Carrot6352@reddit (OP)
The output quality is amazing using default settings.
x8code@reddit
Isn't it more about speed though? I was watching a video on Aleksander Ziskind's channel, and NVFP4 seemed to multiply performance many times over, e.g. 60 → 1200 TPS.
Pineapple_King@reddit
Jesus christ, can't you just write what any of this Chinese means? What is NVFP4 and why would I want it?
ResidentPositive4122@reddit
bruh... that's why we check what the LLMs output!
Usual-Carrot6352@reddit (OP)
I must be honest, I quickly vibe-prepped that post via Opus4.6
LetsGoBrandon4256@reddit
Imagine coming here to learn about new stuff then realize OP is posting unchecked AI slop.
Usual-Carrot6352@reddit (OP)
Don't gaslight here, nothing is slop; a word was corrected.
LetsGoBrandon4256@reddit
Yeah sure thing.
Usual-Carrot6352@reddit (OP)
Go play somewhere else, kid.
LetsGoBrandon4256@reddit
lol are you seriously getting upset because you got called out for slop-posting?
Bootes-sphere@reddit
NVFP4 and MXFP4 landing simultaneously is huge — this is the kind of infrastructure progress that actually matters for local inference. The quantization wars are finally settling into something practical.
Real talk though: FP4 is still early enough that you'll see variance in quality depending on model architecture. Some layers handle it better than others. If you're experimenting, start with smaller models first (7B range) to see if the perplexity hit is acceptable for your use case, then scale up.
The memory savings are legit — you're looking at roughly 2x reduction over FP8 on consumer GPUs. But don't expect "just works" across every model yet. Torch compatibility and kernel optimization are still catching up.
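Back-of-envelope math on that 2x claim (weights only, ignoring KV cache and activations; the ~4.5 bits/weight figure for NVFP4, i.e. 4-bit values plus an FP8 scale per 16, is my assumption for illustration):

```python
def weight_gib(params_b: float, bits_per_weight: float) -> float:
    """Raw weight footprint in GiB for a model with params_b billion parameters."""
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

for label, bpw in [("FP16", 16), ("FP8", 8), ("FP4 + block scales", 4.5)]:
    print(f"27B weights @ {label}: ~{weight_gib(27, bpw):.1f} GiB")
# 27B weights @ FP16: ~50.3 GiB
# 27B weights @ FP8: ~25.1 GiB
# 27B weights @ FP4 + block scales: ~14.1 GiB
```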
Are you planning to test this on specific hardware, or just curious about the general capability? The hardware matters *a lot* for how these quantizations actually perform.
Bootes-sphere@reddit
This is huge for local inference efficiency! FP4 quantization hitting llama.cpp means you can run significantly larger models on consumer hardware with minimal quality loss. The speed improvements should be noticeable too, especially on older GPUs that struggle with standard precision formats. If you're experimenting with different model sizes and providers to find your sweet spot, tools like our AI Leak Checker (aisecuritygateway.ai/ai-leak-checker) can help you safely test prompts without accidentally leaking sensitive data during benchmarking.
Dany0@reddit
FYI, this is just compatibility, not speedup, yet. There is also a known bug causing a tiny 2% ppl loss.
Real NVFP4 with speedup is coming.
Dany0@reddit
By speedup I mean leveraging Blackwell's hardware-level support. Right now it just gets converted to FP16 during inference.
_FlyingWhales@reddit
If you have a *real* Blackwell card with TMEM etc., chances are that you are not going to use llama.cpp anyway.
Dany0@reddit
My 5090 is just as real Blackwell as any other, and no Jensen will take that away from me 😠😡
MelodicRecognition7@reddit
but he took already...
Dany0@reddit
I am more powerful than Jensen, the jacket is a monkey's paw curse. I have something Jensen will never have: love
_FlyingWhales@reddit
Lol ok buddy
Usual-Carrot6352@reddit (OP)
share more plz. thank u
Dany0@reddit
You get the VRAM savings + the ability to run those models on Vulkan and non-Blackwell CUDA, but no speedup on Blackwell GPUs.
Usual-Carrot6352@reddit (OP)
thanks! you're right, just checked:
❌ NOT getting true FP4 compute
✔️ Getting FP4 storage + dequantization → higher-precision compute
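In other words (a toy numpy sketch of the idea, not llama.cpp's actual kernel layout): the 4-bit codes get expanded back to full precision before the matmul, so VRAM shrinks but the arithmetic itself doesn't change.

```python
import numpy as np

# Signed E2M1 lookup: index = 4-bit code (sign bit + 3 magnitude bits).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                 -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    """Expand FP4 codes to float32 weights using the shared block scale."""
    return (E2M1[codes] * scale).astype(np.float32)

codes = np.array([[1, 7], [9, 2]])    # stored: 4 bits per weight (+ a scale)
w = dequantize(codes, scale=0.5)      # compute: full-precision weights again
x = np.ones((1, 2), dtype=np.float32)
print(x @ w)                          # the GEMM itself runs in FP32 here
```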
Dany0@reddit
Right, I checked out the PR again, and you do actually get a speed boost, but only on prefill/prompt processing and only at 30-40k+ context.
Usual-Carrot6352@reddit (OP)
hmm
HopePupal@reddit
any idea if the llamas are planning FP8 paths for Ada and RDNA 4 cards?
rerri@reddit
With this PR there is a significant speedup for PP on Blackwell cards when using NVFP4 quants:
https://github.com/ggml-org/llama.cpp/pull/22196
On a 5090 the gains I've seen are about 66% with Gemma 4 31B, a little over 50% with Qwen 27B, and something like 33% with Qwen 35B MoE.
Hoping a PR for TG acceleration will land soon.
FullOf_Bad_Ideas@reddit
how's the impact on quality like?
Quantizing activations to 4 bits usually has a VERY degrading effect on quality, so I'd assume that unless QAT is done, the model will be unusably bad.
Benchmarks on the linked AxionML/Qwen3.5-2B-NVFP4 look totally fake - probably just original numbers scaled by a factor to appear realistic, but I really doubt they ran all of those benchmarks.
Benchmarks on the small Qwen3-1.7B-NVFP4A16, which doesn't get the speed benefit since it still uses 16-bit activations, look pretty bad.
And a 4-bit activation quant would be even worse. It would be awesome if someone with this hardware did PPL and KLD testing of those quants when run with a W4A4 scheme; it's probably already done and buried somewhere on GitHub.
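For anyone who does run that comparison, the KLD number people quote is basically the following (a minimal numpy sketch of the metric itself; llama.cpp's perplexity tool has a KL-divergence mode that computes this properly over a corpus, if I remember right):

```python
import numpy as np

def mean_kld(logits_ref: np.ndarray, logits_quant: np.ndarray) -> float:
    """Mean KL(P_ref || P_quant) over token positions.
    Inputs are [n_positions, vocab_size] next-token logits from the
    full-precision reference model and the quantized model on the same text."""
    def softmax(z: np.ndarray) -> np.ndarray:
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    eps = 1e-12  # guards log() against underflow to zero
    p, q = softmax(logits_ref), softmax(logits_quant)
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

# Near 0 means the quant barely shifts the token distribution; the ~0.11
# quoted further down for the 397B NVFP4 quant is a noticeable drift.
```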
MelodicRecognition7@reddit
I think there is no point in quantizing 27-31B models to 4-bit, but this might be beneficial for 200B+ models like MiniMax or the larger Qwens. Or Kimi, if someone is going to re-quantize it.
FullOf_Bad_Ideas@reddit
I actually found someone making KLD measurements of bigger qwens, including NVFP4.
https://old.reddit.com/r/LocalLLaMA/comments/1roz3yl/if_youre_using_nvidias_nvfp4_of_qwen35397_try_a/
Nvidia/Qwen3.5-397B-A17B-NVFP4 has KLD of about 0.11
That's very similar to my 2.5bpw EXL3 quant - https://huggingface.co/cpral/Qwen3.5-397B-A17B-exl3 - which is obviously not great but if you don't have VRAM for a bigger quant it's fine.
NVFP4 as it's sold for those quants is a marketing trick.
Read the thread and comments by OP - you need additional training to make quants good, and that's a thousand USD for a 120B model. So probably 10-100k USD to make a good Qwen/MiniMax NVFP4 quant....
In the llama.cpp thread about NVFP4, ggerganov did claim it's mostly something for use with models trained in NVFP4, not converted to NVFP4, and I think people miss that and jump on it blindly.
marscarsrars@reddit
Absolutely delicious.
Purple-Programmer-7@reddit
For anyone reading this who is contributing to llama.cpp, THANK YOU! 🙏
Chance_Value_Not@reddit
Why do you have 24gb vram on your 5090
Usual-Carrot6352@reddit (OP)
because I have two ba**s :p
Chance_Value_Not@reddit
One ball if it's a 5090M ;)
InformationSweet808@reddit
Good breakdown. So right now it’s basically VRAM savings > speed, with prefill gains only kicking in at high ctx. The real question is: once kernels mature, does NVFP4 actually beat MXFP4 in end-to-end latency or is this just Nvidia lock-in with marginal upside?
roxoholic@reddit
Any head-to-head ppl comparison of nvfp4, q4, q8 and fp8?
smart4@reddit
What's the difference between this and Q4 GGUF files?
jacek2023@reddit
why is your home blurred?
Usual-Carrot6352@reddit (OP)
I did it myself :D
jacek2023@reddit
congratulations on your blurring skills, but why?
Usual-Carrot6352@reddit (OP)
haha I must be honest. I am quite senior and I can't reveal my identity.