FP4 inference in llama.cpp (NVFP4) and ik_llama.cpp (MXFP4) landed - Finally

Posted by Usual-Carrot6352@reddit | LocalLLaMA | View on Reddit | 53 comments

Both llama.cpp and ik_llama.cpp now have FP4 support — but with different flavors worth knowing about.

llama.cpp recently merged NVFP4 (Nvidia's block-scaled FP4, `GGML_TYPE_NVFP4 = 40`), with CUDA kernels landing in `mmq.cuh`, `mmvq.cu`, `convert.cu` and others.

ik_llama.cpp has had MXFP4 (`GGML_TYPE_MXFP4 = 39`) since PR #682 — the MX-standard FP4 used in Meta's gpt-oss models. Coverage is actually broader: CPU (AVX2, NEON, Zen4), CUDA, are all implemented.

They're not the same wire format — NVFP4 is Nvidia-specific E4M3 with block scaling, MXFP4 follows the MX consortium standard — but both land in the 4-bit float regime and should bring meaningful VRAM savings once model support catches up.

Verified by grepping both repos locally today.

My specs: 5090(24GB VRAM)

Go grab and play with models:
https://huggingface.co/models?num_parameters=min:0,max:64B&sort=modified&search=NVFP4

Personal favorite ones:
- Abiray-Qwen3.6-27B-NVFP4
- Qwen3-1.7B-NVFP4A16
- Qwen3.5-2B-NVFP4
- gemma-4-31B-it-NVFP4-turbo-GGUF
- Qwen3-0.6B-FP4

Exciting times for quantization.

[-]

x8code@reddit

This is HUGE! I've been waiting for this. I'm running an RTX 5080 + 5060 Ti 16 GB, for a total of 32 GB (minus Win11 overhead) VRAM.

I really want to try out NVFP4 models, because they're supposed to perform ridiculously well on Blackwell GPUs.

[-]

Bubbly-Staff-9452@reddit

What do you think about your GPU setup? I’ve thought about GPU going the same route. I bought a 5080 for gaming before I got into LLM hosting and now I don’t know if I should get a 5060 Ti to go along with it or just sell my 5080 and buy the cheapest 5090 I can. I know what would be better/more convenient but it’s so much money for a 5090.

[-]

x8code@reddit

And yeah, don't waste your money paying double on an RTX 5090. Just get dual RTX 5080 or something like that.

[-]

oxygen_addiction@reddit

You can literally get a 5090 for the price of 2x 5080 (at least in Europe) and the memory bandwith of the 5090 is twice that of the 5080, so token generation will be way faster across the board.

If you can get a 5090, do that, as you will be able to get more for cheaper in the future or add RTX6000, etc.

It's way more future proof due to the high bandwith + high VRAM combo.

[-]

x8code@reddit

2x RTX 5080 = $2400 vs. 1x RTX 5090 = $3800 if you're lucky

Explain how that is the "same" price?

[-]

oxygen_addiction@reddit

Ah, my bad. You are 100% right.

For some reason I thought the 5080RTX was way more expensive now. Found them for around 1189E on MindFactory, so your numbers are accurate. Cheers.

[-]

x8code@reddit

Respect. Cheers mate

[-]

oxygen_addiction@reddit

Much love.

[-]

x8code@reddit

It's freaking awesome. I was actually thinking about buying another RTX 5060 Ti 16 GB or RTX 5080, so I could go a bit higher. I would have to move my 5060 Ti 16 GB outside the case and have two connected via Oculink though. Only the primary 5080 would fit in the case / PCIe slots. I would have to replace the secondary GPU with an Oculink PCIe card.

[-]

No_Conversation9561@reddit

Did you get any speed up? I have the same setup as yours and don’t see much or any improvement.

[-]

x8code@reddit

I haven't tried yet unfortunately. Just haven't had time yet.

[-]

HopePupal@reddit

you can do that today with vLLM, which already supports NVFP4

[-]

Usual-Carrot6352@reddit (OP)

The output quality is amazing using default settings.

[-]

x8code@reddit

Isn't it more about speed though? I was watching a video on Aleksander Ziskind's channel, and NVFP4 seemed to multiply performance many times over, eg. 60 --> 1200 TPS.

[-]

Pineapple_King@reddit

Jesus christ, cant you just write what any of this chinese means?? what is NVFP4 and why would i want it?

[-]

ResidentPositive4122@reddit

used in Meta's gpt-oss models.

bruh... that's why we check what the LLMs output!

[-]

Usual-Carrot6352@reddit (OP)

I must be honest I quickly vibe-prep that post via Opus4.6

[-]

LetsGoBrandon4256@reddit

Imagine coming here to learn about new stuff then realize OP is posting unchecked AI slop.

[-]

Usual-Carrot6352@reddit (OP)

Don't gas light here nothing is OP, a word corrected.

[-]

LetsGoBrandon4256@reddit

Meta's gpt-oss models

Yeah sure thing.

[-]

Usual-Carrot6352@reddit (OP)

Go. Play someonewhere else kid

[-]

LetsGoBrandon4256@reddit

lol are you seriously getting upset because you got called out for slop-posting?

[-]

Bootes-sphere@reddit

NVFP4 and MXFP4 landing simultaneously is huge — this is the kind of infrastructure progress that actually matters for local inference. The quantization wars are finally settling into something practical.

Real talk though: FP4 is still early enough that you'll see variance in quality depending on model architecture. Some layers handle it better than others. If you're experimenting, start with smaller models first (7B range) to see if the perplexity hit is acceptable for your use case, then scale up.

The memory savings are legit — you're looking at roughly 2x reduction over FP8 on consumer GPUs. But don't expect "just works" across every model yet. Torch compatibility and kernel optimization are still catching up.

Are you planning to test this on specific hardware, or just curious about the general capability? The hardware matters *a lot* for how these quantizations actually perform.

[-]

Bootes-sphere@reddit

This is huge for local inference efficiency! FP4 quantization hitting llama.cpp means you can run significantly larger models on consumer hardware with minimal quality loss. The speed improvements should be noticeable too, especially on older GPUs that struggle with standard precision formats. If you're experimenting with different model sizes and providers to find your sweet spot, tools like our AI Leak Checker (aisecuritygateway.ai/ai-leak-checker) can help you safely test prompts without accidentally leaking sensitive data during benchmarking.

[-]

Dany0@reddit

FYI This is just compatibility, not speedup, yet. There is also a known bug causing a tiny 2% ppl loss

Real nvfp4 with speedup is coming

[-]

Dany0@reddit

By speedup I mean leveraging blackwell's hw level support. Right now it just gets converted to fp16 on inference

[-]

_FlyingWhales@reddit

If you have a *real* blackwell card with TMEM etc. the chances are that you are not going to use llama.cpp anyway.

[-]

Dany0@reddit

My 5090 is just as real blackwell as any other no Jensen will take that away from me 😠😡

[-]

MelodicRecognition7@reddit

but he took already...

[-]

Dany0@reddit

I am more powerful than Jensen, the jacket is a monkey paw curse. I have something jensen will never have, love

[-]

_FlyingWhales@reddit

Lol ok buddy

[-]

Usual-Carrot6352@reddit (OP)

share more plz. thank u

[-]

Dany0@reddit

You get the vram savings + ability to run those models on vulkan and non blackwell cuda, but not speedup on blackwell gpus

[-]

Usual-Carrot6352@reddit (OP)

thanks! you're right just checked:
NOT getting true FP4 compute
✔️ Getting FP4 storage + dequantization → higher precision compute

[-]

Dany0@reddit

Right I checked out the PR again, and you do actually get a speed boost but only on prefill/prompt processing and only on 30-40k ctx+

[-]

Usual-Carrot6352@reddit (OP)

hmm

[-]

HopePupal@reddit

any idea if the llamas are planning FP8 paths for Ada and RDNA 4 cards?

[-]

rerri@reddit

With this PR there is a significant speedup for PP on Blackwell cards when using NVFP4 quants:

https://github.com/ggml-org/llama.cpp/pull/22196

On a 5090 the gains I've seen are about 66% with Gemma 4 31B, a little over 50% with Qwen 27B and with something like 33% with Qwen 35B MoE.

Hoping a PR for TG acceleration will land soon.

[-]

FullOf_Bad_Ideas@reddit

how's the impact on quality like?

Quantizing activations to 4-bits has usually a VERY degrading effect on quality, so I'd assume that unless QAT is done, model will be unusably bad.

Benchmarks on linked AxionML/Qwen3.5-2B-NVFP4 look totally fake - probably just an original number scaled by a certain factor to appear realistic, but I really doubt they ran all of those benchmarks.

Benchmarks on small Qwen 3 1.7B NVFP A16, which doesn't get the speed benefit as it still uses 16-bit activations, look pretty bad.

| Category | Metric | Qwen/Qwen3-1.7B | Qwen3-1.7B-NVFP4A16 (this model) | Recovery (%) |
|----------|--------|-----------------|-----------------------------------|--------------|
| General Knowledge | MMLU-Redux (T) | NA | 65.73% | NA |
| General Knowledge | MMLU-Redux | 64.4% | 55.23% | 85.8% |
| Mathematical Reasoning | Math500 (T) | 93.4% | 89.6% | 95.9% |
| Mathematical Reasoning | Math500 | 73% | 70% | 95.9% |
| Instruction Following | IFEval(Strict Prompt Level Acc) | 68.2% | 66.17% | 97.0% |
| Long Context | RULER-NIAH-32k | NA | 76.21% | NA |
| Coding | LiveCodeBench (2410-2502)(T) Pass@1 | 33.2% | 29.75% | 89.6% |
| Coding | LiveCodeBench (2410-2502) Pass@1 | 11.6% | 6.25% | 53.8% |

And 4-bit activation quant would be even worse. It would be awesome if someone with this hardware would do PPL and KLD testing of those quants when ran with W4A4 scheme, it's probably already done and buried somewhere on Github

[-]

MelodicRecognition7@reddit

I think there is no point to quantize 27-31B models to 4 bit but this might be beneficial for 200B+ models like MiniMax or larger Qwens. Or Kimi if someone is going to re-quantize it

[-]

FullOf_Bad_Ideas@reddit

I actually found someone making KLD measurements of bigger qwens, including NVFP4.

https://old.reddit.com/r/LocalLLaMA/comments/1roz3yl/if_youre_using_nvidias_nvfp4_of_qwen35397_try_a/

Nvidia/Qwen3.5-397B-A17B-NVFP4 has KLD of about 0.11

That's very similar to my 2.5bpw EXL3 quant - https://huggingface.co/cpral/Qwen3.5-397B-A17B-exl3 - which is obviously not great but if you don't have VRAM for a bigger quant it's fine.

NVFP4 as it's sold for those quants is a marketing trick.

Read the thread and comments by OP - you need additional training to make quants good, and that's a thousand USD for a 120B model. So probably 10-100k USD to make good Qwen/Minimax NVFP4 quant....

In llama.cpp thread about NVFP4 ggerganov did claim it's mostly something for use with models trained in NVFP4, not converted to NVFP4, and I think people miss it and jump on it blindly.

[-]

marscarsrars@reddit

Absolutely delicious.

[-]

Purple-Programmer-7@reddit

For anyone reading this who is contributing to llama.cpp, THANK YOU! 🙏

[-]

Chance_Value_Not@reddit

Why do you have 24gb vram on your 5090

[-]

Usual-Carrot6352@reddit (OP)

becaude i have two ba**s :p

[-]

Chance_Value_Not@reddit

One ball if its a 5090M ;)

[-]

InformationSweet808@reddit

Good breakdown. So right now it’s basically VRAM savings > speed, with prefill gains only kicking in at high ctx. The real question is: once kernels mature, does NVFP4 actually beat MXFP4 in end-to-end latency or is this just Nvidia lock-in with marginal upside?

[-]

Usual-Carrot6352@reddit (OP)

I did it myself :D

[-]

jacek2023@reddit

congratulations on your blurring skills, but why?

[-]

Usual-Carrot6352@reddit (OP)

haha I must be honest. I am quite senior and I can't reveal my identity.