Nemotron 3 Super - large quality difference between llama.cpp and vLLM?
Posted by BigStupidJellyfish_@reddit | LocalLLaMA | View on Reddit | 24 comments
Hey all,
I have a private knowledge/reasoning benchmark I like to use for evaluating models.
It's a bit over 400 questions, intended for non-thinking modes, programatically scored.
It seems to correlate quite well with the model's quality, at least for my usecases.
Smaller models (24-32B) tend to score ~40%, larger ones (70B dense or somewhat larger MoEs) often score ~50%, and the largest ones I can run (Devstral 2/low quants of GLM 4.5-7) get up to ~60%.
On launch of Nemotron 3 Super it seemed llama.cpp support was not instantly there, so I thought I'd try vLLM to run the NVFP4 version.
It did surprisingly well on the test: 55.4% with 10 attempts per question.
Similar score to GPT-OSS-120B (medium/high effort).
But, running the model on llama.cpp, it does far worse: 40.2% with 20 attempts per question (unsloth Q4_K_XL).
My logs for either one look relatively "normal."
Obviously more errors with the gguf (and slightly shorter responses on average), but it was producing coherent text.
The benchmark script passes `{"enable_thinking": false}` either way to disable thinking, sets temperature 0.7, and otherwise leaves most parameters about default.
I reran the test in llama.cpp with nvidia's recommended temperature 1.0 and saw no difference.
In general, I haven't found temperature to have a significant impact on this test.
They also recommend top-p 0.95 but that seems to be the default anyways.
I generally see almost no significant difference between Q4_\*, Q8_0, and F16 ggufs, so I doubt there could be any inherent "magic" to NVFP4 making it do this much better.
Also tried bartowski's Q4_K_M quant and got a similar ~40% score.
Fairly basic launch commands, something like: `vllm serve "unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4" --port 8080 --trust-remote-code --gpu-memory-utilization 0.85` and `llama-server -c (whatever) -m NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL.gguf`.
So, the question: Is there some big difference in other generation parameters between these I'm missing that might be causing this, or another explanation?
I sat on this for a bit in case there was a bug in initial implementations but not seeing any changes with newer versions of llama.cpp.
I tried a different model to narrow things down:
- koboldcpp, gemma 3 27B Q8: 40.2%
- llama.cpp, gemma 3 27B Q8: 40.6%
- vLLM, gemma 3 27B F16: 40.0%
Pretty much indistinguishable. 5 attempts/question for each set here, and the sort of thing I'd expect to see.
Using vllm 0.17.1, llama.cpp 8522.
24 Comments
mrtrly@reddit
Conscious_Cut_6144@reddit
BigStupidJellyfish_@reddit (OP)
ikkiho@reddit
BigStupidJellyfish_@reddit (OP)
a_beautiful_rhind@reddit
BigStupidJellyfish_@reddit (OP)
a_beautiful_rhind@reddit
dreamkast06@reddit
Conscious_Cut_6144@reddit
BigStupidJellyfish_@reddit (OP)
-_Apollo-_@reddit
kevin_1994@reddit
BobbyL2k@reddit
BigStupidJellyfish_@reddit (OP)
StardockEngineer@reddit
a_beautiful_rhind@reddit
ilintar@reddit
BigStupidJellyfish_@reddit (OP)
ortegaalfredo@reddit
BigStupidJellyfish_@reddit (OP)
jacek2023@reddit
Middle_Bullfrog_6173@reddit
ImaginaryBluejay0@reddit