Nemotron 3 Super - large quality difference between llama.cpp and vLLM?

Posted by BigStupidJellyfish_@reddit | LocalLLaMA | View on Reddit | 24 comments

Hey all,

I have a private knowledge/reasoning benchmark I like to use for evaluating models. It's a bit over 400 questions, intended for non-thinking modes, programmatically scored. It seems to correlate quite well with the model's quality, at least for my use cases. Smaller models (24-32B) tend to score ~40%, larger ones (70B dense or somewhat larger MoEs) often score ~50%, and the largest ones I can run (Devstral 2/low quants of GLM 4.5-7) get up to ~60%.

On launch of Nemotron 3 Super it seemed llama.cpp support was not instantly there, so I thought I'd try vLLM to run the NVFP4 version. It did surprisingly well on the test: 55.4% with 10 attempts per question. Similar score to GPT-OSS-120B (medium/high effort). But running the model on llama.cpp, it does far worse: 40.2% with 20 attempts per question (unsloth Q4_K_XL).

My logs for either one look relatively "normal." Obviously more errors with the gguf (and slightly shorter responses on average), but it was producing coherent text.

The benchmark script passes `{"enable_thinking": false}` either way to disable thinking, sets temperature 0.7, and otherwise leaves most parameters at about their defaults. I reran the test in llama.cpp with nvidia's recommended temperature 1.0 and saw no difference. In general, I haven't found temperature to have a significant impact on this test. They also recommend top-p 0.95 but that seems to be the default anyway.

I generally see almost no significant difference between Q4_\*, Q8_0, and F16 ggufs, so I doubt there could be any inherent "magic" to NVFP4 making it do this much better. Also tried bartowski's Q4_K_M quant and got a similar ~40% score.

Fairly basic launch commands, something like: `vllm serve "unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4" --port 8080 --trust-remote-code --gpu-memory-utilization 0.85` and `llama-server -c (whatever) -m NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL.gguf`.

So, the question: Is there some big difference in other generation parameters between these that I'm missing and that might be causing this, or another explanation? I sat on this for a bit in case there was a bug in initial implementations, but I'm not seeing any changes with newer versions of llama.cpp.

I tried a different model to narrow things down:

- koboldcpp, gemma 3 27B Q8: 40.2%
- llama.cpp, gemma 3 27B Q8: 40.6%
- vLLM, gemma 3 27B F16: 40.0%

Pretty much indistinguishable, which is the sort of thing I'd expect to see (5 attempts/question for each set here).

Using vllm 0.17.1, llama.cpp 8522.
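
For anyone trying to reproduce the comparison, here is a minimal sketch of the kind of request described above, assuming both servers expose an OpenAI-compatible `/v1/chat/completions` endpoint. The helper names `build_request` and `ask`, the base URL, and the optional `model` argument are illustrative assumptions, not the actual benchmark script.

```python
import json
import urllib.request


def build_request(question, temperature=0.7, max_tokens=512, model=None):
    """Build an OpenAI-style chat completion body with thinking disabled."""
    body = {
        "messages": [{"role": "user", "content": question}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        # Same template switch for both backends, as in the post.
        "chat_template_kwargs": {"enable_thinking": False},
    }
    if model is not None:
        # Assumption: some servers want an explicit model name in the body.
        body["model"] = model
    return body


def ask(base_url, body, timeout=300.0):
    """POST the body to <base_url>/v1/chat/completions and return the reply text."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Sending the identical body to each server, for example `ask("http://localhost:8080", build_request("Evaluate: 12!!"))`, keeps the request side constant, so any remaining difference comes from the backend's templating, sampling defaults, or weights.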

mrtrly@reddit

The tokenizer theory tracks. I ran into something similar with a different model where llama.cpp was silently using a slightly different chat template, and output quality tanked without any obvious reason. Perplexity looked fine, but reasoning tasks broke. Grab the HF tokenizer directly and test a few prompts side by side, that'll tell you fast if it's the culprit.

Conscious_Cut_6144@reddit

Just ran nvfp4 and unsloths q4-k-xl through my benchmark. GGUF scored 1% higher for me. When you say 20 attempts, are you giving it 20 chances to get it right once, or just picking the most common answer during the 20 attempts?

BigStupidJellyfish_@reddit (OP)

Interesting! Is that with or without reasoning? I wonder if that could be a factor. I tried running it with thinking on this test, but cancelled early on as it was consistently misunderstanding the directions in a way I've otherwise only seen from Qwen3.5 0.8B.

I had put off running the gguf on another, much easier with-reasoning test I have because of how much slower it runs, but it's still doing noticeably worse there: 96% (x5 samples/question) with vllm/NVFP4 vs 75% (x1 sample) with llama.cpp/Q4_K_XL. "Decent" reasoning models typically score 90-97%. Though with fewer questions (113) and attempts I can't make too strong of a claim.

On the attempts: it's the average score over that many attempts (each attempt is binary correct/incorrect); I couldn't compare the scores otherwise. To be fair, the correlation within each question is relatively high, so extra attempts don't increase statistical power as much as you might like, but there are enough questions+attempts for it to be fairly stable.
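
A minimal sketch of the scoring described above, assuming each question has the same number of binary attempts. Because attempts on one question are correlated, this version treats the question (not the attempt) as the independent unit when estimating the standard error. The function name and data layout are illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def score_with_se(results):
    """results: one list of 0/1 outcomes per question.

    Returns (mean score, standard error), where the standard error is
    computed across per-question means so that correlated repeat
    attempts on the same question are not counted as independent.
    """
    per_question = [mean(attempts) for attempts in results]
    score = mean(per_question)
    se = stdev(per_question) / sqrt(len(per_question))
    return score, se


# Toy data: 4 questions x 2 attempts each.
score, se = score_with_se([[1, 1], [0, 0], [1, 0], [1, 1]])
print(f"score={score:.3f} se={se:.3f}")
```

With this estimator, adding attempts per question only helps until the per-question means stabilise; after that, more questions are what shrink the error bar.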

ikkiho@reddit

my bet is its a chat template / tokenizer issue rather than quant quality. vllm loads the native HF tokenizer and all the trust-remote-code stuff directly from the repo, while llama.cpp has to reimplement all of that. for a model this new with custom code theres a decent chance the gguf conversion or the template in llama.cpp is slightly off, especially around how thinking mode gets disabled. 15 percentage points is just way too big to be quant degradation alone when your own data shows Q4 vs F16 is normally 1-2% apart. id try comparing the actual prompts being sent to each backend token by token if you can, might find something weird in how the system prompt or the enable_thinking flag gets formatted
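
a rough sketch of that comparison, assuming you can get the rendered prompt (or the token id list) out of each backend somehow. the helper names are made up, and the second prompt below is a hypothetical example of what a different template might produce, not actual vllm output:

```python
from itertools import zip_longest

_MISSING = object()  # sentinel marking "one sequence ended early"


def first_divergence(a, b):
    """Index of the first position where a and b differ, or None if equal.

    Works on strings (character level) or lists of token ids.
    """
    for i, (x, y) in enumerate(zip_longest(a, b, fillvalue=_MISSING)):
        if x is _MISSING or y is _MISSING or x != y:
            return i
    return None


def show_divergence(a, b, context=20):
    """Describe where two rendered prompt strings start to differ."""
    i = first_divergence(a, b)
    if i is None:
        return "identical"
    lo = max(0, i - context)
    return f"differ at {i}: {a[lo:i + context]!r} vs {b[lo:i + context]!r}"


# Start of a llama.cpp-style rendered prompt with an empty system block.
llama_prompt = "<|im_start|>system\n<|im_end|>\n<|im_start|>user\nEvaluate: 12!!"
# Hypothetical rendering with no system block, for illustration only.
other_prompt = "<|im_start|>user\nEvaluate: 12!!"

print(show_divergence(llama_prompt, other_prompt))
```

running the same function over the two token id lists instead of the strings would also catch cases where the text matches but the tokenization doesnt.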

BigStupidJellyfish_@reddit (OP)

Tests use the /v1/chat/completions endpoint, if that makes a difference. It is at least "not thinking" properly in both llama.cpp & vllm. I've left the system prompt empty in all cases - maybe the jinja template is overriding that and causing issues? Though when I've experimented with random system prompts in the past they haven't had effects like this on the score.

Using a demo question ("Evaluate: 12!!") and `--verbose` in llama.cpp (apologies in advance for the walls of text), looks something like this:

Attempt 1 (correct):

```
Parsed message: {"role":"assistant","content":"The double factorial notation $n!!$ means the product of all integers from $n$ down to 1 that have the same parity (odd or even) as $n$. For an even number like 12, $12!! = 12 \\times 10 \\times 8 \\times 6 \\times 4 \\times 2$. Calculating step by step: $12 \\times 10 = 120$, $120 \\times 8 = 960$, $960 \\times 6 = 5760$, $5760 \\times 4 = 23040$, and $23040 \\times 2 = 46080$.\n\nFinal answer: 46080"}
srv stop: all tasks already finished, no need to cancel
res remove_waiti: remove task 0 from waiting list. current waiting = 1 (before remove)
srv stop: all tasks already finished, no need to cancel
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv log_server_r: request: {"messages": [{"role": "user", "content": "Evaluate: 12!!\n\nGive a brief explanation in one paragraph or less (if required). Then, on a new line, clearly write: Final answer: [your answer]."}], "max_tokens": 512, "temperature": 1.0, "chat_template_kwargs": {"enable_thinking": false}}
srv log_server_r: response: {"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"The double factorial notation $n!!$ means the product of all integers from $n$ down to 1 that have the same parity (odd or even) as $n$. For an even number like 12, $12!! = 12 \\times 10 \\times 8 \\times 6 \\times 4 \\times 2$. Calculating step by step: $12 \\times 10 = 120$, $120 \\times 8 = 960$, $960 \\times 6 = 5760$, $5760 \\times 4 = 23040$, and $23040 \\times 2 = 46080$.\n\nFinal answer: 46080"}}],"created":1774736987,"model":"NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q6_K_XL.gguf","system_fingerprint":"aaa","object":"chat.completion","usage":{"completion_tokens":170,"prompt_tokens":53,"total_tokens":223,"prompt_tokens_details":{"cached_tokens":0}},"id":"chatcmpl-aaa","__verbose":{"index":0,"content":"The double factorial notation $n!!$ means the product of all integers from $n$ down to 1 that have the same parity (odd or even) as $n$. For an even number like 12, $12!! = 12 \\times 10 \\times 8 \\times 6 \\times 4 \\times 2$. Calculating step by step: $12 \\times 10 = 120$, $120 \\times 8 = 960$, $960 \\times 6 = 5760$, $5760 \\times 4 = 23040$, and $23040 \\times 2 = 46080$.\n\nFinal answer: 46080","tokens":[],"id_slot":15,"stop":true,"model":"NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q6_K_XL.gguf","tokens_predicted":170,"tokens_evaluated":53,"generation_settings":{"seed":4294967295,"temperature":1.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":1024,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":512,"n_predict":512,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[12,13,14,15,1062],"chat_format":"peg-native","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"<|im_start|>assistant\n<think></think>","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"speculative.type":"none","speculative.ngram_size_n":1024,"speculative.ngram_size_m":1024,"speculative.ngram_m_hits":1024,"timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\n<|im_end|>\n<|im_start|>user\nEvaluate: 12!!\n\nGive a brief explanation in one paragraph or less (if required). Then, on a new line, clearly write: Final answer: [your answer].<|im_end|>\n<|im_start|>assistant\n<think></think>","has_new_line":true,"truncated":false,"stop_type":"eos","stopping_word":"","tokens_cached":222,"timings":{"cache_n":0,"prompt_n":53,"prompt_ms":1995.178,"prompt_per_token_ms":37.644867924528306,"prompt_per_second":26.56404591470034,"predicted_n":170,"predicted_ms":10560.016,"predicted_per_token_ms":62.11774117647059,"predicted_per_second":16.098460456878094}},"timings":{"cache_n":0,"prompt_n":53,"prompt_ms":1995.178,"prompt_per_token_ms":37.644867924528306,"prompt_per_second":26.56404591470034,"predicted_n":170,"predicted_ms":10560.016,"predicted_per_token_ms":62.11774117647059,"predicted_per_second":16.098460456878094}}
```

Attempt 2 (incorrect):

```
Parsed message: {"role":"assistant","content":"12"}
srv stop: all tasks already finished, no need to cancel
res remove_waiti: remove task 172 from waiting list. current waiting = 1 (before remove)
srv stop: all tasks already finished, no need to cancel
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv log_server_r: request: {"messages": [{"role": "user", "content": "Evaluate: 12!!\n\nGive a brief explanation in one paragraph or less (if required). Then, on a new line, clearly write: Final answer: [your answer]."}], "max_tokens": 512, "temperature": 1.0, "chat_template_kwargs": {"enable_thinking": false}}
srv log_server_r: response: {"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"12"}}],"created":1774737044,"model":"NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q6_K_XL.gguf","system_fingerprint":"aaa","object":"chat.completion","usage":{"completion_tokens":3,"prompt_tokens":53,"total_tokens":56,"prompt_tokens_details":{"cached_tokens":0}},"id":"chatcmpl-aaa","__verbose":{"index":0,"content":"12","tokens":[],"id_slot":15,"stop":true,"model":"NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q6_K_XL.gguf","tokens_predicted":3,"tokens_evaluated":53,"generation_settings":{"seed":4294967295,"temperature":1.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":1024,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":512,"n_predict":512,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[12,13,14,15,1062],"chat_format":"peg-native","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"<|im_start|>assistant\n<think></think>","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"speculative.type":"none","speculative.ngram_size_n":1024,"speculative.ngram_size_m":1024,"speculative.ngram_m_hits":1024,"timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\n<|im_end|>\n<|im_start|>user\nEvaluate: 12!!\n\nGive a brief explanation in one paragraph or less (if required). Then, on a new line, clearly write: Final answer: [your answer].<|im_end|>\n<|im_start|>assistant\n<think></think>","has_new_line":false,"truncated":false,"stop_type":"eos","stopping_word":"","tokens_cached":55,"timings":{"cache_n":0,"prompt_n":53,"prompt_ms":1691.93,"prompt_per_token_ms":31.923207547169813,"prompt_per_second":31.325173027252898,"predicted_n":3,"predicted_ms":119.73,"predicted_per_token_ms":39.910000000000004,"predicted_per_second":25.056376847907792}},"timings":{"cache_n":0,"prompt_n":53,"prompt_ms":1691.93,"prompt_per_token_ms":31.923207547169813,"prompt_per_second":31.325173027252898,"predicted_n":3,"predicted_ms":119.73,"predicted_per_token_ms":39.910000000000004,"predicted_per_second":25.056376847907792}}
```

A 3rd run also ended up wrong, though I might be at a character limit here. Same prompt templating as my full benchmark; the vllm/NVFP4 version was consistently correct on this question.
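
For reference, the expected answer to the demo question can be checked with a few lines of Python. This is just a sketch of the double factorial as the correct attempt describes it (product of every second integer down from n); the function name is illustrative.

```python
def double_factorial(n):
    """Product of n, n-2, n-4, ... down to 2 (even n) or 1 (odd n)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


print(double_factorial(12))  # 12 * 10 * 8 * 6 * 4 * 2 = 46080
```

So the first attempt's 46080 is right, and the bare "12" from the second attempt is not a reading of the notation at all.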

a_beautiful_rhind@reddit

You need to find a way for it to dump the actual templated text it's sending. I compile the server with debug and add --verbose then I get it. Failing that, use text completions.

BigStupidJellyfish_@reddit (OP)

Unless I'm misunderstanding, I think that's in the last line? `[...],"prompt":"<|im_start|>system\n<|im_end|>\n<|im_start|>user\nEvaluate: 12!!\n\nGive a brief explanation[...]` (The generic formatting instructions there are what the test used.)

On perplexity, not sure how to do it with the NVFP4 version, but using llama-perplexity on wikitext.valid.raw from https://huggingface.co/datasets/ggml-org/ci, I get:

Final estimate: PPL = 4.9307 +/- 0.03242 (NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL.gguf)

For a similarly sized model, Qwen3.5 122B Q4 got 5.8623, so it seems normal enough there.
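
For context on what that number measures: perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch of the quantity itself, assuming you already have natural-log probabilities for each token (this is the textbook definition, not llama.cpp's implementation):

```python
from math import exp, log


def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    return exp(-sum(token_logprobs) / len(token_logprobs))


# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([log(0.25)] * 4))
```

Since the measurement is taken over raw text, a normal-looking value here says little about whether the chat template path behaves correctly.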

a_beautiful_rhind@reddit

Yea that part. I didn't see it in your pastes.

dreamkast06@reddit

Nemotron 3 Super was trained with NVFP4; not quantized to NVFP4, trained with NVFP4. Any of the GGUF will be upscaled to BF16, then quantized down, resulting in the terrible degradation. Until there is native NVFP4 in llama.cpp, the model won't work as intended, similar to how GPT-OSS won't function properly without the weights being MXFP4.

Conscious_Cut_6144@reddit

How recent is your copy of Q4_K_XL? Wasn't this the model that had quant issues the first day?

BigStupidJellyfish_@reddit (OP)

Yeah, I think it took a day or two for its support to be merged. My copies of the Unsloth Q4_K_XL & Bartowski Q4_K_M were both downloaded Mar 14. From what I see on hf, their files were last updated [Mar 11](https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF/tree/main/UD-Q4_K_XL) and [Mar 12](https://huggingface.co/bartowski/nvidia_Nemotron-3-Super-120B-A12B-GGUF/tree/main/nvidia_Nemotron-3-Super-120B-A12B-Q4_K_M) respectively so they should be the latest.

-_Apollo-_@reddit

Did you also experience this with qwen 3.5 27b?

kevin_1994@reddit

interesting. nvfp4 for this model in particular is supposed to be close to bf16 according to the benchmarks. on huggingface you can see nvfp4 apparently beats bf16 in about half the benches. I feel like this is therefore probably just the degradation from q4, no?

BobbyL2k@reddit

Nemotron Super is natively trained in NVFP4. So the NVFP4 version is the original weight produced from training. The high precision versions are there for hardware that does not support running NVFP4. NVFP4 is not close to BF16. BF16 is close to NVFP4.

BigStupidJellyfish_@reddit (OP)

I unfortunately can't fully reject that as I can't run the F16 version to confirm, but I've never seen a score differential like this unless talking about stuff like IQ2_XXS quants, so I'm suspicious. Under llama.cpp it's scoring similarly to Mistral 3.2 24B, Gemma3 27B, etc, which feels very wrong.

Usually Q4 vs Q8 vs F16 is within 1 or 2% of each other on this test, sometimes the opposite of how you'd expect due to minor variance (e.g., the Gemma 3 results above). Another example of the typical score spread on an earlier version of the test:

- Qwen_Qwen3-4B-Instruct-2507-bf16: 25.2% (957/3800)
- Qwen_Qwen3-4B-Instruct-2507-Q8_0: 25.1% (954/3800)
- Qwen_Qwen3-4B-Instruct-2507-Q4_K_M: 24.2% (918/3800)
- Qwen_Qwen3-4B-Instruct-2507-IQ4_XS: 23.6% (898/3800)

(380 questions x 10 attempts, all in llama.cpp)
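
A quick sketch that recomputes those percentages from the raw counts and attaches a naive binomial standard error to each. The helper name is illustrative, and the error bars treat all 3800 attempts as independent, which ignores the within-question correlation, so the true uncertainty is likely somewhat larger.

```python
from math import sqrt


def rate_and_se(correct, total):
    """Proportion correct and its naive binomial standard error."""
    p = correct / total
    return p, sqrt(p * (1 - p) / total)


# Counts taken from the list above (380 questions x 10 attempts each).
counts = {
    "bf16": 957,
    "Q8_0": 954,
    "Q4_K_M": 918,
    "IQ4_XS": 898,
}

for name, correct in counts.items():
    p, se = rate_and_se(correct, 3800)
    print(f"{name}: {100 * p:.1f}% +/- {100 * se:.1f}")
```

On these numbers the gaps between neighbouring quants are about the same size as the error bars, which is consistent with calling them minor variance.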

StardockEngineer@reddit

I noticed the same thing.

a_beautiful_rhind@reddit

Did you check PPL between them? Is it normal?

ilintar@reddit

Interesting. Will check.

BigStupidJellyfish_@reddit (OP)

Much appreciated, and thank you for your contributions in general :) I unfortunately can't provide the full dataset, but if helpful I ran a demo question [in verbose mode here](https://reddit.com/r/LocalLLaMA/comments/1s69tfk/nemotron_3_super_large_quality_difference_between/od1a8tn/). The NVFP4/vllm model setup gets that one correct consistently.

ortegaalfredo@reddit

Hundreds of variables could be to blame. For example, you may be using the NVFP4 quant from NVIDIA, and they know how to do quants properly, whereas there are many recipes for making a Q4 and each behaves a little differently. To be sure, I would use at least a Q6 for llama.cpp.

BigStupidJellyfish_@reddit (OP)

Fair. Runs slower from offloading so only 3 samples per question but I'm seeing a similarly low score: 41.6%, freshly downloaded unsloth Q6_K_XL/llama.cpp. (Now noticing the Q8 isn't much larger and may be a better reference point, but that would take a while more to download/test.)
