Anyone else been using the new nvidia/Llama-3_3-Nemotron-Super-49B-v1_5 model?
Posted by kevin_1994@reddit | LocalLLaMA | 28 comments
It's great! A clear step above Qwen3 32B imo. I'd recommend trying it out.
My experience with it:

- it generates far less "slop" than Qwen models
- it handles long context really well
- it easily handles trick questions like "What should be the punishment for looking at your opponent's board in chess?"
- it handled all my coding questions really well
- it has a weird-ass architecture where some layers don't have attention tensors, which messed up llama.cpp's tensor split allocation, but that was pretty easy to overcome (see the sketch below)
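For the curious, here's a minimal sketch of the kind of manual split I mean, assuming llama-cpp-python and a local GGUF (the filename and split ratios are illustrative placeholders, not my exact config):

```python
# Sketch only: model path and per-GPU ratios are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3_3-Nemotron-Super-49B-v1_5-Q8_0.gguf",  # hypothetical path
    n_gpu_layers=-1,  # offload every layer to GPU
    # Hand-tuned per-GPU proportions: the automatic split assumes every
    # layer carries attention tensors, which this architecture doesn't,
    # so the default allocation over/under-fills some cards.
    tensor_split=[0.35, 0.35, 0.15, 0.15],
    n_ctx=65536,
)
```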
My daily driver for a long time was Qwen3 32B FP16, but this model at Q8 has been a massive step up for me, and I'll be using it going forward.
Anyone else tried this bad boy out?
rerri@reddit
Having difficulty getting it to bypass thinking. /no_think in the system prompt does not work.
Something like this in system prompt works sometimes but definitely not 100%: "You must never think step-by-step. Never use XML tags in your response. Just give the final response immediately."
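For reference, this is roughly the shape of the request I'm sending (any OpenAI-compatible endpoint; the base URL and model name here are placeholders for whatever your server exposes):

```python
from openai import OpenAI

# Placeholder endpoint and model name; adjust to your local server.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="llama-3.3-nemotron-super-49b-v1.5",
    messages=[
        {"role": "system", "content": "/no_think"},  # ignored more often than not
        {"role": "user", "content": "What's the capital of France?"},
    ],
)
print(resp.choices[0].message.content)
```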
Background-Gold-9882@reddit
Having this same issue. Did anyone solve this? This is my new favorite model, but I'd like to speed it up for simpler queries. I'm running it in Ollama, using this model: https://ollama.com/andiariffin/llama-3.3-nemotron-super-v1.5-q4km (not sure how its creator converted it to Ollama, but it seems to work really well).
On the Nvidia site it seems to work as intended, but I'm not sure what the checkbox toggle actually does: https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1_5
ttkciar@reddit
Have you tried pre-populating the reply with empty think tags?
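Something like this with a raw completion call, i.e. close the think block before the model gets a chance to open it (the prompt format here is Llama-3 style and the port is a placeholder; adjust to whatever your server expects):

```python
import requests

# Llama-3-style chat template written out by hand, so we can prefill the
# assistant turn with an already-closed think block.
prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "What's the capital of France?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "<think>\n</think>\n\n"  # empty think block: model should skip straight to the answer
)

# llama.cpp server's raw completion endpoint.
resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": prompt, "n_predict": 256},
)
print(resp.json()["content"])
```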
-dysangel-@reddit
Given that the model is trained with "thinking" on, I'd have thought trying to force it not to think might take it out of distribution? Have you tried asking it not to "overthink"? I remember that worked ok for Qwen3 in my tests when I felt it was going overboard
rerri@reddit
Like Qwen3, this model is supposed to be a hybrid reasoning/non-reasoning model. Having /no_think in the system prompt is supposedly the intended way to disable thinking. Quoting the model card:
The model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities.
-dysangel-@reddit
Ah ok. I tried /no_think with its predecessor and it didn't disable it, so I assumed their RL/fine-tuning had all been done with thinking enabled, even if the base model had a "no think" mode.
Evening_Ad6637@reddit
The predecessor used something else to disable thinking mode: a plain system prompt along the lines of "detailed thinking on" / "detailed thinking off", if I remember right.
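I.e., something like this (from memory, so verify against the v1 model card):

```python
# v1-era toggle (as I recall it): the switch was a plain system prompt,
# not a /no_think tag.
messages = [
    {"role": "system", "content": "detailed thinking off"},
    {"role": "user", "content": "What's the capital of France?"},
]
```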
jacek2023@reddit
It's currently the most powerful dense model (excluding >100B models, which are unusable at home). Check the previous discussions about it.
kevin_1994@reddit (OP)
I was reading those discussions, but most devolved into accusing NVIDIA of benchmaxxing. Just thought I'd share some positive thoughts on the model here.
matznerd@reddit
I appreciate the details! This model is now top on DeepResearch Bench https://x.com/rohanpaul_ai/status/1952584208993443987
EnnioEvo@reddit
Is it better than Magistral?
Paradigmind@reddit
What's good about Magistral? Genuinely asking.
EnnioEvo@reddit
Italian
kevin_1994@reddit (OP)
I didn't have a good experience with Magistral. I think the new Mistral models are good for agentic flows but borderline useless for anything else, as their param count and knowledge depth are too low and they hallucinate too much. YMMV.
perelmanych@reddit
How would you compare it to Qwen3-235B-A22B-2507, thinking and non-thinking variants? Honestly, I am a bit disappointed with the Qwen3-235B-A22B-2507 models, at least in terms of academic writing. I think they are overhyped. DS-V3-0324 is much better for my use case; unfortunately, running it locally is out of reach for my hardware.
TokenRingAI@reddit
V3 is just a really good model that sits in R1's shadow.
MichaelXie4645@reddit
Can you elaborate on how it's a clear step up from Qwen3 32B? Like, how is it better? Better at coding, math, reasoning, etc.?
kevin_1994@reddit (OP)
Hmm
Sorry if this is less than scientific but...
The way I'd summarize the model is: if Llama 3 70B and QwQ had a baby. You get the deeper, less benchmaxxed knowledge of Llama 3, and the rigorous Qwen-style reasoning of QwQ.
MichaelXie4645@reddit
Oh nice, I've been using Qwen3 32B FP8, but how were you getting FP8 of Nemotron? I can't find any FP8 quants. Did you just use vLLM's quantization or something like that?
kevin_1994@reddit (OP)
Yeah, unfortunately it doesn't seem to have many safetensors quants. I'm running Unsloth's Q8_K_XL quant. I prefer their dynamic quants anyway, as they seem to outperform basic FP8 quants in my experience. But yeah, throughput is much lower for sure.
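If you want to grab it, something like this should work (the repo id and filename pattern are from memory; double-check them on Hugging Face before running):

```python
from huggingface_hub import snapshot_download

# Repo id and pattern are from memory; verify on huggingface.co first.
snapshot_download(
    repo_id="unsloth/Llama-3_3-Nemotron-Super-49B-v1_5-GGUF",
    allow_patterns=["*Q8_K_XL*"],  # fetch just the dynamic Q8 quant, not the whole repo
    local_dir="models",
)
```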
MichaelXie4645@reddit
Does it have a thinking / non thinking switch as well?
FullOf_Bad_Ideas@reddit
I tried it out in full glory on an H200 yesterday. It seems really good, and it's probably going to be the most capable model I'll be able to run locally once I get a 4-bit quant (preferably EXL3, GPTQ, or AWQ) running. It's really slow to get anything out of it, though, so I doubt it will work with Cline as well as Qwen3 32B FP8 does. I can wait for 500-1000 reasoning tokens to generate mid-reply, but when it has to generate 15k tokens to accomplish a task, it's no longer as useful as it could be.
kevin_1994@reddit (OP)
I honestly haven't found the reasoning length to be too bad, but you and others have. It reminds me a lot of QwQ.
CaptBrick@reddit
Good to hear. Thanks for sharing. What is your hardware setup and what speed do you get? Also, what context length are you using?
kevin_1994@reddit (OP)
2x3090, 2x3060
Running at Q8 with 17 tok/s tg, 350 tok/s pp.
Using 64k context
NixTheFolf@reddit
I have not yet, but I plan on doing so! I was wondering if you had any experience with the general knowledge of the model? I prefer models that have good world knowledge for their size, which is something Qwen has always struggled with.
kevin_1994@reddit (OP)
It's MUCH better than Qwen, but still more STEM-focused than the base 3.3.
toothpastespiders@reddit
I just checked the GGUFs and got reminded why I never played around with the original very much. I swear, setting up a single-GPU system at the start of all this was one of the biggest tech mistakes I ever made.
That said, thanks for the reminder. I've just started hearing a trickle of good buzz about this. Enough that I do want to give it a shot.