Anyone else been using the new nvidia/Llama-3_3-Nemotron-Super-49B-v1_5 model?
Posted by kevin_1994@reddit | LocalLLaMA | 28 comments
It's great! A clear step above Qwen3 32B imo. I'd recommend trying it out.
My experience with it:

- it generates far less "slop" than Qwen models
- it handles long context really well
- it easily handles trick questions like "What should be the punishment for looking at your opponent's board in chess?"
- it handled all my coding questions really well
- it has a weird-ass architecture where some layers don't have attention tensors, which messed up llama.cpp's tensor split allocation, but that was pretty easy to overcome (see the sketch below)
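For the curious, here's a minimal sketch of the kind of manual split I mean, assuming llama-cpp-python and a local GGUF (the filename and split ratios are illustrative placeholders, not my exact config):

```python
# Sketch only: model path and per-GPU ratios are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3_3-Nemotron-Super-49B-v1_5-Q8_0.gguf",  # hypothetical path
    n_gpu_layers=-1,  # offload every layer to GPU
    # Hand-tuned per-GPU proportions: the automatic split assumes every
    # layer carries attention tensors, which this architecture doesn't,
    # so the default allocation over/under-fills some cards.
    tensor_split=[0.35, 0.35, 0.15, 0.15],
    n_ctx=65536,
)
```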
My daily driver for a long time was Qwen3 32B FP16, but this model at Q8 has been a massive step up for me, and I'll be using it going forward.
Anyone else tried this bad boy out?
rerri@reddit
Having difficulty getting it to bypass thinking. /no_think in the system prompt does not work.
Something like this in system prompt works sometimes but definitely not 100%: "You must never think step-by-step. Never use XML tags in your response. Just give the final response immediately."
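For reference, this is roughly the shape of the request I'm sending (any OpenAI-compatible endpoint; the base URL and model name here are placeholders for whatever your server exposes):

```python
from openai import OpenAI

# Placeholder endpoint and model name; adjust to your local server.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="llama-3.3-nemotron-super-49b-v1.5",
    messages=[
        {"role": "system", "content": "/no_think"},  # ignored more often than not
        {"role": "user", "content": "What's the capital of France?"},
    ],
)
print(resp.choices[0].message.content)
```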
Background-Gold-9882@reddit
Having this same issue. Did anyone solve this? This is my new favorite model, but I'd like to speed it up for simpler queries. I'm running it in Ollama, using this model: https://ollama.com/andiariffin/llama-3.3-nemotron-super-v1.5-q4km (not sure how its creator converted it to Ollama, but it seems to work really well).
On the Nvidia site it seems to work as intended, but I'm not sure what the checkbox toggle actually does: https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1_5
ttkciar@reddit
Have you tried pre-populating the reply with empty think tags?
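Something like this with a raw completion call, i.e. close the think block before the model gets a chance to open it (the prompt format here is Llama-3 style and the port is a placeholder; adjust to whatever your server expects):

```python
import requests

# Llama-3-style chat template written out by hand, so we can prefill the
# assistant turn with an already-closed think block.
prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "What's the capital of France?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "<think>\n</think>\n\n"  # empty think block: model should skip straight to the answer
)

# llama.cpp server's raw completion endpoint.
resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": prompt, "n_predict": 256},
)
print(resp.json()["content"])
```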
-dysangel-@reddit
Given that the model is trained with "thinking" on, I'd have thought trying to force it not to think might take it out of distribution? Have you tried asking it not to "overthink"? I remember that worked ok for Qwen3 in my tests when I felt it was going overboard
rerri@reddit
Like Qwen3, this model is supposed to be a hybrid reasoning/non-reasoning model. Having /no_think in the system prompt is supposedly the intended way to disable thinking. Quoting the model card:
The model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities.
-dysangel-@reddit
Ah ok. I tried /no_think with its predecessor and it didn't disable it, so I assumed their RL/fine-tuning had all been done with thinking enabled, even if the base model had a "no think" mode.
Evening_Ad6637@reddit
The predecessor used something else to disable thinking mode: a plain system prompt along the lines of "detailed thinking on" / "detailed thinking off", if I remember right.
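I.e., something like this (from memory, so verify against the v1 model card):

```python
# v1-era toggle (as I recall it): the switch was a plain system prompt,
# not a /no_think tag.
messages = [
    {"role": "system", "content": "detailed thinking off"},
    {"role": "user", "content": "What's the capital of France?"},
]
```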
jacek2023@reddit
It's currently the most powerful dense model (excluding >100B models, which are unusable at home). Check the previous discussions about it.
kevin_1994@reddit (OP)
I was reading those discussions, but most devolved into accusing NVIDIA of benchmaxxing. Just thought I'd share some positive thoughts on the model here.
matznerd@reddit
I appreciate the details! This model is now top on DeepResearch Bench https://x.com/rohanpaul_ai/status/1952584208993443987
EnnioEvo@reddit
Is it better than Magistral?
Paradigmind@reddit
What's good about Magistral? Genuinely asking.
EnnioEvo@reddit
Italian
kevin_1994@reddit (OP)
I didn't have a good experience with Magistral. I think the new Mistral models are good for agentic flows but borderline useless for anything else, as their param count and knowledge depth are too low and they hallucinate too much. YMMV.
perelmanych@reddit
How would you compare it to Qwen3-235B-A22B-2507, thinking and non-thinking variants? Honestly, I am a bit disappointed with the Qwen3-235B-A22B-2507 models, at least in terms of academic writing. I think they are overhyped. DS-V3-0324 is much better for my use case; unfortunately, running it locally is out of reach for my hardware.
TokenRingAI@reddit
V3 is just a really good model that sits in R1's shadow.
MichaelXie4645@reddit
Can you elaborate on how it's a clear step up from Qwen3 32B? Like, how is it better? Better at coding, math, reasoning, etc.?
kevin_1994@reddit (OP)
Hmm
Sorry if this is less than scientific but...
The way I'd summarize the model is: if Llama 3 70B and QwQ had a baby. You get the deeper, less benchmaxxed knowledge of Llama 3, and the rigorous Qwen-style reasoning of QwQ.
MichaelXie4645@reddit
Oh nice, I've been using Qwen3 32B FP8, but how were you getting FP8 of Nemotron? I can't find any FP8 quants. Did you just use vLLM's quantization or something like that?
kevin_1994@reddit (OP)
Yeah, unfortunately it doesn't seem to have many safetensors quants. I'm running Unsloth's Q8_K_XL quant. I prefer their dynamic quants anyway, as they seem to outperform basic FP8 quants in my experience. But yeah, throughput is much lower for sure.
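If you want to grab it, something like this should work (the repo id and filename pattern are from memory; double-check them on Hugging Face before running):

```python
from huggingface_hub import snapshot_download

# Repo id and pattern are from memory; verify on huggingface.co first.
snapshot_download(
    repo_id="unsloth/Llama-3_3-Nemotron-Super-49B-v1_5-GGUF",
    allow_patterns=["*Q8_K_XL*"],  # fetch just the dynamic Q8 quant, not the whole repo
    local_dir="models",
)
```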
MichaelXie4645@reddit
Does it have a thinking / non thinking switch as well?
FullOf_Bad_Ideas@reddit
I tried it out in full glory on an H200 yesterday. It seems really good, and it's probably going to be the most capable model I'll be able to run locally once I get a 4-bit quant (preferably EXL3, GPTQ, or AWQ) running. It's really slow to get anything out of it, though, so I doubt it will work with Cline as well as Qwen3 32B FP8 does. I can wait for 500-1000 reasoning tokens to generate mid-reply, but when it has to generate 15k tokens to accomplish a task, it's no longer as useful as it could be.
kevin_1994@reddit (OP)
I honestly haven't found the reasoning length to be too bad, but you and others have. It reminds me a lot of QwQ.
CaptBrick@reddit
Good to hear. Thanks for sharing. What is your hardware setup and what speed do you get? Also, what context length are you using?
kevin_1994@reddit (OP)
2x3090, 2x3060
Running at Q8 with 17 tok/s tg, 350 tok/s pp.
Using 64k context
NixTheFolf@reddit
I have not yet, but I plan on doing so! I was wondering if you had any experience with the general knowledge of the model? I prefer models that have good world knowledge for their size, which is something Qwen has always struggled with.
kevin_1994@reddit (OP)
It's MUCH better than Qwen, but still more STEM-focused than the base 3.3.
toothpastespiders@reddit
I just checked the GGUFs and got reminded why I never played around with the original very much. I swear, setting up a single-GPU system at the start of all this was one of the biggest tech mistakes I ever made.
That said, thanks for the reminder. I've just started hearing a trickle of good buzz about this. Enough that I do want to give it a shot.