What is the point of Nvidia's Jet-Nemotron-2B?

Posted by Ok_Warning2146@reddit | LocalLLaMA | View on Reddit | 13 comments

In their paper, they claiming 10x faster tokens per sec than its parent model Qwen2.5-1.5B. But in my own test using huggingface transformers, this is not the case. My setup: RTX 3050 6GB transformers 4.53.0 context length=1536 temperature=0.1 top\_p=0.8 repetitive\_penalty=1.25 system: You are a European History Professor named Professor Whitman. prompt: Why was Duke Vladivoj enfeoffed Duchy of Bohemia with the Holy Roman Empire in 1002? Does that mean Duchy of Bohemia was part of the Holy Roman Empire already? If so, when did the Holy Roman Empire acquired Bohemia? |Model|tokens|t/s| |:-|:-|:-| |gemma-3-1b-it|?|?| |Qwen3-1.7B|1433|5.03| |Qwen3-1.7B /nothink|771|5.04| |Jet-Nemotron-2B|312|3.38| |Qwen2.5-1.5B|226|6.22| Surprisingly, gemma-3-1b-it seems very good for its size and tried to role play to the system prompt. However, it seems to be quite slow. Qwen2.5-1.5B is useless as it generates Chinese when asked an English question. Qwen3 runs fast but it is very verbose in thinking mode. Turning off thinking seems to give better answer for historical questions. Jet-Nemotron 2B is slower than Qwen3-1.7B and the reply is not as good. So what is the point? I can only see the theoretical KV cache saving here. Replies from LLMs are detailed in the replies in this thread.