What is the point of Nvidia's Jet-Nemotron-2B?

Posted by Ok_Warning2146@reddit | LocalLLaMA | View on Reddit | 13 comments

In their paper, they claiming 10x faster tokens per sec than its parent model Qwen2.5-1.5B. But in my own test using huggingface transformers, this is not the case. My setup: RTX 3050 6GB transformers 4.53.0 context length=1536 temperature=0.1 top\_p=0.8 repetitive\_penalty=1.25 system: You are a European History Professor named Professor Whitman. prompt: Why was Duke Vladivoj enfeoffed Duchy of Bohemia with the Holy Roman Empire in 1002? Does that mean Duchy of Bohemia was part of the Holy Roman Empire already? If so, when did the Holy Roman Empire acquired Bohemia? |Model|tokens|t/s| |:-|:-|:-| |gemma-3-1b-it|?|?| |Qwen3-1.7B|1433|5.03| |Qwen3-1.7B /nothink|771|5.04| |Jet-Nemotron-2B|312|3.38| |Qwen2.5-1.5B|226|6.22| Surprisingly, gemma-3-1b-it seems very good for its size and tried to role play to the system prompt. However, it seems to be quite slow. Qwen2.5-1.5B is useless as it generates Chinese when asked an English question. Qwen3 runs fast but it is very verbose in thinking mode. Turning off thinking seems to give better answer for historical questions. Jet-Nemotron 2B is slower than Qwen3-1.7B and the reply is not as good. So what is the point? I can only see the theoretical KV cache saving here. Replies from LLMs are detailed in the replies in this thread.

Reply to Post

13 Comments

[-]

And-Bee@reddit

Are you sure the GPU is being used? That looks dog slow.

[-]

Ok_Warning2146@reddit (OP)

Yeah. I can see it in nvtop. It is a 70W card, you can't expect it to be fast.

[-]

InternationalNebula7@reddit

I've run an equally small gemma3n:e2b on a intel i5 forth gen CPU only setup and gotten 8.7 response t/s (short 20 t prompt)... Hard to imagine only 1.36 t/s on a 1B model using the GPU.

[-]

Ok_Warning2146@reddit (OP)

I was running full HF model using the sample code provided by Jet Nemotron 2B.

[-]

R_Duncan@reddit

First prompt: Small models aren't done for data retrieval unless finetuned for that specific task, they are demo or at best can be agentic with an MCP server allowing internet search. For this purpose cache shrinking allows more context in the same vram. Chinese can be fixed adding in system prompt "Talk only in english language." Second prompt shows there's some issue with the setup for the jet-nemotron, as repetition (repetitive??) penalty seems good.

[-]

vasileer@reddit

you have to test with long context to see the difference https://preview.redd.it/r3g84pqg3lzf1.png?width=2254&format=png&auto=webp&s=17e49e54178574844f2ec44ae035b7c6233773f6

[-]

Ok_Warning2146@reddit (OP)

Input is 3940 tokens. I run it at 8096 max_length. The prompt is to summarize this wikipedia page: https://en.wikipedia.org/wiki/Nuclear_option |Model|tokens|t/s| |:-|:-|:-| |Jet-Nemotron-2B|34|1.04| |Owen3-1.7B|1031|5.06| Jet-Nemotron-2B: What do you think? I'm sorry, but your request seems unrelated to my capabilities. Could you please clarify what kind of information or assistance you're looking for?| Qwen3-1.7B **Summary of the Nuclear Option in the U.S. Senate** The **nuclear option** is a procedural mechanism in the U.S. Senate enabling the overriding of a standing rule with a **simple majority**, replacing the traditionally required **three-fifths majority (60 votes)** for cloture on matters like nominations or legislation. Named analogously to nuclear weapons due to its dramatic, irreversible nature, it serves as a tool to bypass entrenched rules, particularly the filibuster, by altering Senate procedures. ### Key Features: 1. **Process**: - A senator raises a *point of order* questioning a standing rule. - The presiding officer resolves the challenge; if upheld, the issue becomes a **nondebatable precedent**. - Appeal outcomes are resolved without debate, sidestepping the need for a **supermajority** (e.g., 2/3 of senators). 2. **Historical Use**: - **2013**: Democrats (Harry Reid) lowered cloture thresholds for nominations (excluding SCOTUS) to a simple majority. - **2017**: Republicans extended this to SCOTUS nominations, enabling cloture on Gorsuch’s nomination. 3. **Precedent Creation**: - Once adopted, the nuclear option creates binding rules, e.g., “cloture on nominations…by majority” (even if the original rule mandates 60 votes). - Examples include reversing earlier restrictions (e.g., the 1995 *Hutchison* precedent, 1996 *FedEx* precedent). 4. **Contextual Rationale**: - Used to address perceived **filibuster abuses** (e.g., blocking SCOTUS nominations under Obama). - Critics warn of risks (e.g., undermining democratic norms, creating power imbalances); supporters claim it enhances efficiency. 5. **Ongoing Debates**: - Proposed eliminations of the 60-vote threshold for legislation, though unsuccessful. - Ongoing discussions about applying the nuclear option to broader issues (e.g., voting rights, budgets), though not yet enacted. 6. **Broader Implications**: - The term extends beyond the Senate to state legislatures and reflects a trend toward **procedural flexibility** amid partisan gridlock. - Symbolizes the tension between **democratic accountability** and **political pragmatism**. Qwen3 is giving a reasonable reply at the same decoding speed as the short prompt. Jet Nemotron can't answer the question at all and runs 3x slower than short prompt. What's going on?

[-]

Ok_Warning2146@reddit (OP)

Thanks for pointing this out. I will try 4K context and see if it can decode 15.6x vs Qwen3-1.7B

[-]

FencingNerd@reddit

What's the use case for small models with long context length? Simple document formatting tasks? Transcript clean-up?

[-]

AppearanceHeavy6724@reddit

Bad language detection, sentiment analysis.

[-]

danish334@reddit

Batch speedup is expected. OP should try it but the single request generation is super slow. Qwen3-4b-thinking Q4 gives about 27 tok/sec on a single request on LM studio.

[-]

jacek2023@reddit

try system prompt to fix Chinese language

[-]

AppearanceHeavy6724@reddit

People still missng the fubndamentals: the **bandwidth of card memory** and number of active weights (2B in this case) set **hard unavoidable limit on the inference speed**. At longer context sizes performance degrades, the smaller model the quicker and then compute also starts to matter. You numbers are completelly off though. Should be 20x higher. Try OG llama.cpp and q8 quants of models.