The 4B class of 2026 (benchmark)
Posted by FederalAnalysis420@reddit | LocalLLaMA | 58 comments
Bench 2 from my 18GB M3 Pro. Last week was specialists vs generalists at 7-8B (which I hosed by giving thinking models a 128-token budget, so half the post was an apology). This week: the 4B class of 2026, every model released or actively-current at the 3-4B size, head-to-head on the same task suite.
Lineup (sizes on disk):
gemma4:e4b            9.6 GB   Google, Apr 2 2026
qwen3.5:4b            3.4 GB   Alibaba, Mar 1 2026
granite4:3b           2.1 GB   IBM, Oct 2025
nemotron-3-nano:4b    2.8 GB   NVIDIA, Mar 2026
phi4-mini:3.8b        2.5 GB   Microsoft, late 2024
39 tasks: 15 finance (P/E, NPV, CAGR, Sharpe), 15 reasoning (word problems, syllogisms, probability), 9 code (FizzBuzz-tier). 3 trials per (model × task), median aggregation. temp=0, seed=42, max_tokens=1024.
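For anyone curious what a single trial looks like, it's roughly this shape against Ollama's HTTP API (simplified sketch; the real harness, with per-token timing capture, is in the repo linked below, and the `grade` callback stands in for the per-task grader):

```python
import statistics
import requests

# temp=0, seed=42, max_tokens=1024 -- Ollama spells max_tokens "num_predict"
OPTIONS = {"temperature": 0, "seed": 42, "num_predict": 1024}

def run_trial(model: str, prompt: str) -> str:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "options": OPTIONS, "stream": False},
    )
    r.raise_for_status()
    return r.json()["response"]

def score(model: str, prompt: str, grade, trials: int = 3) -> float:
    # 3 trials per (model x task), median aggregation
    return statistics.median(grade(run_trial(model, prompt)) for _ in range(trials))
```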
Headline: Nemotron 3 Nano won and it's not close
model                 overall   finance   reasoning   code
nemotron-3-nano:4b      85%      100%       80%        67%
phi4-mini:3.8b          77%       80%       60%       100%
gemma4:e4b              62%       60%       60%        67%
granite4:3b             54%       60%       20%       100%
qwen3.5:4b              15%       20%       20%         0%
NVIDIA's nano is barely a month old and went 15-for-15 on finance.
Looking at the responses (visible in the gist), it's a thinking model: </think> tags before final answers, and it actually finishes its thinking inside the 1024-token budget. The reasoning is clean:

"compute (1.08)^5. 1.08^2=1.1664, ^3=1.259712, ^4=1.36048896, ^5=1.4693280768. So PV = 100,000 / 1.4693280768 = approx 68,058."
That's a 2.8 GB model on disk producing the right answer with the right intermediate work. On finance specifically, it beat every larger model.
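For the record, that arithmetic checks out; the verification is one line of Python:

```python
# PV of 100,000 received in 5 years at 8%
print(round(100_000 / 1.08 ** 5, 2))  # 68058.32
```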
Lab personalities are real at this size
Look at the per-category lines for granite4:3b vs nemotron-3-nano:4b:
granite:  code 100%, reasoning 20%
nemotron: code 67%, reasoning 80%
Two ~3-4 GB models, almost-mirror-image profiles. Granite is a dedicated coder with weak reasoning. Nemotron is a dedicated reasoner with mediocre code. Both come from labs (IBM, NVIDIA) that don't position these as specialist models; they're marketed as general-purpose at this size. The marketing is wrong; the data shows clear specialization.
phi4-mini sits in between: 100% on code, 80% on finance, 60% on reasoning. The most balanced of the bunch and the bang-for-GB winner at 30.8 accuracy points per GB on disk (77% overall / 2.5 GB).
The Qwen 3.5 4b problem
15% accuracy. 30 of 39 responses empty (avg response length: 21 chars out of a 1024-token budget). Same failure mode as Qwen3:4b in bench 1 four months ago. Thinking model that can't finish thinking inside a fixed budget that's reasonable for non-thinking models in the same weight class.
Looking at one of the truncated responses: it gets to "$$PV = \frac{100,000}{(1 + 0.08)^5}$$" and runs out of budget mid-formula. The model isn't broken; my budget gave thinking models 1024 tokens when they need 4096+ to finish. Granite finishes in ~75 tokens on average, Nemotron in ~170; Qwen 3.5 4b burns ~914 tokens of visible-plus-hidden output and still doesn't finish.
This is now a pattern across two bench posts. The eval ecosystem has a thinking-model-in-fixed-budget problem, and I don't think the answer is "make the budget bigger": that punishes the non-thinkers with bloated runs and obscures what's actually being measured.
I'm going to try per-model token budgets in bench 3; the shape I'm imagining is sketched below. Open to better ideas, comment if you have them.
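A per-model lookup with a default; the numbers here are placeholders, not calibrated yet:

```python
# Placeholder per-model budgets for bench 3 -- numbers are NOT final.
TOKEN_BUDGETS = {
    "qwen3.5:4b": 4096,          # thinking model that needs 4k+ to finish
    "nemotron-3-nano:4b": 1024,  # finishes its thinking well under 1k
}
DEFAULT_BUDGET = 1024            # non-thinking models keep the old cap

def budget_for(model: str) -> int:
    return TOKEN_BUDGETS.get(model, DEFAULT_BUDGET)
```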
Methodology + repo
Apple M3 Pro, 18 GB, macOS 25.5, Ollama 0.21. temp=0, seed=42, max_tokens=1024 across all models (this is the design flaw above). 3 trials per task, median aggregation. All graders are deterministic regex/numeric/exec, no LLM-as-judge.
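The numeric grader is roughly this shape (simplified sketch, not the exact repo code):

```python
import re

def grade_numeric(response: str, expected: float, rel_tol: float = 1e-3) -> bool:
    # Treat the last number in the response as the model's final answer.
    nums = re.findall(r"-?\d[\d,]*\.?\d*", response)
    if not nums:
        return False  # empty or truncated responses score zero
    answer = float(nums[-1].replace(",", ""))
    return abs(answer - expected) <= rel_tol * abs(expected)
```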
Repo: https://github.com/joshuahickscorp/bench2
Raw JSONL with full responses + per-token timings: https://gist.github.com/joshuahickscorp/1e8947e2f14dea0930f6f33d987c335e
Up next
Bench 3: lab personalities deep-dive. Should land in 3 days.
Pristine-Woodpecker@reddit
I am confused. If you are punishing models that want to think, why not just disable thinking to begin with?
kikoncuo@reddit
It's punishing overthinking: models that took more tokens to arrive at the correct answer.
If I say "hello" and you have to produce 2k tokens and 30s to say "hello" back, you are a worse model.
Pristine-Woodpecker@reddit
Then just disable thinking, as already said!
kikoncuo@reddit
If you do that you can't measure overthinking...
I want my models to think for tasks where thinking helps.
A model that thinks for 2k tokens about how to answer simple reasoning tasks is going to waste your time and resources on the tasks that do require thinking.
A model that answers the same questions correctly with 200 tokens thinking would be a lot better.
Far-Low-4705@reddit
yep
"max_tokens=1024"
This is not enough
phazei@reddit
I'd really like to see Qwen3.5 4b results with thinking disabled.
Dabber43@reddit
Isn't gemma 4b double the size of qwen 4b?
Pristine-Woodpecker@reddit
Yup. It's really like 8B-A4B.
yrro@reddit
I wish that was in the model name
Sufficient-Bid3874@reddit
Yeah it's only 4b active not total, not sure why it was included. E2B would be more relevant
z_latent@reddit
Because it costs the same to run, and unlike MoE, you can keep the extra parameters on SSD with minimal impact on speed (15 vs 15.5 tok/s tg on a PCIe 3.0 laptop SSD).
llama.cpp still tries to load all parameters into memory at start-up, but even if you don't have enough, the OS should prioritize freeing any unused Per-Layer Embedding memory pages first, so it's fine.
CommunityTough1@reddit
Not sure what the others who answered you are smoking, but Gemma 4 E4B is not a MoE model. It's larger because the vision encoder is larger and it includes an audio input encoder as well. The LLM weights are still 4B.
Sufficient-Bid3874@reddit
We're not smoking.
From Google, officially:
"The "E" in E2B and E4B stands for "effective" parameters. The smaller models incorporate Per-Layer Embeddings (PLE) to maximize parameter efficiency in on-device deployments. Rather than adding more layers or parameters to the model, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count is much smaller than the total."
i.e. it is a MatFormer, and those extra parameters are in fact not the vision and audio encoders.
CommunityTough1@reddit
Yeah, but those are just lookup table layers, not normal weights, and they're already counted towards the total parameter count, so they aren't technically adding anything to the size versus a normal 2B/4B.
Yu2sama@reddit
No, the model card explicitly says that E4B is 8b with the embedding. I don't understand this need on trying to be right when you clearly haven't even read the model card before making the first comment.
JEs4@reddit
It’s funny that it isn’t even 4B effective parameters but 4.5B
CommunityTough1@reddit
K.
Middle_Bullfrog_6173@reddit
It also includes per layer embeddings which are a large part of the weights. They can be offloaded without necessarily hurting speed, but they are used for text inference.
letsgoiowa@reddit
Yeah qwen 3.5 is just set up wrong. It behaved the exact same way until I fixed it. A quicker/dirtier way is to just use it from someone else who fixed it and put out a distill. It seems to be a really finicky model.
Physics-Affectionate@reddit
Would you give me a link to the corrected model?
letsgoiowa@reddit
Sorry on the go right now but if you want a distill, Qwopus is pretty good on HF. Some people hate distills so if you're one of them that's ok but it works for me and fixes the issues.
Physics-Affectionate@reddit
I will try it out thank you
lilbyrdie@reddit
I don't understand the token budget. Why 1024? That seems artificially tiny. Why not 100,000 or 2 million or 10 million?
In real things, my inputs and outputs are usually 5 digits of tokens, sometimes 6. It's not clear to me what stunting the thinking models to just a handful of tokens does and why it's useful. In English, this is equivalent to about 1-2 pages of written prose.
I'm not trying to say it's not useful, I just don't understand the use. Results could be weighted by token use, rather than saying they're not accurate -- when in reality they just didn't finish.
While I realize tokens can be a measure of performance, or cost, in a real use case it's significantly more wasteful to stop something that hasn't finished, because then you get nothing from the token use or time. So I think I'd rather know the efficiency than the accuracy when truncated, right? What am I missing?
So, I'm trying to understand the use cases of this way of measuring.
DefNattyBoii@reddit
Well, it's not tiny, but it's not a lot either. Actually I'm using 256 or 1024, because I don't have all day. Still much better than no thinking at all!
WhiskyAKM@reddit
That weird limit lobotomized Qwen3.5; per artificialanalysis.ai benchmarks it should perform best out of those.
Nicoolodion@reddit
Learn to use colors.
cibernox@reddit
IMO, either you disable thinking or give them a generous thinking budget. 1024 is in no man's land for thinking.
xrvz@reddit
150 people upvoted this garbage post.
AInterested_3664@reddit
The Nano (as far as I know) is optimized for short thinking, so it's probably the only thinking model out of those tested that manages to finish its thinking within the token limit. Qwen uses more thinking tokens, so it can't fit in the budget, and Gemma E4B (which is pretty good) is twice the size, so apples and oranges (as someone else said). Now, how Phi4 manages to keep up with the competition is beyond me. I never found any practical use for the model, but apparently it tests well.
FinBenton@reddit
I'm still using Ministral 3 3b for filtering out ads in my app. I tried to upgrade to qwen 3.5 4b, but after a bunch of testing I keep getting better results from ministral.
hwpoison@reddit
So, basically, nemotron nano is the best of all those?
po_stulate@reddit
Looks like it's phi4-mini 3.8b to me.
Dubious-Decisions@reddit
That would have been my choice too. Lower memory footprint and almost identical performance.
po_stulate@reddit
Also double tokens/s
darkpigvirus@reddit
That's right, and it shows up in tokens per correct answer there: qwen 3.5 burns so many tokens to get the correct answer while the others don't, so qwen is eliminated here. Nemotron is only slightly above qwen, but it gets the correct answer with fewer tokens. Nemotron is just a beast.
iamzooook@reddit
idk for me it was qwen till gemma4 came out. the rest is just noise.
Darklumiere@reddit
Comparing thinking and non-thinking models with such a limited context window is apples to oranges. Why didn't you disable thinking for the thinking models?
Far-Low-4705@reddit
`max_tokens=1024`
This is not nearly enough, and it's why qwen scored so low: it was only able to generate an answer to 8% of the problems.
With qwen3.5, especially the smaller models, even just "hi" can result in 7k output tokens.
myglasstrip@reddit
I don't understand the graphs. If the token budget is 1024, how does qwen have 5k per correct answer? Did you penalize/use all answers? These graphs are a little odd; they don't really represent the text you gave. Quoting system RAM in the graphs is confusing when you're giving score/GB. The bottom 3 graphs were more confusing than helpful.
Velocita84@reddit
Yeah no, an ancient model like phi4 winning over gemma 4 and qwen3.5 already tells you that this benchmark is garbage.
Witty_Mycologist_995@reddit
Either that or models have become more benchmaxxed over time and are bad at generalizing to new tasks.
stoppableDissolution@reddit
"ancient" as in a month old?
Jarpex@reddit
It was released on December 12, 2024.
Arsene_Yuka_1980@reddit
How is the two-year-old phi 4 mini beating the latest models?
ikkiho@reddit
Two things that would change the read on this benchmark.
The 1024-token cap is the same issue from last week, just under different framing. Finance tasks (P/E, NPV, CAGR, Sharpe) have short final answers, so any thinking budget that fits inside the cap finishes. Reasoning tasks have closer to 5-10x the working length, and that's exactly where Qwen 3.5 dies in your table. Pristine-Woodpecker's frame ("disable thinking entirely") is half right; the cleaner fix is per-task-class budgets. Finance 256, code 512, reasoning 2048. Each model gets enough room to finish on its natural verbosity profile, and the ranking stops being about "which model fits in 1024 tokens" and starts being about "which model is correct."
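In harness terms that's a three-line change, assuming each task carries a category tag:

```python
# Per-task-class budgets instead of one global cap (numbers from above).
BUDGETS = {"finance": 256, "code": 512, "reasoning": 2048}

def max_tokens_for(category: str) -> int:
    return BUDGETS[category]
```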
The lineup is mixed activation regime, which the comments below already flagged. gemma4:e4b is 9B-A4B sparse, nemotron-3-nano:4b and qwen3.5:4b are dense, granite4:3b is 3B dense, phi4-mini is 3.8B dense. At the same VRAM budget the sparse model gets roughly 2x effective parameters; at the same active-param count it pays a memory tax. The benchmark collapses those two regimes into one column. Either lock the lineup to all-dense-at-4B (drop gemma4:e4b, add a dense 4B if available), or split the table with an active-vs-total-param column so readers can see what they're paying for.
Two methodology notes since you have the gist out. (1) temp=0 with a fixed seed should make 3 trials near-identical, so the median isn't smoothing anything real. Drop the trials and either run more tasks or report tokens-to-correct as a primary metric. Qwen burning 1500 tokens to get the right answer is a meaningfully different signal from Nemotron getting it in 300, and pass@1-with-cap erases both. (2) 39 tasks across 5 models with no bootstrap CI gives you roughly +/-10% confidence intervals on the per-category numbers. "100% finance vs 80% finance" could be a 12/15 vs 15/15 split that's barely 1.5 sigma apart. Either bump the task count or drop the precision claims.
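If you want the CI without pulling in a stats library, a percentile bootstrap over the per-task 0/1 vector is about five lines:

```python
import random

def bootstrap_ci(results: list[int], n: int = 10_000, alpha: float = 0.05):
    # results: 0/1 correctness per task for one (model, category) cell
    k = len(results)
    means = sorted(sum(random.choices(results, k=k)) / k for _ in range(n))
    return means[int(n * alpha / 2)], means[int(n * (1 - alpha / 2))]

# e.g. 12/15 finance: bootstrap_ci([1]*12 + [0]*3) -> roughly (0.60, 1.00)
```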
Bottom line: the headline that Nemotron Nano won at this budget is probably true, but it's a result about the eval frame, not the model class.
swfsql@reddit
Instead of capping it entirely, why not come up with some random formula that favors lowish token requirements?
Like something that has no score punishment until X tokens, some punishment after that, and leaves the token budget unlimited.
I mean, the "current formula" just throws the model out the window in its entirety.
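Something like this, with all constants made up for illustration:

```python
def soft_score(correct: bool, tokens_used: int,
               free: int = 1024, half_life: int = 2048) -> float:
    # Full credit inside the free allowance; the score halves every
    # `half_life` tokens past it instead of zeroing the run outright.
    if not correct:
        return 0.0
    excess = max(0, tokens_used - free)
    return 0.5 ** (excess / half_life)
```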
CockBrother@reddit
Since you're using ollama - did you just use whatever the default quantization for the model(s) was?
It's kind of difficult to compare the models without going straight to the originals.
Ariquitaun@reddit
Thank you. I'm anecdotally seeing the same reasoning problem on qwen3.5 35b a3b - mega verbose thinking process causing issues with token completion allowances.
o0genesis0o@reddit
When I tested a few weeks ago, nemotron nano had some problems with llama.cpp, so prompt caching was disabled. Unless this error is fixed, it's just not a practical model to use.
MerePotato@reddit
I need to give Nemo a look huh
AltamiroMi@reddit
Would these models be viable to run locally for preparing standard documents based on information given by the user, with RAG over available examples of said documents and pre-made templates?
hamletscatrex@reddit
!RemindMe 1 week
whoisyurii@reddit
!RemindMe 1 week
RemindMeBot@reddit
I will be messaging you in 7 days on 2026-05-04 20:09:15 UTC to remind you of this link
DeltaSqueezer@reddit
For Qwen, you need to control the thinking externally. You run with limited thinking tokens, terminate it at that thinking budget, then run it again with the thinking closed to get the answer.
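Sketch of the two-pass flow, with a hypothetical generate(text, max_tokens) helper standing in for whatever backend you use (the tags shown are the usual Qwen-style think markers; check your model's chat template):

```python
def answer_with_think_budget(generate, prompt: str, think_budget: int = 2048) -> str:
    # Pass 1: open a think block and let the model reason, hard-capped.
    thinking = generate(prompt + "<think>\n", max_tokens=think_budget)
    # Pass 2: force the block closed and generate only the final answer.
    closed = prompt + "<think>\n" + thinking + "\n</think>\n"
    return generate(closed, max_tokens=256)
```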
Clear-Ad-9312@reddit
I wonder how Qwen 3.5 would perform with that GBNF trick other threads are showing, which reduces the long-form thinking some models prefer to do.
met_MY_verse@reddit
!RemindMe 1 week
RemindMeBot@reddit
I will be messaging you in 7 days on 2026-05-04 18:47:17 UTC to remind you of this link