The 4B class of 2026 (benchmark)
Posted by FederalAnalysis420@reddit | LocalLLaMA | 58 comments
Bench 2 from my 18GB M3 Pro. Last week was specialists vs generalists at 7-8B (which I hosed by giving thinking models a 128-token budget, so half the post was an apology). This week: the 4B class of 2026, every model released or actively-current at the 3-4B size, head-to-head on the same task suite.
Lineup (sizes on disk):
gemma4:e4b            9.6 GB   Google, Apr 2 2026
qwen3.5:4b            3.4 GB   Alibaba, Mar 1 2026
granite4:3b           2.1 GB   IBM, Oct 2025
nemotron-3-nano:4b    2.8 GB   NVIDIA, Mar 2026
phi4-mini:3.8b        2.5 GB   Microsoft, late 2024
39 tasks: 15 finance (P/E, NPV, CAGR, Sharpe), 15 reasoning (word problems, syllogisms, probability), 9 code (FizzBuzz-tier). 3 trials per (model × task), median aggregation. temp=0, seed=42, max_tokens=1024.
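For anyone curious what a single trial looks like, it's roughly this shape against Ollama's HTTP API (simplified sketch; the real harness, with per-token timing capture, is in the repo linked below, and the `grade` callback stands in for the per-task grader):

```python
import statistics
import requests

# temp=0, seed=42, max_tokens=1024 -- Ollama spells max_tokens "num_predict"
OPTIONS = {"temperature": 0, "seed": 42, "num_predict": 1024}

def run_trial(model: str, prompt: str) -> str:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "options": OPTIONS, "stream": False},
    )
    r.raise_for_status()
    return r.json()["response"]

def score(model: str, prompt: str, grade, trials: int = 3) -> float:
    # 3 trials per (model x task), median aggregation
    return statistics.median(grade(run_trial(model, prompt)) for _ in range(trials))
```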
Headline: Nemotron 3 Nano won and it's not close
model                 overall   finance   reasoning   code
nemotron-3-nano:4b      85%      100%       80%        67%
phi4-mini:3.8b          77%       80%       60%       100%
gemma4:e4b              62%       60%       60%        67%
granite4:3b             54%       60%       20%       100%
qwen3.5:4b              15%       20%       20%         0%
NVIDIA's nano is barely a month old and went 15-for-15 on finance.
Looking at the responses (visible in the gist), it's a thinking model: </think> tags before final answers, and it actually finishes its thinking inside the 1024-token budget. The reasoning is clean:

"compute (1.08)^5. 1.08^2=1.1664, ^3=1.259712, ^4=1.36048896, ^5=1.4693280768. So PV = 100,000 / 1.4693280768 = approx 68,058."
That's a 2.8 GB model on disk producing the right answer with the right intermediate work. On finance specifically, it beat every larger model.
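For the record, that arithmetic checks out; the verification is one line of Python:

```python
# PV of 100,000 received in 5 years at 8%
print(round(100_000 / 1.08 ** 5, 2))  # 68058.32
```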
Lab personalities are real at this size
Look at the per-category lines for granite4:3b vs nemotron-3-nano:4b:
granite:  code 100%, reasoning 20%
nemotron: code 67%, reasoning 80%
Two ~3-4 GB models, almost-mirror-image profiles. Granite is a dedicated coder with weak reasoning. Nemotron is a dedicated reasoner with mediocre code. Both come from labs (IBM, NVIDIA) that don't position these as specialist models; they're marketed as general-purpose at this size. The marketing is wrong; the data shows clear specialization.
phi4-mini sits in between: 100% on code, 80% on finance, 60% on reasoning. The most balanced of the bunch and the bang-for-GB winner at 30.8 accuracy points per GB on disk (77% overall / 2.5 GB).
The Qwen 3.5 4b problem
15% accuracy. 30 of 39 responses empty (avg response length: 21 chars out of a 1024-token budget). Same failure mode as Qwen3:4b in bench 1 four months ago. Thinking model that can't finish thinking inside a fixed budget that's reasonable for non-thinking models in the same weight class.
Looking at one of the truncated responses: it gets to "$$PV = \frac{100,000}{(1 + 0.08)^5}$$" and runs out of budget mid-formula. The model isn't broken; my budget gave thinking models 1024 tokens when they need 4096+ to finish. Granite finishes in ~75 tokens on average, Nemotron in ~170; Qwen 3.5 4b burns ~914 tokens of visible-plus-hidden output and still doesn't finish.
This is now a pattern across two bench posts. The eval ecosystem has a thinking-model-in-fixed-budget problem, and I don't think the answer is "make the budget bigger": that punishes the non-thinkers with bloated runs and obscures what's actually being measured.
I'm going to try per-model token budgets in bench 3; the shape I'm imagining is sketched below. Open to better ideas, comment if you have them.
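A per-model lookup with a default; the numbers here are placeholders, not calibrated yet:

```python
# Placeholder per-model budgets for bench 3 -- numbers are NOT final.
TOKEN_BUDGETS = {
    "qwen3.5:4b": 4096,          # thinking model that needs 4k+ to finish
    "nemotron-3-nano:4b": 1024,  # finishes its thinking well under 1k
}
DEFAULT_BUDGET = 1024            # non-thinking models keep the old cap

def budget_for(model: str) -> int:
    return TOKEN_BUDGETS.get(model, DEFAULT_BUDGET)
```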
Methodology + repo
Apple M3 Pro, 18 GB, macOS 25.5, Ollama 0.21. temp=0, seed=42, max_tokens=1024 across all models (this is the design flaw above). 3 trials per task, median aggregation. All graders are deterministic regex/numeric/exec, no LLM-as-judge.
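The numeric grader is roughly this shape (simplified sketch, not the exact repo code):

```python
import re

def grade_numeric(response: str, expected: float, rel_tol: float = 1e-3) -> bool:
    # Treat the last number in the response as the model's final answer.
    nums = re.findall(r"-?\d[\d,]*\.?\d*", response)
    if not nums:
        return False  # empty or truncated responses score zero
    answer = float(nums[-1].replace(",", ""))
    return abs(answer - expected) <= rel_tol * abs(expected)
```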
Repo: https://github.com/joshuahickscorp/bench2
Raw JSONL with full responses + per-token timings: https://gist.github.com/joshuahickscorp/1e8947e2f14dea0930f6f33d987c335e
Up next
Bench 3: lab personalities deep-dive. Should land in 3 days.
Pristine-Woodpecker@reddit
I am confused. If you are punishing models that want to think, why not just disable thinking to begin with?
kikoncuo@reddit
It's punishing overthinking: models that took more tokens to arrive at the correct answer.
If I say "hello" and you have to produce 2k tokens and 30s to say "hello" back, you are a worse model.
Pristine-Woodpecker@reddit
Then just disable thinking, as already said!
kikoncuo@reddit
If you do that you can't measure overthinking...
I want my models to think for tasks where thinking helps.
A model that thinks for 2k tokens about how to answer simple reasoning tasks is going to waste your time and resources on the tasks that do require thinking.
A model that answers the same questions correctly with 200 tokens thinking would be a lot better.
Far-Low-4705@reddit
yep
"max_tokens=1024"
This is not enough
phazei@reddit
I'd really like to see Qwen3.5 4b results with thinking disabled.
Dabber43@reddit
Isn't gemma 4b double the size of qwen 4b?
Pristine-Woodpecker@reddit
Yup. It's really like 8B-A4B.
yrro@reddit
I wish that was in the model name
Sufficient-Bid3874@reddit
Yeah it's only 4b active not total, not sure why it was included. E2B would be more relevant
z_latent@reddit
Because it costs the same to run, and unlike MoE, you can keep the extra parameters on SSD with minimal impact on speed (15 vs 15.5 tok/s tg on a PCIe 3.0 laptop SSD).
llama.cpp still tries to load all parameters into memory at start-up, but even if you don't have enough, the OS should prioritize freeing any unused Per-Layer Embedding memory pages first, so it's fine.
CommunityTough1@reddit
Not sure what the others who answered you are smoking, but Gemma 4 E4B is not a MoE model. It's larger because the vision encoder is larger and it includes an audio input encoder as well. The LLM weights are still 4B.
Sufficient-Bid3874@reddit
We're not smoking.
From Google, officially:
"The "E" in E2B and E4B stands for "effective" parameters. The smaller models incorporate Per-Layer Embeddings (PLE) to maximize parameter efficiency in on-device deployments. Rather than adding more layers or parameters to the model, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count is much smaller than the total."
i.e. it is a MatFormer, and those extra parameters are in fact not the vision and audio encoders.
CommunityTough1@reddit
Yeah, but those are just lookup table layers, not normal weights, and they're already counted towards the total parameter count, so they aren't technically adding anything to the size versus a normal 2B/4B.
Yu2sama@reddit
No, the model card explicitly says that E4B is 8b with the embedding. I don't understand this need on trying to be right when you clearly haven't even read the model card before making the first comment.
JEs4@reddit
It’s funny that it isn’t even 4B effective parameters but 4.5B
CommunityTough1@reddit
K.
Middle_Bullfrog_6173@reddit
It also includes per layer embeddings which are a large part of the weights. They can be offloaded without necessarily hurting speed, but they are used for text inference.
letsgoiowa@reddit
Yeah qwen 3.5 is just set up wrong. It behaved the exact same way until I fixed it. A quicker/dirtier way is to just use it from someone else who fixed it and put out a distill. It seems to be a really finicky model.
Physics-Affectionate@reddit
Would you give me a link to the corrected model?
letsgoiowa@reddit
Sorry on the go right now but if you want a distill, Qwopus is pretty good on HF. Some people hate distills so if you're one of them that's ok but it works for me and fixes the issues.
Physics-Affectionate@reddit
I will try it out thank you
lilbyrdie@reddit
I don't understand the token budget. Why 1024? That seems artificially tiny. Why not 100,000 or 2 million or 10 million?
In real things, my inputs and outputs are usually 5 digits of tokens, sometimes 6. It's not clear to me what stunting the thinking models to just a handful of tokens does and why it's useful. In English, this is equivalent to about 1-2 pages of written prose.
I'm not trying to say it's not useful, I just don't understand the use. Results could be weighted by token use, rather than saying they're not accurate -- when in reality they just didn't finish.
While I realize tokens can be a measure of performance, or cost, in a real use case it's significantly more wasteful to stop something that hasn't finished, because then you get nothing from the token use or time. So I think I'd rather know the efficiency than the accuracy when truncated, right? What am I missing?
So, I'm trying to understand the use cases of this way of measuring.
DefNattyBoii@reddit
Well, it's not tiny, but it's not a lot either. Actually I'm using 256 or 1024, because I don't have all day. Still much better than no thinking at all!
WhiskyAKM@reddit
That weird limit lobotomized Qwen3.5; per artificialanalysis.ai benchmarks it should perform best out of those.
Nicoolodion@reddit
Learn to use colors.
cibernox@reddit
IMO, either you disable thinking or give them a generous thinking budget. 1024 is in no man's land for thinking.
xrvz@reddit
150 people upvoted this garbage post.
AInterested_3664@reddit
The Nano (as far as I know) is optimized for short thinking, so it's probably the only thinking model out of those tested that manages to finish its thinking within the token limit. Qwen uses more thinking tokens, so it can't fit in the budget, and Gemma E4B (which is pretty good) is twice the size, so apples and oranges (as someone else said). Now, how Phi4 manages to keep up with the competition is beyond me. I never found any practical use for the model, but apparently it tests well.
FinBenton@reddit
I'm still using Ministral 3 3b for filtering out ads in my app. I tried to upgrade to qwen 3.5 4b, but after a bunch of testing I keep getting better results from ministral.
hwpoison@reddit
So, basically, nemotron nano is the best of all those?
po_stulate@reddit
Looks like it's phi4-mini 3.8b to me.
Dubious-Decisions@reddit
That would have been my choice too. Lower memory footprint and almost identical performance.
po_stulate@reddit
Also double tokens/s
darkpigvirus@reddit
That's right, and it shows up in tokens per correct answer there: qwen 3.5 burns so many tokens to get the correct answer while the others don't, so qwen is eliminated here. Nemotron is only slightly above qwen, but it gets the correct answer with fewer tokens. Nemotron is just a beast.
iamzooook@reddit
idk for me it was qwen till gemma4 came out. the rest is just noise.
Darklumiere@reddit
Comparing thinking and non-thinking models with such a limited context window is apples to oranges. Why didn't you disable thinking for the thinking models?
Far-Low-4705@reddit
`max_tokens=1024`
This is not nearly enough, and it's why qwen scored so low: it was only able to generate an answer to 8% of the problems.
With qwen3.5, especially the smaller models, even just "hi" can result in 7k output tokens.
myglasstrip@reddit
I don't understand the graphs. If the token budget is 1024, how does qwen have 5k per correct answer? Did you penalize/use all answers? These graphs are a little odd; they don't really represent the text you gave. Quoting system RAM in the graphs is confusing when you're giving score/GB. The bottom 3 graphs were more confusing than helpful.
Velocita84@reddit
Yeah no, an ancient model like phi4 winning over gemma 4 and qwen3.5 already tells you that this benchmark is garbage.
Witty_Mycologist_995@reddit
Either that or models have become more benchmaxxed over time and are bad at generalizing to new tasks.
stoppableDissolution@reddit
"ancient" as in a month old?
Jarpex@reddit
It was released on December 12, 2024.
Arsene_Yuka_1980@reddit
How is the two-year-old phi 4 mini beating the latest models?
ikkiho@reddit
Two things that would change the read on this benchmark.
The 1024-token cap is the same issue from last week, just under different framing. Finance tasks (P/E, NPV, CAGR, Sharpe) have short final answers, so any thinking budget that fits inside the cap finishes. Reasoning tasks have closer to 5-10x the working length, and that's exactly where Qwen 3.5 dies in your table. Pristine-Woodpecker's frame ("disable thinking entirely") is half right; the cleaner fix is per-task-class budgets. Finance 256, code 512, reasoning 2048. Each model gets enough room to finish on its natural verbosity profile, and the ranking stops being about "which model fits in 1024 tokens" and starts being about "which model is correct."
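In harness terms that's a three-line change, assuming each task carries a category tag:

```python
# Per-task-class budgets instead of one global cap (numbers from above).
BUDGETS = {"finance": 256, "code": 512, "reasoning": 2048}

def max_tokens_for(category: str) -> int:
    return BUDGETS[category]
```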
The lineup is mixed activation regime, which the comments below already flagged. gemma4:e4b is 9B-A4B sparse, nemotron-3-nano:4b and qwen3.5:4b are dense, granite4:3b is 3B dense, phi4-mini is 3.8B dense. At the same VRAM budget the sparse model gets roughly 2x effective parameters; at the same active-param count it pays a memory tax. The benchmark collapses those two regimes into one column. Either lock the lineup to all-dense-at-4B (drop gemma4:e4b, add a dense 4B if available), or split the table with an active-vs-total-param column so readers can see what they're paying for.
Two methodology notes since you have the gist out. (1) temp=0 with a fixed seed should make 3 trials near-identical, so the median isn't smoothing anything real. Drop the trials and either run more tasks or report tokens-to-correct as a primary metric. Qwen burning 1500 tokens to get the right answer is a meaningfully different signal from Nemotron getting it in 300, and pass@1-with-cap erases both. (2) 39 tasks across 5 models with no bootstrap CI gives you roughly +/-10% confidence intervals on the per-category numbers. "100% finance vs 80% finance" could be a 12/15 vs 15/15 split that's barely 1.5 sigma apart. Either bump the task count or drop the precision claims.
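If you want the CI without pulling in a stats library, a percentile bootstrap over the per-task 0/1 vector is about five lines:

```python
import random

def bootstrap_ci(results: list[int], n: int = 10_000, alpha: float = 0.05):
    # results: 0/1 correctness per task for one (model, category) cell
    k = len(results)
    means = sorted(sum(random.choices(results, k=k)) / k for _ in range(n))
    return means[int(n * alpha / 2)], means[int(n * (1 - alpha / 2))]

# e.g. 12/15 finance: bootstrap_ci([1]*12 + [0]*3) -> roughly (0.60, 1.00)
```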
Bottom line: the headline that Nemotron Nano won at this budget is probably true, but it's a result about the eval frame, not the model class.
swfsql@reddit
Instead of capping it entirely, why not come up with some random formula that favors lowish token requirements?
Like something that has no score punishment until X tokens, some punishment after that, and leaves the token budget unlimited.
I mean, the "current formula" just throws the model out the window in its entirety.
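Something like this, with all constants made up for illustration:

```python
def soft_score(correct: bool, tokens_used: int,
               free: int = 1024, half_life: int = 2048) -> float:
    # Full credit inside the free allowance; the score halves every
    # `half_life` tokens past it instead of zeroing the run outright.
    if not correct:
        return 0.0
    excess = max(0, tokens_used - free)
    return 0.5 ** (excess / half_life)
```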
CockBrother@reddit
Since you're using ollama - did you just use whatever the default quantization for the model(s) was?
It's kind of difficult to compare the models without going straight to the originals.
Ariquitaun@reddit
Thank you. I'm anecdotally seeing the same reasoning problem on qwen3.5 35b a3b - mega verbose thinking process causing issues with token completion allowances.
o0genesis0o@reddit
When I tested a few weeks ago, nemotron nano had some problems with llama.cpp, so prompt caching was disabled. Unless this error is fixed, it's just not a practical model to use.
MerePotato@reddit
I need to give Nemo a look huh
AltamiroMi@reddit
Would these models be viable to run locally for preparing standard documents based on information given by the user, with RAG over available examples of said documents and pre-made templates?
hamletscatrex@reddit
!RemindMe 1 week
whoisyurii@reddit
!RemindMe 1 week
RemindMeBot@reddit
I will be messaging you in 7 days on 2026-05-04 20:09:15 UTC to remind you of this link
DeltaSqueezer@reddit
For Qwen, you need to control the thinking externally. You run with limited thinking tokens, terminate it at that thinking budget, then run it again with the thinking closed to get the answer.
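Sketch of the two-pass flow, with a hypothetical generate(text, max_tokens) helper standing in for whatever backend you use (the tags shown are the usual Qwen-style think markers; check your model's chat template):

```python
def answer_with_think_budget(generate, prompt: str, think_budget: int = 2048) -> str:
    # Pass 1: open a think block and let the model reason, hard-capped.
    thinking = generate(prompt + "<think>\n", max_tokens=think_budget)
    # Pass 2: force the block closed and generate only the final answer.
    closed = prompt + "<think>\n" + thinking + "\n</think>\n"
    return generate(closed, max_tokens=256)
```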
Clear-Ad-9312@reddit
I wonder how Qwen 3.5 would perform with that GBNF trick other threads are showing, which reduces the long-form thinking some models prefer to do.
met_MY_verse@reddit
!RemindMe 1 week
RemindMeBot@reddit
I will be messaging you in 7 days on 2026-05-04 18:47:17 UTC to remind you of this link