Gemma 4 31B vs Gemma 4 26B-A4B vs Qwen 3.5 27B — 30-question blind eval with Claude Opus 4.6 as judge
Posted by Silver_Raspberry_811@reddit | LocalLLaMA | 76 comments
Just finished a 3-way head-to-head. Sharing the raw results because this sub has been good about poking holes in methodology, and I'd rather get that feedback than pretend my setup is perfect.
Setup
- 30 questions, 6 per category (code, reasoning, analysis, communication, meta-alignment)
- All three models answer the same question blind — no system prompt differences, same temperature
- Claude Opus 4.6 judges each response independently on a 0-10 scale with a structured rubric (not "which is better," but absolute scoring per response)
- Single judge, no swap-and-average this run — I know that introduces positional bias risk, but Opus 4.6 had a 99.9% parse rate in prior batches so I prioritized consistency over multi-judge noise
- Total cost: $4.50
Win counts (highest score on each question)
| Model | Wins | Win % |
|---|---|---|
| Qwen 3.5 27B | 14 | 46.7% |
| Gemma 4 31B | 12 | 40.0% |
| Gemma 4 26B-A4B | 4 | 13.3% |
Average scores
| Model | Avg Score | Evals |
|---|---|---|
| Gemma 4 31B | 8.82 | 30 |
| Gemma 4 26B-A4B | 8.82 | 28 |
| Qwen 3.5 27B | 8.17 | 30 |
Before you ask — yes, Qwen wins more matchups but has a lower average. That's because it got three 0.0 scores (CODE-001, REASON-004, ANALYSIS-017). Those look like format failures or refusals, not genuinely terrible answers. Strip those out and Qwen's average jumps to ~9.08, highest of the three. So the real story might be: Qwen 3.5 27B is the best model here when it doesn't choke, but it chokes 10% of the time.
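The adjusted-average arithmetic, for anyone who wants to check it:

```python
# Back out the "strip the zeros" average from the reported numbers.
qwen_avg, n, zeros = 8.17, 30, 3  # zeros: CODE-001, REASON-004, ANALYSIS-017

adjusted = qwen_avg * n / (n - zeros)
choke_rate = zeros / n

print(round(adjusted, 2), choke_rate)  # 9.08 0.1
```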
Category breakdown
| Category | Leader |
|---|---|
| Code | Tied — Gemma 4 31B and Qwen (3 each) |
| Reasoning | Qwen dominates (5 of 6) |
| Analysis | Qwen dominates (4 of 6) |
| Communication | Gemma 4 31B dominates (5 of 6) |
| Meta-alignment | Three-way split (2-2-2) |
Other things I noticed
- Gemma 4 26B-A4B (the MoE variant) errored out on 2 questions entirely. When it worked, its scores matched the dense 31B almost exactly — same 8.82 average. Interesting efficiency story if Google cleans up the reliability.
- Gemma 4 31B had some absurdly long response times — multiple 5-minute generations. Looks like heavy internal chain-of-thought. Didn't correlate with better scores.
- Qwen 3.5 27B generates 3-5x more tokens per response on average. Verbosity tax is real but the judge didn't seem to penalize or reward it consistently.
Methodology caveats (since this sub rightfully cares)
- 30 questions is a small sample. I'm not claiming statistical significance, just sharing signal.
- Single judge (Opus 4.6) means any systematic bias it has will show up in every score. I've validated it against multi-judge panels before and it tracked well, but it's still one model's opinion.
- LLM-as-judge has known issues: verbosity bias, self-preference bias, positional bias. I use absolute scoring (not pairwise comparison) to reduce some of this, but it's not eliminated.
- Questions are my own, not pulled from a standard benchmark. That means they're not contaminated, but they also reflect my biases about what matters.
Happy to share the raw per-question scores if anyone wants to dig in. What's your experience been running Gemma 4 locally? Curious if the latency spikes I saw are consistent across different quant levels.
infalleeble@reddit
is this just bots talking to bots at this point?
...about LLMs reviewing other LLMs??
No-Educator-249@reddit
It seems so. Reddit really needs to implement that human verification system soon. It's so easy to tell the bots apart by their use of those long em dashes.
high_funtioning_mess@reddit
I think the same LLM that answered the question acting as a judge to review its own answer has some merit to it. It shows how good the model is at judging/knowing whether the answer/thought process is correct or not.
infalleeble@reddit
take a look at ops post/comment history
Wildnimal@reddit
Good stuff. You should have added the 35B-A3B from Qwen, since you compared an MoE model from Gemma there.
Silver_Raspberry_811@reddit (OP)
You're right — Qwen 3.5 35B-A3B vs Gemma 4 26B-A4B is the more meaningful MoE comparison. Multiple people have flagged this. Queuing it as the next H2H.
corpo_monkey@reddit
What do you mean same temperature?
Temperature is not universal, every LLM has its own preference, and different tasks require different temps.
It's like every athlete must use the same shoe size.
Igot1forya@reddit
You bring up a great point. What's the most accurate way to know the proper temp for each model? Is there a baked-in default that's optimal? I see these stats on the model cards, but is there a central repository or public database that defines each model's ideal temp for each situation? I'd love to set up agents and have a model router just "know" this information and hit the ground running. Every day that goes by I find myself kicking myself further for drawing conclusions about models when it turns out I'm just using them wrong or picking the wrong tool for the job.
Silver_Raspberry_811@reddit (OP)
There isn't a great central repository for this unfortunately. Model cards on HuggingFace sometimes list recommended settings, but they're inconsistent — Google recommends temp 1.0 for everything on Gemma 4, which most people here disagree with for non-creative tasks. The reality is it's still trial-and-error per model per task type.
For a model router setup, you'd probably want to store per-model configs as metadata and load them dynamically. It's an infrastructure problem more than a knowledge problem — someone just needs to build and maintain the lookup table.
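Something like this is all the lookup table would need to be — per-model sampler configs with a safe fallback (model names and values here are illustrative, not vendor recommendations):

```python
# Hypothetical per-model sampler registry for a router.
SAMPLER_DEFAULTS = {
    "gemma-4-31b":  {"temperature": 1.0, "top_k": 64, "top_p": 0.95},
    "qwen-3.5-27b": {"temperature": 0.7, "top_k": 20, "top_p": 0.8},
}
FALLBACK = {"temperature": 0.7, "top_p": 0.9}

def sampler_for(model: str, task: str = "general") -> dict:
    """Return sampler settings for a model, falling back to safe defaults."""
    cfg = dict(SAMPLER_DEFAULTS.get(model, FALLBACK))
    if task == "creative":
        # Nudge temperature up for creative tasks; purely illustrative.
        cfg["temperature"] = min(cfg.get("temperature", 0.7) + 0.3, 1.5)
    return cfg
```

The hard part is maintenance (keeping the table current as models ship), not the code.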
Silver_Raspberry_811@reddit (OP)
Fair point. I used the same temperature (0.7) and max_tokens (2048) across all three models through OpenRouter's API defaults. You're right that each model has its own optimal inference settings — Gemma 4's docs recommend temp 1.0, Qwen has its own recommended params.
This is the same feedback I got on my Qwen batch two weeks ago and it's still a real limitation. Running through an API means I don't control the full inference config. The tradeoff is reproducibility (anyone can hit the same OpenRouter endpoint) vs optimality (each model running at its best). I'm planning a local rerun with model-specific settings to quantify how much it matters.
Far-Low-4705@reddit
2048 is not enough tokens
Silver_Raspberry_811@reddit (OP)
You're right. 2048 max_tokens is capping responses on complex multi-part questions — that's likely why some models scored low on completeness. They ran out of space before covering all sub-parts. I'll bump to 4096 for the next batch. Good catch.
_spacious_joy_@reddit
Why would people downvote you for this? You're doing great.
Reddit is such a bucket of bitter nerds.
VicemanPro@reddit
Doing great? He has no idea what he's doing. First of all you can control temp and other parameters through the API, he admitted he received this feedback last time, and still does it with defaults.
Then he's saying it's a good catch that 2048 tokens isn't enough when it's obvious from his results. Plus he's running every response through an LLM; it's just disingenuous and he's clearly not listening to what people are saying.
_spacious_joy_@reddit
Fair point, I think you're right!
LoaderD@reddit
Because anyone who has worked with LLMs or read any paper on LLM evaluations/judges would know 2048 tokens isn’t enough to get a reasonable comparison.
IrisColt@reddit
I was about to write this.
anomaly256@reddit
This was my thought too
Specialist_Golf8133@reddit
wait the MoE version is getting smoked by the dense model? that's kinda wild actually. thought the whole point of going sparse was you get more capability for the same compute but this is showing the opposite. makes me wonder if we're gonna see a pendulum swing back to dense models once people realize activation efficiency matters less than just raw quality for local use
Silver_Raspberry_811@reddit (OP)
It's less that MoE is getting smoked and more that it matched the dense model almost exactly (both 8.82 average) while activating fewer parameters. The wins gap (4 vs 12) looks bad but with only 30 questions that's within noise range. The real story is that the MoE variant errored out on 2 questions entirely — reliability, not capability, is the issue right now.
I don't think it's a pendulum swing back to dense so much as MoE needing another generation of stability work. If Google fixes the reliability issues, the efficiency argument is strong.
virtualunc@reddit
the MoE numbers on gemma 4 26b are wild.. getting close to the dense 31b while being way cheaper to run. appreciate the methodology transparency too, most people just post "X is better" with zero context on how they tested
did you notice any difference in longer-context performance? imo that's where the real gap shows up between these models
Silver_Raspberry_811@reddit (OP)
Good question — I didn't test longer context specifically in this batch. All 30 questions were single-turn, relatively short prompts. Context window stress testing (8K, 16K, 32K+ input) is a different eval entirely and would probably show bigger gaps between these models than what I found here. Worth designing a dedicated long-context batch for.
And yeah, the MoE efficiency story is the sleeper finding here. Same average score at significantly lower compute is meaningful for local deployment.
RegularHumanMan001@reddit
The single-judge / absolute scoring tradeoff you made is reasonable but the part worth interrogating is whether claude opus 4.6 has consistent sensitivity across all five question categories. judges tend to have strong preferences for certain response styles that show up unevenly across task types you might get reliable signal on reasoning and code where there are more objective markers, but communication and meta-alignment are exactly where bias and self-preference creep in most. The 3-5x token gap from qwen is probably what's driving the lower average despite winning more questions.
Would definitely be worth swapping out the judge model maybe try using a smaller more focused model?
Silver_Raspberry_811@reddit (OP)
You're hitting the exact issue I've been wrestling with. Opus 4.6 is strong on code and reasoning where there are objective markers, but you're right that communication and meta-alignment are where its preferences bleed through most. I actually have the per-category judge variance data from 150 prior frontier evals — score distributions are tighter on code/reasoning and wider on communication, which supports your point.
The token gap driving the wins-vs-average split is almost certainly what's happening. Qwen's three 0.0 scores (likely format failures) tank the average while not affecting win count. Strip those and it's the highest scorer by a clear margin.
On swapping the judge — I've considered it but the tradeoff is parse reliability. Opus 4.6 hit 99.9% across 1,067 judgments. Smaller models I've tested drop to 85-90% and introduce their own biases. Multi-judge panels where you average across 2-3 models is probably the real answer. That's on the roadmap.
ambient_temp_xeno@reddit
LLM as judge = no thanks.
It also depends on how you're running Gemma 4 for the test. The new custom parser for Gemma 4 in llama.cpp b8665 has fixed it for me. Before, it failed the test of just being given the image below. Now it solves it.
high_funtioning_mess@reddit
[Copying my reply from another comment]
I think the same LLM that answered the question acting as a judge to review its own answer has some merit to it. It shows how good the model is at judging/knowing whether the answer/thought process is correct or not.
Silver_Raspberry_811@reddit (OP)
Interesting — hadn't seen braintwin before. I'll take a look. My eval engine is open-source (github.com/themultivac/multivac-evaluation) so if there's overlap or interoperability worth exploring I'm open to it.
Silver_Raspberry_811@reddit (OP)
Understood. LLM-as-judge has real limitations — verbosity bias, self-preference, positional effects. I use it because the alternative (human evaluation at scale) costs 100x more and I'm one person with no funding.
What I can say: Claude Opus 4.6 had a 99.9% parse rate across 1,067 prior judgments, scored in the 7.33 average range (not inflating everything to 9+), and when I compared its rankings against a full 10-model peer matrix, they correlated at 73%. Not perfect. Better than nothing.
The human baseline study is on the roadmap — comparing AI judge rankings against human preferences on the same questions. That's the only way to settle this properly.
Good to know about the llama.cpp b8665 parser fix. I'll note that for anyone running Gemma 4 locally.
Far-Low-4705@reddit
Use programmatic scoring.
Write unit test cases for coding problems, test for exact matches in math problems, score multiple choice questions etc. all of these are better because they give objective results, not opinionated results
Silver_Raspberry_811@reddit (OP)
Agreed that programmatic scoring is more objective where it applies. For code questions I'm building automated pytest execution — run the model's code against test cases, pass/fail, no opinion involved. For math with exact answers, same thing.
The challenge is that 4 of my 5 categories (reasoning, analysis, communication, meta-alignment) don't have clean programmatic answers. "Write a technical proposal" or "explain the flaw in this reasoning" can't be unit tested. LLM-as-judge is the fallback for those.
The roadmap is: programmatic scoring for code and math, LLM-as-judge for everything else, human baseline study to validate the judge. Hybrid approach, not all-or-nothing.
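The programmatic code-scoring piece can be sketched in a few lines: run the model's code plus the question's assertions in a subprocess and record pass/fail. This is a sketch of the idea, not the engine's actual harness, and it does no sandboxing — a real setup should isolate execution:

```python
import os
import subprocess
import sys
import tempfile

def run_code_question(model_code: str, test_code: str, timeout: int = 10) -> bool:
    """Execute model-generated code plus test assertions in a subprocess.

    Pass/fail, no judge opinion involved. Timeouts and crashes count as fail.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(model_code + "\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
```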
high_funtioning_mess@reddit
Interesting! Can you run your custom benchmarks on this free website and share the benchmark results so that we can see the tests and compare how good this is?
It supports llm as a judge as well.
https://benchmark.braintwin.ai
spky-dev@reddit
30 questions is an incredibly insignificant sample size.
Sadman782@reddit
I don't know how you ran it, if you're running it locally using llama.cpp, use the b8660 llama.cpp build (more recent versions have a regression, another tokenization issue) and follow this: https://www.reddit.com/r/LocalLLaMA/comments/1scw979/gemma_4_for_16_gb_vram
I am sure the 26B will do much better.
StardockEngineer@reddit
Is it still regressed? You didn't link to an issue so I can't tell without digging. Help us out :)
PKJedi@reddit
These tips work nicely, thank you!
I have a couple offtopic questions, if you don't mind. Others are free to answer of course.
- Where do you recommend following LocalLLaMA related discussion? Here, the LocalLLM Discord, more?
- Which harness(es) do you recommend for coding with limited size models? OpenCode / Pi / Qwen / other? Any recent guides or important config tips?
Silver_Raspberry_811@reddit (OP)
Hey thanks for stopping by, if you are interested in more: https://discord.gg/2V3dg7hc
RelicDerelict@reddit
No Discord please!
Silver_Raspberry_811@reddit (OP)
Great catches on both fronts. I didn't know about the llama.cpp regression — I ran these through OpenRouter's API so the inference stack is whatever their provider uses, which I can't control. That's a real limitation I should call out more explicitly.
Your judge prompt suggestion is interesting. I use absolute scoring (each response scored independently) specifically to avoid anchoring effects from seeing other responses. But your approach of scoring all responses to the same question together would give better relative ranking. Worth testing both and comparing. I'll try a side-by-side on the next batch.
The recommended sampler settings are noted — I'll add those as a caveat and consider a local rerun with your params. If you want to follow along or run the same questions locally, the engine and all questions are open-source: github.com/themultivac/multivac-evaluation
spaceman_@reddit
Where do those values come from? Google recommends different values on their release page.
Sadman782@reddit
by testing myself.
It depends on your use case, for creative writing or other creative tasks temp 1 might be better, but for coding or tasks which require high accuracy especially if you use a quantized version these values yield much better results
spaceman_@reddit
Thanks for sharing!
Zeeplankton@reddit
Baseless conspiracy theory of the day: I think a lot of the times these companies just copy and paste docs to get the release out. There's rarely a clean answer for temp / top k etc.
Beginning-Window-115@reddit
I've never seen a model use top-k 64 and yet Gemma uses it for some reason
Substantial-Ebb-584@reddit
This. And Claude models alone as a judge is not a good test, since the results will be biased. Additionally claude will bias towards results that have answers similar with its programming.
Traditional-Gap-3313@reddit
If you're using Claude Code, just spin up multiple subagents, one per query, and every one of them will have a clean context.
Jumbling multiple evaluations together introduces noise. If your rubric is correctly written, then not seeing other samples should be a benefit. If it needs to see all of them to judge, then you can't trust the grades and simply need to rank them.
Final-Frosting7742@reddit
I find a model that spends 75% of its tokens thinking unusable on local hardware, especially for RAG tasks where there are already big contexts to process.
That's why I don't like the Qwen family. Their over-verbosity outweighs any benefit they seem to have in terms of reasoning and such.
You should add inverse-verbosity weights. Providing the right answer with fewer tokens = better quality.
Far-Low-4705@reddit
Try giving it a single tool.
It stops the overthinking completely
Silver_Raspberry_811@reddit (OP)
This is a valid design question. Right now the rubric weights correctness (25%), completeness (20%), clarity (20%), depth (20%), usefulness (15%). None of those explicitly reward brevity. Adding a token-efficiency metric — quality per token — is something I've been thinking about.
The counterargument: if a model produces a more thorough answer in more tokens, should it be penalized? For local deployment where inference cost matters, yes. For quality evaluation, maybe not. I might report both: raw score and score-per-1K-tokens.
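For concreteness, here's the composite score and the proposed per-1K-token variant side by side (the weights are the rubric weights above; the efficiency metric is hypothetical, not something the engine reports yet):

```python
# Rubric weights from the current scoring setup.
WEIGHTS = {
    "correctness": 0.25,
    "completeness": 0.20,
    "clarity": 0.20,
    "depth": 0.20,
    "usefulness": 0.15,
}

def weighted_score(subscores: dict) -> float:
    """Composite 0-10 score from per-dimension subscores."""
    return sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS)

def score_per_1k_tokens(subscores: dict, tokens: int) -> float:
    """Proposed efficiency metric: quality normalized by response length."""
    return weighted_score(subscores) / (tokens / 1000)
```

Reporting both keeps the raw-quality view intact while making the verbosity tax visible.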
StupidityCanFly@reddit
For RAG I just send enable_thinking false and it works like a charm.
H_DANILO@reddit
That's interesting, because for me Qwen 3.5 397B is one of the most token-efficient open models.
I find that the lower the parameters are, the more it tends to be verbose.
stddealer@reddit
Gemma 4 31B is the first model that can successfully solve some riddles containing red herrings I like to test models with. Qwen 3.5 27B gets fixated on the irrelevant information and gives a wrong answer, while Gemma 4 manages to ignore it.
Silver_Raspberry_811@reddit (OP)
Hey, thanks for stopping by. If interested in more, join us here: https://discord.gg/2V3dg7hc
Zeeplankton@reddit
I just really like how Gemma formats replies / communicates. It's a bit too glazy but it's just nice to read in LM Studio. And 26B-A4B is so fast on my M3 Max: 60 tok/s.
Silver_Raspberry_811@reddit (OP)
Hey, thanks for stopping by. If interested in more, join us here: https://discord.gg/2V3dg7hc
Potential-Leg-639@reddit
A test with this one as well would be really interesting https://huggingface.co/Jackrong/Qwopus3.5-27B-v3
Silver_Raspberry_811@reddit (OP)
Hey, thanks for stopping by. If interested in more, join us here: https://discord.gg/2V3dg7hc
SnooWoofers7340@reddit
Nice testing man, some valuable info here, thanks for sharing.
Silver_Raspberry_811@reddit (OP)
Hey, thanks for stopping by. If interested in more, join us here: https://discord.gg/2V3dg7hc
3dom@reddit
Thank you! Very interesting test. Could be great to add Qwen 35B though.
Silver_Raspberry_811@reddit (OP)
Hey, thanks for stopping by. If interested in more, join us here: https://discord.gg/2V3dg7hc
daviden1013@reddit
I'm interested to see how Qwen3.5 35B A3B does. It seems more meaningful to compare Gemma 31B vs. Qwen 27B (dense) and Gemma 26B A4B vs. Qwen 35B A3B (MoE).
Eyelbee@reddit
This is genuinely more useful than most benchmarks. How did you run them btw, was it f16 versions?
Silver_Raspberry_811@reddit (OP)
Ran through OpenRouter API, so whatever quant/config the provider uses. Not f16 — that's the limitation of API-based evals. The engine is open-source if you want to run locally with known quants: github.com/themultivac/multivac-evaluation
Middle_Bullfrog_6173@reddit
The results look like you need harder tasks or a stricter rubric to really tell the difference between these. Do you have subscores you can use to tell how the differences come about in practice? E.g. completeness vs correctness vs writing quality or whatever.
Silver_Raspberry_811@reddit (OP)
If you have better suggestions or an architecture design, let me know. I'll try it out in an upcoming phase as resources and headcount increase.
Silver_Raspberry_811@reddit (OP)
All models ran through OpenRouter API — I don't control quantization. That's a known limitation and it's documented in the model_metadata.json saved with each eval.
Silver_Raspberry_811@reddit (OP)
Yes — every judgment has five subscores: correctness, completeness, clarity, depth, usefulness. I can break those out per model. Quick version: Gemma 4 31B scored highest on clarity across the board. Qwen 3.5 27B scored highest on completeness but lowest on clarity. That tracks with the verbosity pattern.
Full per-question scores are on GitHub: github.com/themultivac/multivac-evaluation/tree/main/data/GEMMA4-H2H-20260404
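The subscore rollup is just a grouped mean over the per-question records. A sketch (the record layout is illustrative, not the engine's actual schema):

```python
from collections import defaultdict

def subscore_means(rows):
    """Aggregate per-question subscores into per-model means.

    rows: iterable of (model, {dimension: score}) pairs.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for model, subs in rows:
        counts[model] += 1
        for dim, score in subs.items():
            sums[model][dim] += score
    return {
        m: {d: sums[m][d] / counts[m] for d in sums[m]} for m in sums
    }
```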
Charming_Support726@reddit
Nice. My simple take is that both models are in the same ballpark. Qwen 3.5 has some advantage, but Gemma 4 is very good, especially in human communication - hard to measure with LLM-as-a-Judge.
It feels like Gemma is just lacking a bit of tuning
Silver_Raspberry_811@reddit (OP)
Good to see you again. Agreed — they're in the same ballpark. The Gemma 4 communication results are interesting because that's where subjective quality matters most and LLM-as-judge is weakest. Would love to hear if your experience running them locally matches these scores.
Look_0ver_There@reddit
Would've been nice to also include Qwen 3.5 35B-A3B, since that is the closest counterpart to Gemma 4 26B-A4B
I'm also a little confused on how a "win" is chosen.
Silver_Raspberry_811@reddit (OP)
You're all right — the MoE-to-MoE comparison (Gemma 4 26B-A4B vs Qwen 3.5 35B-A3B) is the more meaningful matchup. I'll queue that as the next H2H. Dense-vs-dense and MoE-vs-MoE makes more sense than mixing them.
If you want to suggest which questions or categories matter most for that comparison, the Discord has a #suggest-models channel: discord.gg/QvVTPCxH
Fun_Nebula_9682@reddit
solid setup, appreciate the transparency on methodology. one thing worth checking — in my experience claude as judge tends to favor longer, more structured responses. if one of the three consistently outputs more text that could inflate scores independent of actual quality. easy to check by plotting score vs response length across all 90 answers.
also the meta-alignment category feels like it'd be most susceptible to single-judge bias — claude will naturally prefer responses that match its own alignment style. running even one more judge (local llama 3 70b or qwen) and checking if rankings hold would make the results way more convincing imo
Silver_Raspberry_811@reddit (OP)
Both suggestions are spot-on. I do have the token counts per response — Qwen averaged 3-5x more tokens than the Gemma models. I haven't plotted score vs response length yet but that's exactly the right analysis. Will do that and post the correlation.
On adding a second judge: I ran judge stats across 150 prior frontier evals and Claude Opus 4.6 was the most reliable by a wide margin (99.9% parse, balanced scoring). Adding a local Llama 3 70B as a second judge is a good idea for cross-validation. The methodology discussion channel on the Discord has been good for working through these design questions: discord.gg/QvVTPCxH
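The length-bias check is straightforward once you have (tokens, score) pairs for all 90 answers. A dependency-free sketch with made-up numbers:

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Illustrative data only; the real pairs come from the eval logs.
token_counts = [410, 950, 1800, 600, 1400]
judge_scores = [8.0, 8.5, 9.0, 7.5, 8.8]

r = pearson(token_counts, judge_scores)
# |r| near 1 would suggest the judge rewards length; near 0, no length bias.
```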
dubesor86@reddit
The token verbosity/inefficiency is a real killer during local use.
WetSound@reddit
In my tests 31B can go way deeper and more complex than the others before totally losing it.
ShelZuuz@reddit
Would be good if these results have t/s, because 8.82 on both 26B-A4B and 31B doesn't make them equivalent.