Gemma 4 31B vs Gemma 4 26B-A4B vs Qwen 3.5 27B — 30-question blind eval with Claude Opus 4.6 as judge

Posted by Silver_Raspberry_811@reddit | LocalLLaMA

Just finished a 3-way head-to-head. Sharing the raw results because this sub has been good about poking holes in methodology, and I'd rather get that feedback than pretend my setup is perfect.

Setup

Win counts (highest score on each question)

| Model | Wins | Win % |
|---|---|---|
| Qwen 3.5 27B | 14 | 46.7% |
| Gemma 4 31B | 12 | 40.0% |
| Gemma 4 26B-A4B | 4 | 13.3% |
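For anyone planning to dig into the raw scores: the win tally is literally just "highest judge score on each question". A minimal sketch of that logic, assuming the scores come as a dict of per-question judge scores (the model keys and score values below are placeholders, not the real data, except that CODE-001 really was one of Qwen's 0.0s):

```python
from collections import Counter

# Placeholder per-question judge scores: {question_id: {model: score}}.
# The real data is the raw per-question scores offered at the end of the post.
scores = {
    "CODE-001": {"qwen3.5-27b": 0.0, "gemma4-31b": 9.0, "gemma4-26b-a4b": 8.5},
    "REASON-002": {"qwen3.5-27b": 9.5, "gemma4-31b": 9.0, "gemma4-26b-a4b": 8.0},
    # ... one entry per question, 30 in total
}

wins = Counter()
for question, per_model in scores.items():
    # Model with the highest judge score takes the question. Note that max()
    # breaks exact score ties by dict order; a careful tally should handle
    # ties explicitly (the 30 wins summing cleanly suggests none occurred here).
    best = max(per_model, key=per_model.get)
    wins[best] += 1

for model, n in wins.most_common():
    print(f"{model}: {n} wins ({n / len(scores):.1%})")
```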

Average scores

| Model | Avg Score | Evals |
|---|---|---|
| Gemma 4 31B | 8.82 | 30 |
| Gemma 4 26B-A4B | 8.82 | 28 |
| Qwen 3.5 27B | 8.17 | 30 |

Before you ask: yes, Qwen wins more matchups but has a lower average. That's because it took three 0.0 scores (CODE-001, REASON-004, ANALYSIS-017), which look like format failures or refusals rather than genuinely terrible answers. Strip those out and Qwen's average jumps to ~9.08, the highest of the three. So the real story might be: Qwen 3.5 27B is the best model here when it doesn't choke, but it chokes 10% of the time (3 of 30 questions).
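If you want to check that arithmetic, the adjusted average is just the same point total spread over 27 questions instead of 30:

```python
# Reproducing the "strip the failures" arithmetic from the post:
# Qwen 3.5 27B averaged 8.17 over 30 evals, three of which scored 0.0.
n_total, mean_all, n_zeros = 30, 8.17, 3

total = mean_all * n_total             # 245.1 points across all 30 questions
mean_trimmed = total / (n_total - n_zeros)

# mean_all is already rounded, so this is approximate: prints 9.08
print(f"{mean_trimmed:.2f}")
```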

Category breakdown

| Category | Leader |
|---|---|
| Code | Tied: Gemma 4 31B and Qwen (3 wins each) |
| Reasoning | Qwen dominates (5 of 6) |
| Analysis | Qwen dominates (4 of 6) |
| Communication | Gemma 4 31B dominates (5 of 6) |
| Meta-alignment | Three-way split (2-2-2) |
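These leaders fall out of the same per-question winners as the tally sketch above, grouped by the category prefix in the question IDs (CODE-, REASON-, ANALYSIS-, and so on). Rough sketch, with a placeholder winners dict:

```python
from collections import Counter, defaultdict

# winners maps question_id -> winning model, as computed in the tally sketch.
# The two entries here are placeholders for illustration only.
winners = {"CODE-002": "gemma4-31b", "REASON-001": "qwen3.5-27b"}

by_category = defaultdict(Counter)
for qid, model in winners.items():
    category = qid.split("-")[0]   # "CODE-002" -> "CODE"
    by_category[category][model] += 1

for category, tally in by_category.items():
    print(category, tally.most_common())
```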

Other things I noticed

Methodology caveats (since this sub rightfully cares)

Happy to share the raw per-question scores if anyone wants to dig in. What's your experience been running Gemma 4 locally? Curious if the latency spikes I saw are consistent across different quant levels.