Gemma 4 31B vs Gemma 4 26B-A4B vs Qwen 3.5 27B — 30-question blind eval with Claude Opus 4.6 as judge
Posted by Silver_Raspberry_811@reddit | LocalLLaMA | 76 comments
Just finished a 3-way head-to-head. Sharing the raw results because this sub has been good about poking holes in methodology, and I'd rather get that feedback than pretend my setup is perfect.
Setup
- 30 questions, 6 per category (code, reasoning, analysis, communication, meta-alignment)
- All three models answer the same question blind — no system prompt differences, same temperature
- Claude Opus 4.6 judges each response independently on a 0-10 scale with a structured rubric (not "which is better," but absolute scoring per response)
- Single judge, no swap-and-average this run — I know that introduces positional bias risk, but Opus 4.6 had a 99.9% parse rate in prior batches so I prioritized consistency over multi-judge noise
- Total cost: $4.50
Win counts (highest score on each question)
| Model | Wins | Win % |
|---|---|---|
| Qwen 3.5 27B | 14 | 46.7% |
| Gemma 4 31B | 12 | 40.0% |
| Gemma 4 26B-A4B | 4 | 13.3% |
Average scores
| Model | Avg Score | Evals |
|---|---|---|
| Gemma 4 31B | 8.82 | 30 |
| Gemma 4 26B-A4B | 8.82 | 28 |
| Qwen 3.5 27B | 8.17 | 30 |
Before you ask — yes, Qwen wins more matchups but has a lower average. That's because it got three 0.0 scores (CODE-001, REASON-004, ANALYSIS-017). Those look like format failures or refusals, not genuinely terrible answers. Strip those out and Qwen's average jumps to ~9.08, highest of the three. So the real story might be: Qwen 3.5 27B is the best model here when it doesn't choke, but it chokes 10% of the time.
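The adjusted-average arithmetic, for anyone who wants to check it:

```python
# Back out the "strip the zeros" average from the reported numbers.
qwen_avg, n, zeros = 8.17, 30, 3  # zeros: CODE-001, REASON-004, ANALYSIS-017

adjusted = qwen_avg * n / (n - zeros)
choke_rate = zeros / n

print(round(adjusted, 2), choke_rate)  # 9.08 0.1
```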
Category breakdown
| Category | Leader |
|---|---|
| Code | Tied — Gemma 4 31B and Qwen (3 each) |
| Reasoning | Qwen dominates (5 of 6) |
| Analysis | Qwen dominates (4 of 6) |
| Communication | Gemma 4 31B dominates (5 of 6) |
| Meta-alignment | Three-way split (2-2-2) |
Other things I noticed
- Gemma 4 26B-A4B (the MoE variant) errored out on 2 questions entirely. When it worked, its scores matched the dense 31B almost exactly — same 8.82 average. Interesting efficiency story if Google cleans up the reliability.
- Gemma 4 31B had some absurdly long response times — multiple 5-minute generations. Looks like heavy internal chain-of-thought. Didn't correlate with better scores.
- Qwen 3.5 27B generates 3-5x more tokens per response on average. Verbosity tax is real but the judge didn't seem to penalize or reward it consistently.
Methodology caveats (since this sub rightfully cares)
- 30 questions is a small sample. I'm not claiming statistical significance, just sharing signal.
- Single judge (Opus 4.6) means any systematic bias it has will show up in every score. I've validated it against multi-judge panels before and it tracked well, but it's still one model's opinion.
- LLM-as-judge has known issues: verbosity bias, self-preference bias, positional bias. I use absolute scoring (not pairwise comparison) to reduce some of this, but it's not eliminated.
- Questions are my own, not pulled from a standard benchmark. That means they're not contaminated, but they also reflect my biases about what matters.
Happy to share the raw per-question scores if anyone wants to dig in. What's your experience been running Gemma 4 locally? Curious if the latency spikes I saw are consistent across different quant levels.
infalleeble@reddit
is this just bots talking to bots at this point?
...about LLMs reviewing other LLMs??
No-Educator-249@reddit
It seems so. Reddit really needs to implement that human verification system soon. It's so easy to tell the bots apart by their use of those long em dashes.
high_funtioning_mess@reddit
I think the same LLM that answered the question acting as a judge to review its own answer has some merit to it. It shows how good the model is at judging/knowing whether the answer/thought process is correct or not.
infalleeble@reddit
take a look at ops post/comment history
Wildnimal@reddit
Good stuff. You should have added the 35B-A3B from Qwen, since you compared an MoE model from Gemma there.
Silver_Raspberry_811@reddit (OP)
You're right — Qwen 3.5 35B-A3B vs Gemma 4 26B-A4B is the more meaningful MoE comparison. Multiple people have flagged this. Queuing it as the next H2H.
corpo_monkey@reddit
What do you mean same temperature?
Temperature is not universal, every LLM has its own preference, and different tasks require different temps.
It's like every athlete must use the same shoe size.
Igot1forya@reddit
You bring up a great point. What's the most accurate way to know the proper temp for each model? Is there a baked-in default that's optimal? I see these stats on the model cards, but is there a central repository or public database that defines each model's ideal temp for each situation? I'd love to set up agents and have a model router just "know" this information and hit the ground running. Every day that goes by I find myself kicking myself further for drawing conclusions about models when it turns out I'm just using them wrong or picking the wrong tool for the job.
Silver_Raspberry_811@reddit (OP)
There isn't a great central repository for this unfortunately. Model cards on HuggingFace sometimes list recommended settings, but they're inconsistent — Google recommends temp 1.0 for everything on Gemma 4, which most people here disagree with for non-creative tasks. The reality is it's still trial-and-error per model per task type.
For a model router setup, you'd probably want to store per-model configs as metadata and load them dynamically. It's an infrastructure problem more than a knowledge problem — someone just needs to build and maintain the lookup table.
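Something like this is all the lookup table would need to be — per-model sampler configs with a safe fallback (model names and values here are illustrative, not vendor recommendations):

```python
# Hypothetical per-model sampler registry for a router.
SAMPLER_DEFAULTS = {
    "gemma-4-31b":  {"temperature": 1.0, "top_k": 64, "top_p": 0.95},
    "qwen-3.5-27b": {"temperature": 0.7, "top_k": 20, "top_p": 0.8},
}
FALLBACK = {"temperature": 0.7, "top_p": 0.9}

def sampler_for(model: str, task: str = "general") -> dict:
    """Return sampler settings for a model, falling back to safe defaults."""
    cfg = dict(SAMPLER_DEFAULTS.get(model, FALLBACK))
    if task == "creative":
        # Nudge temperature up for creative tasks; purely illustrative.
        cfg["temperature"] = min(cfg.get("temperature", 0.7) + 0.3, 1.5)
    return cfg
```

The hard part is maintenance (keeping the table current as models ship), not the code.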
Silver_Raspberry_811@reddit (OP)
Fair point. I used the same temperature (0.7) and max_tokens (2048) across all three models through OpenRouter's API defaults. You're right that each model has its own optimal inference settings — Gemma 4's docs recommend temp 1.0, Qwen has its own recommended params.
This is the same feedback I got on my Qwen batch two weeks ago and it's still a real limitation. Running through an API means I don't control the full inference config. The tradeoff is reproducibility (anyone can hit the same OpenRouter endpoint) vs optimality (each model running at its best). I'm planning a local rerun with model-specific settings to quantify how much it matters.
Far-Low-4705@reddit
2048 is not enough tokens
Silver_Raspberry_811@reddit (OP)
You're right. 2048 max_tokens is capping responses on complex multi-part questions — that's likely why some models scored low on completeness. They ran out of space before covering all sub-parts. I'll bump to 4096 for the next batch. Good catch.
_spacious_joy_@reddit
Why would people downvote you for this? You're doing great.
Reddit is such a bucket of bitter nerds.
VicemanPro@reddit
Doing great? He has no idea what he's doing. First of all you can control temp and other parameters through the API, he admitted he received this feedback last time, and still does it with defaults.
Then he's saying it's a good catch that 2048 tokens isn't enough when it's obvious from his results. Plus he's running every response through an LLM; it's just disingenuous and he's clearly not listening to what people are saying.
_spacious_joy_@reddit
Fair point, I think you're right!
LoaderD@reddit
Because anyone who has worked with LLMs or read any paper on LLM evaluations/judges would know 2048 tokens isn’t enough to get a reasonable comparison.
IrisColt@reddit
I was about to write this.
anomaly256@reddit
This was my thought too
Specialist_Golf8133@reddit
wait the MoE version is getting smoked by the dense model? that's kinda wild actually. thought the whole point of going sparse was you get more capability for the same compute but this is showing the opposite. makes me wonder if we're gonna see a pendulum swing back to dense models once people realize activation efficiency matters less than just raw quality for local use
Silver_Raspberry_811@reddit (OP)
It's less that MoE is getting smoked and more that it matched the dense model almost exactly (both 8.82 average) while activating fewer parameters. The wins gap (4 vs 12) looks bad but with only 30 questions that's within noise range. The real story is that the MoE variant errored out on 2 questions entirely — reliability, not capability, is the issue right now.
I don't think it's a pendulum swing back to dense so much as MoE needing another generation of stability work. If Google fixes the reliability issues, the efficiency argument is strong.
virtualunc@reddit
the MoE numbers on gemma 4 26b are wild.. getting close to the dense 31b while being way cheaper to run. appreciate the methodology transparency too, most people just post "X is better" with zero context on how they tested
did you notice any difference in longer-context performance? imo that's where the real gap shows up between these models
Silver_Raspberry_811@reddit (OP)
Good question — I didn't test longer context specifically in this batch. All 30 questions were single-turn, relatively short prompts. Context window stress testing (8K, 16K, 32K+ input) is a different eval entirely and would probably show bigger gaps between these models than what I found here. Worth designing a dedicated long-context batch for.
And yeah, the MoE efficiency story is the sleeper finding here. Same average score at significantly lower compute is meaningful for local deployment.
RegularHumanMan001@reddit
The single-judge / absolute scoring tradeoff you made is reasonable but the part worth interrogating is whether claude opus 4.6 has consistent sensitivity across all five question categories. judges tend to have strong preferences for certain response styles that show up unevenly across task types you might get reliable signal on reasoning and code where there are more objective markers, but communication and meta-alignment are exactly where bias and self-preference creep in most. The 3-5x token gap from qwen is probably what's driving the lower average despite winning more questions.
Would definitely be worth swapping out the judge model maybe try using a smaller more focused model?
Silver_Raspberry_811@reddit (OP)
You're hitting the exact issue I've been wrestling with. Opus 4.6 is strong on code and reasoning where there are objective markers, but you're right that communication and meta-alignment are where its preferences bleed through most. I actually have the per-category judge variance data from 150 prior frontier evals — score distributions are tighter on code/reasoning and wider on communication, which supports your point.
The token gap driving the wins-vs-average split is almost certainly what's happening. Qwen's three 0.0 scores (likely format failures) tank the average while not affecting win count. Strip those and it's the highest scorer by a clear margin.
On swapping the judge — I've considered it but the tradeoff is parse reliability. Opus 4.6 hit 99.9% across 1,067 judgments. Smaller models I've tested drop to 85-90% and introduce their own biases. Multi-judge panels where you average across 2-3 models is probably the real answer. That's on the roadmap.
ambient_temp_xeno@reddit
LLM as judge = no thanks.
It also depends on how you're running Gemma 4 for the test. The new custom parser for Gemma 4 in llama.cpp b8665 has fixed it for me. Before, it failed the test of just being given the image below. Now it solves it.
high_funtioning_mess@reddit
[Copying my reply from another comment]
I think the same LLM that answered the question acting as a judge to review its own answer has some merit to it. It shows how good the model is at judging/knowing whether the answer/thought process is correct or not.
Silver_Raspberry_811@reddit (OP)
Interesting — hadn't seen braintwin before. I'll take a look. My eval engine is open-source (github.com/themultivac/multivac-evaluation) so if there's overlap or interoperability worth exploring I'm open to it.
Silver_Raspberry_811@reddit (OP)
Understood. LLM-as-judge has real limitations — verbosity bias, self-preference, positional effects. I use it because the alternative (human evaluation at scale) costs 100x more and I'm one person with no funding.
What I can say: Claude Opus 4.6 had a 99.9% parse rate across 1,067 prior judgments, scored in the 7.33 average range (not inflating everything to 9+), and when I compared its rankings against a full 10-model peer matrix, they correlated at 73%. Not perfect. Better than nothing.
The human baseline study is on the roadmap — comparing AI judge rankings against human preferences on the same questions. That's the only way to settle this properly.
Good to know about the llama.cpp b8665 parser fix. I'll note that for anyone running Gemma 4 locally.
Far-Low-4705@reddit
Use programmatic scoring.
Write unit test cases for coding problems, test for exact matches in math problems, score multiple choice questions etc. all of these are better because they give objective results, not opinionated results
Silver_Raspberry_811@reddit (OP)
Agreed that programmatic scoring is more objective where it applies. For code questions I'm building automated pytest execution — run the model's code against test cases, pass/fail, no opinion involved. For math with exact answers, same thing.
The challenge is that 4 of my 5 categories (reasoning, analysis, communication, meta-alignment) don't have clean programmatic answers. "Write a technical proposal" or "explain the flaw in this reasoning" can't be unit tested. LLM-as-judge is the fallback for those.
The roadmap is: programmatic scoring for code and math, LLM-as-judge for everything else, human baseline study to validate the judge. Hybrid approach, not all-or-nothing.
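The programmatic code-scoring piece can be sketched in a few lines: run the model's code plus the question's assertions in a subprocess and record pass/fail. This is a sketch of the idea, not the engine's actual harness, and it does no sandboxing — a real setup should isolate execution:

```python
import os
import subprocess
import sys
import tempfile

def run_code_question(model_code: str, test_code: str, timeout: int = 10) -> bool:
    """Execute model-generated code plus test assertions in a subprocess.

    Pass/fail, no judge opinion involved. Timeouts and crashes count as fail.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(model_code + "\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
```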
high_funtioning_mess@reddit
Interesting! Can you run your custom benchmarks on this free website and share the benchmark results so that we can see the tests and compare how good this is?
It supports llm as a judge as well.
https://benchmark.braintwin.ai
spky-dev@reddit
30 questions is an incredibly insignificant sample size.
Sadman782@reddit
I don't know how you ran it, if you're running it locally using llama.cpp, use the b8660 llama.cpp build (more recent versions have a regression, another tokenization issue) and follow this: https://www.reddit.com/r/LocalLLaMA/comments/1scw979/gemma_4_for_16_gb_vram
I am sure the 26B will do much better.
StardockEngineer@reddit
Is it still regressed? You didn't link to an issue so I can't tell without digging. Help us out :)
PKJedi@reddit
These tips work nicely, thank you!
I have a couple offtopic questions, if you don't mind. Others are free to answer of course.
- Where do you recommend following LocalLLaMA related discussion? Here, the LocalLLM Discord, more?
- Which harness(es) do you recommend for coding with limited size models? OpenCode / Pi / Qwen / other? Any recent guides or important config tips?
Silver_Raspberry_811@reddit (OP)
Hey thanks for stopping by, if you are interested in more: https://discord.gg/2V3dg7hc
RelicDerelict@reddit
No Discord please!
Silver_Raspberry_811@reddit (OP)
Great catches on both fronts. I didn't know about the llama.cpp regression — I ran these through OpenRouter's API so the inference stack is whatever their provider uses, which I can't control. That's a real limitation I should call out more explicitly.
Your judge prompt suggestion is interesting. I use absolute scoring (each response scored independently) specifically to avoid anchoring effects from seeing other responses. But your approach of scoring all responses to the same question together would give better relative ranking. Worth testing both and comparing. I'll try a side-by-side on the next batch.
The recommended sampler settings are noted — I'll add those as a caveat and consider a local rerun with your params. If you want to follow along or run the same questions locally, the engine and all questions are open-source: github.com/themultivac/multivac-evaluation
spaceman_@reddit
Where do those values come from? Google recommends different values on their release page.
Sadman782@reddit
by testing myself.
It depends on your use case, for creative writing or other creative tasks temp 1 might be better, but for coding or tasks which require high accuracy especially if you use a quantized version these values yield much better results
spaceman_@reddit
Thanks for sharing!
Zeeplankton@reddit
Baseless conspiracy theory of the day: I think a lot of the times these companies just copy and paste docs to get the release out. There's rarely a clean answer for temp / top k etc.
Beginning-Window-115@reddit
I've never seen a model use top-k 64 and yet Gemma uses it for some reason
Substantial-Ebb-584@reddit
This. And Claude models alone as a judge is not a good test, since the results will be biased. Additionally claude will bias towards results that have answers similar with its programming.
Traditional-Gap-3313@reddit
If you're using Claude Code, just spin up multiple subagents, one per query, and every one of them will have a clean context.
Jumbling multiple evaluations together introduces noise. If your rubric is correctly written, then not seeing other samples should be a benefit. If it needs to see all of them to judge, then you can't trust the grades and simply need to rank them.
Final-Frosting7742@reddit
I find a model that spends 75% of its tokens thinking unusable on local hardware, especially for RAG tasks where there are already big contexts to process.
That's why I don't like the Qwen family. Their over-verbosity outweighs any benefit they seem to have in terms of reasoning and such.
You should add inverse-verbosity weights. Providing the right answer with fewer tokens = better quality.
Far-Low-4705@reddit
Try giving it a single tool.
It stops the overthinking completely
Silver_Raspberry_811@reddit (OP)
This is a valid design question. Right now the rubric weights correctness (25%), completeness (20%), clarity (20%), depth (20%), usefulness (15%). None of those explicitly reward brevity. Adding a token-efficiency metric — quality per token — is something I've been thinking about.
The counterargument: if a model produces a more thorough answer in more tokens, should it be penalized? For local deployment where inference cost matters, yes. For quality evaluation, maybe not. I might report both: raw score and score-per-1K-tokens.
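For concreteness, here's the composite score and the proposed per-1K-token variant side by side (the weights are the rubric weights above; the efficiency metric is hypothetical, not something the engine reports yet):

```python
# Rubric weights from the current scoring setup.
WEIGHTS = {
    "correctness": 0.25,
    "completeness": 0.20,
    "clarity": 0.20,
    "depth": 0.20,
    "usefulness": 0.15,
}

def weighted_score(subscores: dict) -> float:
    """Composite 0-10 score from per-dimension subscores."""
    return sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS)

def score_per_1k_tokens(subscores: dict, tokens: int) -> float:
    """Proposed efficiency metric: quality normalized by response length."""
    return weighted_score(subscores) / (tokens / 1000)
```

Reporting both keeps the raw-quality view intact while making the verbosity tax visible.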
StupidityCanFly@reddit
For RAG I just send enable_thinking false and it works like a charm.
H_DANILO@reddit
That's interesting, because for me Qwen 3.5 397B is one of the most token-efficient open models.
I find that the lower the parameters are, the more it tends to be verbose.
stddealer@reddit
Gemma 4 31B is the first model that can successfully solve some riddles containing red herrings I like to test models with. Qwen 3.5 27B gets fixated on the irrelevant information and gives a wrong answer, while Gemma 4 manages to ignore it.
Silver_Raspberry_811@reddit (OP)
Hey, thanks for stopping by. If interested in more, join us here: https://discord.gg/2V3dg7hc
Zeeplankton@reddit
I just really like how Gemma formats replies / communicates. It's a bit too glazy but it's just nice to read in LM Studio. And 26B-A4B is so fast on my M3 Max: 60 tok/s.
Silver_Raspberry_811@reddit (OP)
Hey, thanks for stopping by. If interested in more, join us here: https://discord.gg/2V3dg7hc
Potential-Leg-639@reddit
A test with this one as well would be really interesting https://huggingface.co/Jackrong/Qwopus3.5-27B-v3
Silver_Raspberry_811@reddit (OP)
Hey, thanks for stopping by. If interested in more, join us here: https://discord.gg/2V3dg7hc
SnooWoofers7340@reddit
Nice testing man, some valuable info here, thanks for sharing.
Silver_Raspberry_811@reddit (OP)
Hey, thanks for stopping by. If interested in more, join us here: https://discord.gg/2V3dg7hc
3dom@reddit
Thank you! Very interesting test. Could be great to add Qwen 35B though.
Silver_Raspberry_811@reddit (OP)
Hey, thanks for stopping by. If interested in more, join us here: https://discord.gg/2V3dg7hc
daviden1013@reddit
I'm interested to see how Qwen3.5 35B A3B does. It seems more meaningful to compare Gemma 31B vs. Qwen 27B (dense) and Gemma 26B A4B vs. Qwen 35B A3B (MoE).
Eyelbee@reddit
This is genuinely more useful than most benchmarks. How did you run them btw, was it f16 versions?
Silver_Raspberry_811@reddit (OP)
Ran through OpenRouter API, so whatever quant/config the provider uses. Not f16 — that's the limitation of API-based evals. The engine is open-source if you want to run locally with known quants: github.com/themultivac/multivac-evaluation
Middle_Bullfrog_6173@reddit
The results look like you need harder tasks or a stricter rubric to really tell the difference between these. Do you have subscores you can use to tell how the differences come about in practice? E.g. completeness vs correctness vs writing quality or whatever.
Silver_Raspberry_811@reddit (OP)
If you have better suggestions or an architecture design, let me know. I'll try it out in an upcoming phase as resources and headcount increase.
Silver_Raspberry_811@reddit (OP)
All models ran through OpenRouter API — I don't control quantization. That's a known limitation and it's documented in the model_metadata.json saved with each eval.
Silver_Raspberry_811@reddit (OP)
Yes — every judgment has five subscores: correctness, completeness, clarity, depth, usefulness. I can break those out per model. Quick version: Gemma 4 31B scored highest on clarity across the board. Qwen 3.5 27B scored highest on completeness but lowest on clarity. That tracks with the verbosity pattern.
Full per-question scores are on GitHub: github.com/themultivac/multivac-evaluation/tree/main/data/GEMMA4-H2H-20260404
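The subscore rollup is just a grouped mean over the per-question records. A sketch (the record layout is illustrative, not the engine's actual schema):

```python
from collections import defaultdict

def subscore_means(rows):
    """Aggregate per-question subscores into per-model means.

    rows: iterable of (model, {dimension: score}) pairs.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for model, subs in rows:
        counts[model] += 1
        for dim, score in subs.items():
            sums[model][dim] += score
    return {
        m: {d: sums[m][d] / counts[m] for d in sums[m]} for m in sums
    }
```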
Charming_Support726@reddit
Nice. My simple take is that both models are in the same ballpark. Qwen 3.5 has some advantage, but Gemma 4 is very good, especially in human communication - hard to measure with LLM-as-a-Judge.
It feels like Gemma is just lacking a bit of tuning
Silver_Raspberry_811@reddit (OP)
Good to see you again. Agreed — they're in the same ballpark. The Gemma 4 communication results are interesting because that's where subjective quality matters most and LLM-as-judge is weakest. Would love to hear if your experience running them locally matches these scores.
Look_0ver_There@reddit
Would've been nice to also include Qwen 3.5 35B-A3B, since that is the closest counterpart to Gemma 4 26B-A4B
I'm also a little confused on how a "win" is chosen.
Silver_Raspberry_811@reddit (OP)
You're all right — the MoE-to-MoE comparison (Gemma 4 26B-A4B vs Qwen 3.5 35B-A3B) is the more meaningful matchup. I'll queue that as the next H2H. Dense-vs-dense and MoE-vs-MoE makes more sense than mixing them.
If you want to suggest which questions or categories matter most for that comparison, the Discord has a #suggest-models channel: discord.gg/QvVTPCxH
Fun_Nebula_9682@reddit
solid setup, appreciate the transparency on methodology. one thing worth checking — in my experience claude as judge tends to favor longer, more structured responses. if one of the three consistently outputs more text that could inflate scores independent of actual quality. easy to check by plotting score vs response length across all 90 answers.
also the meta-alignment category feels like it'd be most susceptible to single-judge bias — claude will naturally prefer responses that match its own alignment style. running even one more judge (local llama 3 70b or qwen) and checking if rankings hold would make the results way more convincing imo
Silver_Raspberry_811@reddit (OP)
Both suggestions are spot-on. I do have the token counts per response — Qwen averaged 3-5x more tokens than the Gemma models. I haven't plotted score vs response length yet but that's exactly the right analysis. Will do that and post the correlation.
On adding a second judge: I ran judge stats across 150 prior frontier evals and Claude Opus 4.6 was the most reliable by a wide margin (99.9% parse, balanced scoring). Adding a local Llama 3 70B as a second judge is a good idea for cross-validation. The methodology discussion channel on the Discord has been good for working through these design questions: discord.gg/QvVTPCxH
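The length-bias check is straightforward once you have (tokens, score) pairs for all 90 answers. A dependency-free sketch with made-up numbers:

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Illustrative data only; the real pairs come from the eval logs.
token_counts = [410, 950, 1800, 600, 1400]
judge_scores = [8.0, 8.5, 9.0, 7.5, 8.8]

r = pearson(token_counts, judge_scores)
# |r| near 1 would suggest the judge rewards length; near 0, no length bias.
```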
dubesor86@reddit
The token verbosity/inefficiency is a real killer during local use.
WetSound@reddit
In my tests 31B can go way deeper and more complex than the others before totally losing it.
ShelZuuz@reddit
Would be good if these results have t/s, because 8.82 on both 26B-A4B and 31B doesn't make them equivalent.