Personal Eval follow-up: Gemma4 26B MoE (Q8) vs Qwen3.5 27B Dense vs Gemma4 31B Dense Compared
Posted by Lowkey_LokiSN@reddit | LocalLLaMA | View on Reddit | 36 comments
This is a follow-up to my previous post comparing Qwen 3.6 35B vs Gemma 4 26B.
I particularly wanted to follow up on the following:
1. Gemma 4 26B could have suffered a quantization tax and might perform drastically better at 8-bit, so I put that to the test with UD's Q8_K_XL this time.
2. A lot of people (including myself) were curious to see how the dense Qwen 3.5 27B would perform on these tests.
3. Speaking of dense models, I also wanted to include Gemma 4 31B to see how it performs.
Results below are consolidated with the previous run for a complete comparison.
1. Test Results
| Metric | Qwen3.6-35B Q4 | Gemma4-26B Q4 | Gemma4-26B Q8 | Qwen3.5-27B Q4 | Gemma4-31B Q4 |
|---|---|---|---|---|---|
| Baseline failures | 37 | 37 | 37 | 37 | 37 |
| Tests fixed | 32 (86.5%) | 28 (75.7%) | 17 (45.9%) | 37 (100%) | 37 (100%) |
| Regressions | 0 | 8 | 0 | 0 | 0 |
| Net score | 32 | 20 | 17 | 37 | 37 |
| Still failing (of 37) | 5 | 9 | 20 | 0 | 0 |
| Post-run total failures | 5 | 17 | 20 | 0 | 0 |
| Guardrail violations | 0 | 0 | 0 | 0 | 0 |
2. Token Usage
| Metric | Qwen3.6 Q4 | Gemma4 26B Q4 | Gemma4 26B Q8 | Qwen3.5-27B Q4 | Gemma4 31B Q4 |
|---|---|---|---|---|---|
| Input tokens | 634,965 | 1,005,964 | 703,732 | 553,137 | 1,115,666 |
| Output tokens | 39,476 | 89,750 | 68,055 | 42,183 | 62,465 |
| Grand total (I+O) | 674,441 | 1,095,714 | 771,787 | 595,320 | 1,178,131 |
| Cache read tokens | 4,241,502 | 3,530,520 | 3,044,400 | 7,518,047 | 3,335,808 |
| Output/Input ratio | 1:16 | 1:11 | 1:10 | 1:13 | 1:17 |
| Tokens per fix | ~21K | ~39K | ~45K | ~16K | ~32K |
| Tokens per net score point | ~21K | ~55K | ~45K | ~16K | ~32K |
Qwen3.5-27B remains the most token-efficient (16K/fix). Gemma4 31B used 1.18M total tokens — the most of any model — largely due to 8 compaction events forcing full context re-prefills. Its cache read total (3.3M) is modest because compactions reset the cache each time.
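The per-fix and per-net-point figures above can be reproduced directly from the Section 1 and Section 2 totals; a minimal sketch (all numbers taken from the tables above):

```python
# Reproduce the token-efficiency metrics from the tables above.
runs = {
    "Qwen3.6-35B Q4": {"total": 674_441,   "fixes": 32, "net": 32},
    "Gemma4-26B Q4":  {"total": 1_095_714, "fixes": 28, "net": 20},
    "Gemma4-26B Q8":  {"total": 771_787,   "fixes": 17, "net": 17},
    "Qwen3.5-27B Q4": {"total": 595_320,   "fixes": 37, "net": 37},
    "Gemma4-31B Q4":  {"total": 1_178_131, "fixes": 37, "net": 37},
}

for name, r in runs.items():
    per_fix = r["total"] / r["fixes"] / 1000  # thousands of tokens per fixed test
    per_net = r["total"] / r["net"] / 1000    # thousands of tokens per net score point
    print(f"{name}: ~{per_fix:.0f}K/fix, ~{per_net:.0f}K/net point")
```

The Gemma 4 26B Q4 row is the one where the two metrics diverge (39K/fix vs 55K/net point), since its 8 regressions drag the net score below the fix count.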
3. Tool Calls
| Tool | Qwen3.6 Q4 | Gemma4 26B Q4 | Gemma4 26B Q8 | Qwen3.5-27B Q4 | Gemma4 31B Q4 |
|---|---|---|---|---|---|
| read | 46 | 39 | 25 | 91 (1 err) | 37 |
| bash | 33 | 30 | 31 | 23 | 29 |
| edit | 14 | 13 | 12 (1 err) | 31 | 21 |
| grep | 16 | 10 | 6 | 33 | 6 |
| write | 1 | 0 | 4 | 1 | 1 |
| glob | 1 | 1 | 3 | 1 | 2 |
| todowrite | 4 | 3 | 1 | 1 | 4 |
| Total | 115 | 96 | 82 | 181 | 100 |
| Successful | 115 (100%) | 96 (100%) | 81 (98.8%) | 180 (99.4%) | 100 (100%) |
| Failed | 0 | 0 | 1 | 1 | 0 |
| Derived Metric | Qwen3.6 Q4 | Gemma4 26B Q4 | Gemma4 26B Q8 | Qwen3.5-27B Q4 | Gemma4 31B Q4 |
|---|---|---|---|---|---|
| Unique files read | 18 | 27 | 19 | 23 | 27 |
| Unique files edited | 7 | 13 | 9 | 9 | 12 |
| Reads per unique file | 2.6 | 1.4 | 1.3 | 4.0 | 1.4 |
| Tool calls per minute | 2.3 | 1.1 | 1.2 | 1.2 | 0.16 |
| Edits per fix | 0.44 | 0.46 | 0.65 | 0.84 | 0.57 |
| Bash (pytest) runs | 33 | 30 | 31 | 23 | 29 |
4. Timing & Efficiency
| Metric | Qwen3.6 Q4 | Gemma4 26B Q4 | Gemma4 26B Q8 | Qwen3.5-27B Q4 | Gemma4 31B Q4 |
|---|---|---|---|---|---|
| Wall clock | 2,950s (49m) | 5,129s (85m) | 4,142s (69m) | 8,698s (145m) | 37,748s (629m) |
| Total steps | 120 | 104 | 88 | 186 | 109 |
| Avg step duration | 10.0s | 21.7s | 24.0s | 15.9s | 82.2s |
Gemma4 31B's 82.2s average step duration is dominated by its 7.1 t/s decode speed and 8 compaction events. Each compaction required re-prefilling 60-80K tokens at 39 t/s, costing 25-35 minutes per event. Compaction re-prefills alone account for ~36% of total wall time.
5. Model & Server Configuration
| Property | Qwen3.6-35B Q4 | Gemma4-26B Q4 | Gemma4-26B Q8 | Qwen3.5-27B Q4 | Gemma4-31B Q4 |
|---|---|---|---|---|---|
| Total parameters | 35B | 26B | 26B | 27B | 31B |
| Active parameters | 3B | 4B | 4B | 27B | 31B |
| Quantization | Q4_K_XL | Q4_K_XL | Q8_K_XL | Q4_K_XL | Q4_K_XL |
| Context | 100,000 | 100,000 | 100,000 | 100,000 | 100,000 |
| temperature | 0.6 | 1.0 | 1.0 | 0.6 | 1.0 |
| top_p | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 |
| top_k | 20 | 64 | 64 | 20 | 64 |
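For reference, the Gemma sampling settings above map onto a llama.cpp server invocation roughly like this (the model path and port are placeholders, and flag spellings follow recent llama.cpp builds, so check your version):

```shell
# Hypothetical llama-server launch matching the Gemma sampling config above
# (model path is a placeholder; only the standard llama.cpp flags are used).
llama-server \
  -m ./gemma4-31b-Q4_K_XL.gguf \
  -c 100000 \
  --temp 1.0 --top-p 0.95 --top-k 64 \
  --port 8080
```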
Key Observations
- Gemma 4 26B's performance remains in the same ballpark even at Q8. It scored slightly worse than Q4 in this run, but that variance is likely noise. I'll stick with my Q4_K_XL quant.
- Both Qwen 3.5 27B and Gemma 4 31B aced the test. The dense models are in a different league from the MoE ones (especially the Gemma 31B).
- Gemma 4 31B is the most efficient at tool calling: it fixed all issues in just 100 error-free tool calls.
- Qwen 3.5 27B is the most token-efficient, expending an average of ~16K tokens per fix.
- Gemma 4 31B also exhibited extremely low inference speeds for some reason and ran for 10 hours and 29 minutes as a result. DRAM usage also bloated up to 70GB even with the -cram and -ctxcp flags. I'm not sure if this is expected.
I'd say Gemma 4 31B is objectively the most capable in my tests, but it's also the slowest of the bunch on my setup. Qwen 3.5 27B follows with comparable performance at far more tolerable speeds. Qwen 3.6 35B remains the speed-to-performance champ and will remain my daily driver for that reason.
MaxEkb77@reddit
make new test on qwen3.6-27b :)
Lowkey_LokiSN@reddit (OP)
lol, 3.5 27B aced it already.
But yes, I'm considering broadening and hardening the eval set so it can withstand the test of time. Local models are too good for beginner stuff nowadays
sagiroth@reddit
Qwen 3.6 27B when?
DjsantiX@reddit
We are all waiting for 3.6 dense aahahha i think they are stalling because they will be very powerful models
irlylikeonionrings@reddit
What do you mean? Qwen 3.6 27b (dense) is already released
DjsantiX@reddit
1 hour after the comment ahah Damn, I was off by a few hours
Fresh-Resolution182@reddit
629 minutes for Gemma4 31B is the buried headline here. same score as Qwen3.5 27B at 145min but 4x wall time and DRAM bloating to 70GB. the dense models win on evals but that MI50 might be hitting a memory bandwidth cliff.
DependentBat5432@reddit
Why the dense models crushed the MoE ones so hard, is this a general pattern or specific to coding tasks?
TapAggressive9530@reddit
What tests are you running ? Please list here if you can
RMK137@reddit
Running Gemma 4 31B with LM Studio server blows up ram for me too especially when using it to explore a codebase with a coding agent. I am not sure what is going on, hope it gets fixed soon.
Sadman782@reddit
in llama.cpp you can fix it with --ctx-checkpoints 1
I don't know about LM Studio, I don't use it as it doesn't give you maximum control even if it's using llama.cpp as a backend
Lowkey_LokiSN@reddit (OP)
Unfortunately this doesn't mitigate the DRAM issue for the model. I did run Gemma 4 31B with -ctxcp (short flag alternative for --ctx-checkpoints) and -cram flags
It works reliably for the 26B MoE but not the dense model
finevelyn@reddit
Each checkpoint with Gemma 4 31B is more than 3GB, so maybe you just didn't set -ctxcp to a low enough value. The flag does work with this model. The slow speed sounds like the model didn't fit completely in VRAM, because otherwise it's only slightly slower than Qwen 27B.
Sufficient-Ninja541@reddit
try --parallel 1
RMK137@reddit
Thanks I'll give it a shot with pure llama.cpp.
MerePotato@reddit
Q8 underperforming Q4 makes me highly suspicious that something was wrong with your setup here
Lowkey_LokiSN@reddit (OP)
I don’t think so. There’s always a level of variance with each run and these inconsistencies come off as noise to me. Unlike the Q4, the Q8 didn’t regress from its previous solutions with this run but scored a lower baseline.
A conventional bench would consolidate multiple such runs and provide an average but mine is just a personal test and a single run was a good-enough signal for me to catch quant-based differences which could be drastic.
Was this the best run with Q8? Probably not. But was this enough to gauge quantisation tax? IMO yes
Worried-Squirrel2023@reddit
the qwen 3.6 35B-A3B vs dense 31B comparison is the real signal here. if MoE 3B-active beats a dense 31B at coding tasks, the dense models below 70B are basically obsolete for local agents. throughput plus quality at the same vram budget.
ambient_temp_xeno@reddit
But it didn't beat it?
computehungry@reddit
it's because the comment you're referring to is from a bot.
Plabbi@reddit
Your comment is the real signal here
FusionCow@reddit
well sure, but then when the 27b dense comes out and beats the moe what will you say then
Divergence1900@reddit
yep exactly. the only fair way to compare the two would be training both the models on the same dataset and method
PromptInjection_@reddit
thank you. nice test.
Designer_Reaction551@reddit
The 26B Q8 scoring lower than Q4 with zero regressions is the interesting anomaly here. Usually quant fidelity improves stability, so either the Q4 run caught lucky passes that Q8 more conservatively rejects, or the MoE routing is calibration-sensitive and Q4 rounding is shifting expert selection in ways that happen to match your benchmark better. A per-test agreement matrix between Q4 and Q8 would tell you which. Also curious whether Q8 was run with identical sampling params - even a tiny top_p shift flips this kind of tight gap.
Snoo_27681@reddit
Thanks for doing and sharing all this testing. So it seems that Qwen3.5-27B is still slightly the best model here. Although much slower than 35B. And Gemma takes way longer? Do you measure tok/sec and ttft and everything when you do these tests?
Lowkey_LokiSN@reddit (OP)
Yes, Gemma 4 31B takes way longer for some reason. I do measure tok/sec data but decided not to include it in the post since it's relative to my setup; the total wall time taken to complete the runs makes a lot more sense here for drawing general comparisons
DanielusGamer26@reddit
But it can be useful for someone looking for similar setup! Pls include them if you can. Thanks for your amazing work!
RegularRecipe6175@reddit
Good stuff sir. Would be curious to see 8 bit Q3.6.
Reactor-Licker@reddit
What hardware are you using?
Lowkey_LokiSN@reddit (OP)
1x MI50 32GB, Xeon 6148, 128GB ECC DDR4-2666
mr_Owner@reddit
I would suggest trying the new iq4_nl_xl quant from Unsloth as the 4-bit quant test. I don't see the specific quantizer and quants used.
Alternative_Win_6154@reddit
Could you try the Q8 of Qwen 3.6 35B? Q4 affects the MoE models much more.
Sadman782@reddit
Can you try Gemma 4 26B with top_k 20? top_k 64 doesn't make sense for coding even if Google specifically recommended it for the quantized models; I find it does significantly better with top_k 20.
onil_gova@reddit
Did you, by any chance, ensure that preserve_thinking was on for the Qwen3.6 model?
Lowkey_LokiSN@reddit (OP)
Oh, yes I did. Tests were run with preserve_thinking turned on.