Personal Eval follow-up: Gemma4 26B MoE (Q8) vs Qwen3.5 27B Dense vs Gemma4 31B Dense

Posted by Lowkey_LokiSN@reddit | LocalLLaMA | View on Reddit | 36 comments

This is a follow-up to my previous post comparing Qwen 3.6 35B vs Gemma 4 26B.

I particularly wanted to follow up on three things:

1. Gemma 4 26B could have suffered a quantization tax at Q4 and might perform drastically better with an 8-bit quant, so I put that to the test with UD's Q8_K_XL this time.
2. A lot of people (including myself) were curious how the dense Qwen 3.5 27B would perform in these tests.
3. Speaking of dense models, I also wanted to include Gemma 4 31B to see how it performs.

The results below are consolidated with the previous run for a complete comparison.


1. Test Results

| Metric | Qwen3.6-35B Q4 | Gemma4-26B Q4 | Gemma4-26B Q8 | Qwen3.5-27B Q4 | Gemma4-31B Q4 |
|---|---|---|---|---|---|
| Baseline failures | 37 | 37 | 37 | 37 | 37 |
| Tests fixed | 32 (86.5%) | 28 (75.7%) | 17 (45.9%) | 37 (100%) | 37 (100%) |
| Regressions | 0 | 8 | 0 | 0 | 0 |
| Net score | 32 | 20 | 17 | 37 | 37 |
| Still failing (of 37) | 5 | 9 | 20 | 0 | 0 |
| Post-run total failures | 5 | 17 | 20 | 0 | 0 |
| Guardrail violations | 0 | 0 | 0 | 0 | 0 |
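The derived rows appear to follow simple identities, assuming net score means tests fixed minus regressions and post-run failures means still-failing tests plus regressions; a quick sketch reproducing the table:

```python
# Values copied from the results table above.
# Assumption: "Net score" = tests fixed - regressions, and
# "Post-run total failures" = still failing + regressions.
runs = {
    "Qwen3.6-35B Q4": {"fixed": 32, "regressions": 0, "still_failing": 5},
    "Gemma4-26B Q4":  {"fixed": 28, "regressions": 8, "still_failing": 9},
    "Gemma4-26B Q8":  {"fixed": 17, "regressions": 0, "still_failing": 20},
    "Qwen3.5-27B Q4": {"fixed": 37, "regressions": 0, "still_failing": 0},
    "Gemma4-31B Q4":  {"fixed": 37, "regressions": 0, "still_failing": 0},
}
for name, r in runs.items():
    net = r["fixed"] - r["regressions"]
    post_run = r["still_failing"] + r["regressions"]
    print(f"{name}: net score {net}, post-run failures {post_run}")
```

Both identities match every column, which is why Gemma4 26B Q4 ends the run with 17 failures despite fixing 28 tests.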

2. Token Usage

| Metric | Qwen3.6 Q4 | Gemma4 26B Q4 | Gemma4 26B Q8 | Qwen3.5-27B Q4 | Gemma4 31B Q4 |
|---|---|---|---|---|---|
| Input tokens | 634,965 | 1,005,964 | 703,732 | 553,137 | 1,115,666 |
| Output tokens | 39,476 | 89,750 | 68,055 | 42,183 | 62,465 |
| Grand total (I+O) | 674,441 | 1,095,714 | 771,787 | 595,320 | 1,178,131 |
| Cache read tokens | 4,241,502 | 3,530,520 | 3,044,400 | 7,518,047 | 3,335,808 |
| Output/Input ratio | 1:16 | 1:11 | 1:10 | 1:13 | 1:17 |
| Tokens per fix | ~21K | ~39K | ~45K | ~16K | ~32K |
| Tokens per net score point | ~21K | ~55K | ~45K | ~16K | ~32K |

Qwen3.5-27B remains the most token-efficient (16K/fix). Gemma4 31B used 1.18M total tokens — the most of any model — largely due to 8 compaction events forcing full context re-prefills. Its cache read total (3.3M) is modest because compactions reset the cache each time.
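The efficiency figures above can be recomputed directly from the raw totals, assuming tokens-per-fix means grand total (input plus output) divided by tests fixed:

```python
# (grand total tokens, tests fixed) per run, copied from the tables above.
totals = {
    "Qwen3.6-35B Q4": (674_441, 32),
    "Gemma4-26B Q4":  (1_095_714, 28),
    "Gemma4-26B Q8":  (771_787, 17),
    "Qwen3.5-27B Q4": (595_320, 37),
    "Gemma4-31B Q4":  (1_178_131, 37),
}
for name, (tokens, fixes) in totals.items():
    print(f"{name}: ~{tokens / fixes / 1000:.0f}K tokens per fix")
```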


3. Tool Calls

| Tool | Qwen3.6 Q4 | Gemma4 26B Q4 | Gemma4 26B Q8 | Qwen3.5-27B Q4 | Gemma4 31B Q4 |
|---|---|---|---|---|---|
| read | 46 | 39 | 25 | 91 (1 err) | 37 |
| bash | 33 | 30 | 31 | 23 | 29 |
| edit | 14 | 13 | 12 (1 err) | 31 | 21 |
| grep | 16 | 10 | 6 | 33 | 6 |
| write | 1 | 0 | 4 | 1 | 1 |
| glob | 1 | 1 | 3 | 1 | 2 |
| todowrite | 4 | 3 | 1 | 1 | 4 |
| Total | 115 | 96 | 82 | 181 | 100 |
| Successful | 115 (100%) | 96 (100%) | 81 (98.8%) | 180 (99.4%) | 100 (100%) |
| Failed | 0 | 0 | 1 | 1 | 0 |

| Derived Metric | Qwen3.6 Q4 | Gemma4 26B Q4 | Gemma4 26B Q8 | Qwen3.5-27B Q4 | Gemma4 31B Q4 |
|---|---|---|---|---|---|
| Unique files read | 18 | 27 | 19 | 23 | 27 |
| Unique files edited | 7 | 13 | 9 | 9 | 12 |
| Reads per unique file | 2.6 | 1.4 | 1.3 | 4.0 | 1.4 |
| Tool calls per minute | 2.3 | 1.1 | 1.2 | 1.2 | 0.16 |
| Edits per fix | 0.44 | 0.46 | 0.65 | 0.84 | 0.57 |
| Bash (pytest) runs | 33 | 30 | 31 | 23 | 29 |
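Two of the derived rows can be sanity-checked from the raw counts, assuming tool calls per minute divides total calls by the wall-clock minutes reported in the timing section below:

```python
# Per-run: total tool calls, wall clock (s), read calls, unique files read,
# all copied from the tables in this post.
stats = {
    "Qwen3.6 Q4":     {"calls": 115, "wall_s": 2_950,  "reads": 46, "files_read": 18},
    "Gemma4 26B Q4":  {"calls": 96,  "wall_s": 5_129,  "reads": 39, "files_read": 27},
    "Gemma4 26B Q8":  {"calls": 82,  "wall_s": 4_142,  "reads": 25, "files_read": 19},
    "Qwen3.5-27B Q4": {"calls": 181, "wall_s": 8_698,  "reads": 91, "files_read": 23},
    "Gemma4 31B Q4":  {"calls": 100, "wall_s": 37_748, "reads": 37, "files_read": 27},
}
for name, s in stats.items():
    print(f"{name}: {s['calls'] / (s['wall_s'] / 60):.2f} calls/min, "
          f"{s['reads'] / s['files_read']:.1f} reads per unique file")
```

Qwen3.5-27B's 4.0 reads per unique file stands out: it re-reads the same files far more often than any other model here.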

4. Timing & Efficiency

| Metric | Qwen3.6 Q4 | Gemma4 26B Q4 | Gemma4 26B Q8 | Qwen3.5-27B Q4 | Gemma4 31B Q4 |
|---|---|---|---|---|---|
| Wall clock | 2,950s (49m) | 5,129s (85m) | 4,142s (69m) | 8,698s (145m) | 37,748s (629m) |
| Total steps | 120 | 104 | 88 | 186 | 109 |
| Avg step duration | 10.0s | 21.7s | 24.0s | 15.9s | 82.2s |

Gemma4 31B's 82.2s average step duration is dominated by its 7.1 t/s decode speed and 8 compaction events. Each compaction required re-prefilling 60-80K tokens at 39 t/s, costing 25-35 minutes per event. Compaction re-prefills alone account for ~36% of total wall time.
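The compaction cost estimate above is easy to verify as back-of-envelope arithmetic, using the figures stated in the text (8 compactions, 60-80K tokens re-prefilled per event, 39 t/s prefill, 37,748s wall clock):

```python
# Back-of-envelope check of the compaction cost claim.
compactions = 8
prefill_speed = 39      # tokens/s, from the text
wall_clock = 37_748     # seconds, from the timing table

for ctx_tokens in (60_000, 80_000):
    per_event = ctx_tokens / prefill_speed   # seconds per compaction
    total = compactions * per_event
    print(f"{ctx_tokens:,} tok: {per_event / 60:.0f} min/event, "
          f"{100 * total / wall_clock:.0f}% of wall time")
```

The 60K and 80K bounds bracket the post's 25-35 minutes per event, and the midpoint lands near the stated ~36% of total wall time.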


5. Model & Server Configuration

| Property | Qwen3.6-35B Q4 | Gemma4-26B Q4 | Gemma4-26B Q8 | Qwen3.5-27B Q4 | Gemma4-31B Q4 |
|---|---|---|---|---|---|
| Total parameters | 35B | 26B | 26B | 27B | 31B |
| Active parameters | 3B | 4B | 4B | 27B | 31B |
| Quantization | Q4_K_XL | Q4_K_XL | Q8_K_XL | Q4_K_XL | Q4_K_XL |
| Context | 100,000 | 100,000 | 100,000 | 100,000 | 100,000 |
| temperature | 0.6 | 1.0 | 1.0 | 0.6 | 1.0 |
| top_p | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 |
| top_k | 20 | 64 | 64 | 20 | 64 |
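For anyone wanting to reproduce a row of this config: the post doesn't name the inference server, but since these are GGUF quants, a llama.cpp `llama-server` invocation along these lines would match the Gemma4 26B Q8 column (the model filename here is a placeholder, not from the post):

```shell
# Hypothetical llama-server launch matching the Gemma4-26B Q8 settings above.
# Filename is a placeholder; sampling values are from the config table.
llama-server \
  -m gemma4-26b-UD-Q8_K_XL.gguf \
  --ctx-size 100000 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64
```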

Key Observations

I'd say Gemma4 31B is objectively the most capable model in my tests, but it's also the slowest of the bunch on my setup. Qwen 3.5 27B follows closely with comparable performance at far more tolerable speeds. Qwen 3.6 35B remains the speed-to-performance champ and will remain my daily driver for that reason.