Qwen 3.6 35B crushes Gemma 4 26B on my tests

Posted by Lowkey_LokiSN@reddit | LocalLLaMA | View on Reddit | 111 comments

I have a personal eval harness: a repo with around 30k lines of code containing 37 intentional issues for LLMs to debug and address through an agentic setup (I use OpenCode)

A subset of the harness also has the LLM extract key information from reasonably large PDFs (40-60 pages), summarize and evaluate its findings.

Long story short, the harness tests the following LLM attributes:

- Agentic capabilities
- Coding
- Image-to-text synthesis
- Instruction following
- Reasoning
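For anyone wondering how failures get tracked between runs: a minimal sketch of pulling failing test IDs out of pytest output (not my actual harness code; it assumes pytest is invoked with `-rf` so failures appear in the short test summary):

```python
import re

def failed_tests(pytest_output: str) -> set[str]:
    """Extract failing test IDs from the `FAILED path::test` lines
    that pytest prints in its short summary when run with -rf."""
    return {
        m.group(1)
        for m in re.finditer(r"^FAILED (\S+)", pytest_output, re.MULTILINE)
    }
```

Diffing two such sets (before and after the agent's run) gives you fixes, regressions, and still-failing tests for free.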

Both models were run at UD-Q4_K_XL for a fair baseline, each with its optimal sampling params. Gemma 4's GGUF was tested after Google's latest chat-template fixes, with the -cram and -ctkcp flags to mitigate DRAM blowups.

Here's how it went:

                        Qwen3.6             Gemma 4
                    ┌──────────────┐   ┌──────────────┐
  Tests Fixed       │   32 / 37    │   │   28 / 37    │
  Regressions       │      0       │   │      8       │
  Net Score         │     32       │   │     20       │
  Post-Run Failures │      5       │   │     17       │
  Duration          │    49 min    │   │    85 min    │
                    └──────────────┘   └──────────────┘
                       WINNER ✓

1. Test Results

| Metric | Qwen3.6-35B-A3B | Gemma 4-26B-A4B |
|---|---|---|
| Baseline failures | 37 | 37 |
| Tests fixed | 32 (86.5%) | 28 (75.7%) |
| Regressions | 0 | 8 |
| Net score (fixed − regressed) | 32 | 20 |
| Still failing (of original 37) | 5 | 9 |
| Post-run total failures | 5 | 17 |
| Guardrail violations | 0 | 0 |
Qwen actually identified the 5 leftover failures but decided they were out of scope and intentionally skipped them. Gemma just gave up after multiple retries.
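The scoring above is just set arithmetic over failing test IDs before and after the run. A minimal sketch of the idea (illustrative, not my harness's actual code):

```python
def score_run(baseline_failures: set[str], post_run_failures: set[str]) -> dict:
    """Score an agentic debugging run by diffing failing-test sets."""
    fixed = baseline_failures - post_run_failures        # were failing, now pass
    still_failing = baseline_failures & post_run_failures
    regressions = post_run_failures - baseline_failures  # newly broken tests
    return {
        "fixed": len(fixed),
        "still_failing": len(still_failing),
        "regressions": len(regressions),
        "net_score": len(fixed) - len(regressions),
        "post_run_total": len(post_run_failures),
    }
```

Plugging in Gemma's numbers (9 of the original 37 still failing plus 8 new breakages) reproduces its 28 fixed / net 20 / 17 post-run failures row.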

2. Token Usage

| Metric | Qwen3.6 | Gemma 4 | Ratio |
|---|---|---|---|
| Input tokens | 634,965 | 1,005,964 | Gemma 1.6x more |
| Output tokens | 39,476 | 89,750 | Gemma 2.3x more |
| Grand total (I+O) | 674,441 | 1,095,714 | Gemma 1.6x more |
| Cache read tokens | 4,241,502 | 3,530,520 | Qwen 1.2x more |
| Output/Input ratio | 1:16 | 1:11 | Gemma more verbose |
| Tokens per fix | ~21K | ~39K | Gemma 1.9x more expensive |
| Tokens per net score point | ~21K | ~55K | Gemma 2.6x more expensive |
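If you want to reproduce the derived rows, they're plain ratios over the raw counts. A quick sketch (the helper name is mine, purely illustrative):

```python
def token_efficiency(input_tok: int, output_tok: int, fixes: int, net: int) -> dict:
    """Derive cost-per-result metrics from raw token counts."""
    total = input_tok + output_tok
    return {
        "total": total,
        "out_in_ratio": output_tok / input_tok,   # verbosity indicator
        "tokens_per_fix": total / fixes,
        "tokens_per_net_point": total / net,      # regressions make this worse
    }
```

Note how regressions hit Gemma twice: they shrink the net score, so the same token spend buys fewer net points.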

3. Tool Calls

| Tool | Qwen3.6 | Gemma 4 |
|---|---|---|
| read | 46 | 39 |
| bash | 33 | 30 |
| edit | 14 | 13 |
| grep | 16 | 10 |
| todowrite | 4 | 3 |
| glob | 1 | 1 |
| write | 1 | 0 |
| Total | 115 | 96 |
| Successful | 115 (100%) | 96 (100%) |
| Failed | 0 | 0 |

| Derived Metric | Qwen3.6 | Gemma 4 |
|---|---|---|
| Unique files read | 18 | 27 |
| Unique files edited | 7 | 13 |
| Reads per unique file | 2.6 | 1.4 |
| Tool calls per minute | 2.3 | 1.1 |
| Edits per fix | 0.44 | 0.46 |
| Bash (pytest) runs | 33 | 30 |
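The derived rows are again simple ratios of the raw tool-call counts; a sketch of the arithmetic (illustrative helper, not from the harness):

```python
def tool_derived(calls: int, minutes: float, reads: int,
                 unique_read: int, edits: int, fixes: int) -> dict:
    """Derive tool-usage ratios from raw call counts and wall-clock minutes."""
    return {
        "reads_per_file": reads / unique_read,   # re-reading = focused context use
        "calls_per_minute": calls / minutes,     # overall agent tempo
        "edits_per_fix": edits / fixes,          # edit economy
    }
```

Interesting contrast: Qwen re-read a smaller set of files more times (2.6 reads/file over 18 files), while Gemma skimmed more files once each.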

4. Timing & Efficiency

| Metric | Qwen3.6 | Gemma 4 | Ratio |
|---|---|---|---|
| Wall clock | 2,950s (49m) | 5,129s (85m) | Gemma 1.74x slower |
| Total steps | 120 | 104 | |
| Avg step duration | 10.0s | 21.7s | Gemma 2.2x slower/step |

Key Observations:

Qwen 3.6 35B A3B dominates Gemma 4 26B for my use case and has become my new daily driver, striking the best balance of speed and performance.

On the flip side, here are a few pointers in Gemma's favour:

- The Qwen 3.5/3.6 series has been incredibly resilient to quantization, but I'm not sure whether Gemma is. A full-weight comparison could look drastically different.
- Gemma's ecosystem support is far less mature than Qwen's.
- Single-run variance could have impacted Gemma negatively. That said, I believe the evaluation criteria across the diverse categories of my harness do a decent job of mitigating it.

At the end of the day, this is just my personal test verdict.