Gemma 4 31B > Kimi K2.5 > Grok 4.20 on DuelLab's highest reasoning leaderboard
Posted by Goa_@reddit | LocalLLaMA | 6 comments
Gemma 4 31B: 53.9 score
Kimi K2.5: 50.5 score
Grok 4.20: 46.8 score
Funny to see the open Gemma 4 31B ahead of both.
Note: these scores are about writing competitive code...
Awkward-Boat1922@reddit
Anyone used this locally yet?
I've been putting off the download, but surely it isn't better than the 122B?
gamblingapocalypse@reddit
I’m using Gemma 4 31B right now, and for my use cases I actually prefer it over Qwen 3.5 122B.
What stands out to me is that Gemma 4 31B tends to get what I’m asking for right on the first try, whereas Qwen will often get there eventually but usually needs more steering.
As for prompt processing, I haven’t measured it yet, but they seem fairly similar to me, maybe with a slight edge to Qwen. In the decoding phase, Qwen also feels faster. But honestly, Gemma’s outputs have been good enough that I still prefer it overall.
Gemma can be pretty aggressive with RAM usage. It may start off lighter, but once you actually work with it for a while, you can watch the RAM disappear. My understanding is that this is tied to how Gemma handles KV cache. I’ve heard turbo quant is strongly recommended to help reduce KV-cache memory usage, but I haven’t implemented it yet. Even so, I’m convinced enough by Gemma 4 that it’s an optimization path I’m willing to pursue.
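KV-cache growth with context length is a general transformer property: the cache stores one key and one value vector per layer, per KV head, per token. A minimal sketch of estimating that footprint, using purely illustrative shape parameters (not Gemma 4's actual config), shows why quantizing the cache cuts RAM roughly in proportion to the bytes per element:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int) -> int:
    # Factor of 2: one key vector and one value vector per position.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Illustrative numbers only, not any real model's config:
fp16 = kv_cache_bytes(48, 8, 128, 32_768, 2)  # fp16/bf16 cache
int8 = kv_cache_bytes(48, 8, 128, 32_768, 1)  # 8-bit quantized cache

print(f"fp16 KV cache: {fp16 / 2**30:.1f} GiB")  # 6.0 GiB
print(f"int8 KV cache: {int8 / 2**30:.1f} GiB")  # 3.0 GiB
```

The cache scales linearly with context length, which matches the "RAM disappears the longer you work with it" behavior described above.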
Qwoctopussy@reddit
The 122B has only 10B active parameters, which degrades its ability to model interactions between distant tokens (this is what I've gathered, anyway). So although it has more knowledge, it has less intelligence.
Randomdotmath@reddit
I don't trust DuelLab's comparisons, since they rank GPT-5.4 Nano > GPT-5.4 Mini > Claude Sonnet 4.6. That makes no sense.
Goa_@reddit (OP)
It's a small sample, and what you see in the Overall leaderboard has its nuances. But 5.4 Mini and Nano are way more capable than anyone has yet realized...
Robert__Sinclair@reddit
I tested gemma-4 26B and gemma-4 31B. The first one can't solve a simple logic problem that isn't in any dataset; the second one solved it.
As a comparison: Gemini 1.5 Flash couldn't solve it, and Gemini 2.5 Flash solved it only one time out of four...
Gemini 3 (and 3.1) solve it too.
Too bad that the 31B is quite big.
Can't wait to see a 14B or even an 8B model solve it.