Kimi-K2 0905, DeepSeek V3.1, Qwen3-Next-80B-A3B, Grok 4, and others on fresh SWE-bench–style tasks collected in August 2025
Posted by CuriousPlatypus1881@reddit | LocalLLaMA | View on Reddit | 32 comments
Hi all, I'm Anton from Nebius.
We’ve updated the SWE-rebench leaderboard with model evaluations of Grok 4, Kimi K2 Instruct 0905, DeepSeek-V3.1, and Qwen3-Next-80B-A3B-Instruct on 52 fresh tasks.
Key takeaways from this update:
- Kimi K2 0905 has improved significantly (resolved rate up from 34.6% to 42.3%) and is now in the top 3 open-source models.
- DeepSeek V3.1 also improved, though less dramatically. What’s interesting is how many more tokens it now produces.
- Qwen3-Next-80B-A3B-Instruct, despite not being trained directly for coding, performs on par with Qwen3-Coder-30B. To reflect model speed, we’re also thinking about how best to report efficiency metrics such as tokens/sec on the leaderboard.
- Finally, Grok 4: the frontier model from xAI has now entered the leaderboard and is among the top performers. It’ll be fascinating to watch how it develops.
All 52 new tasks collected in August are available on the site — you can explore every problem in detail.
z_3454_pfk@reddit
glm 4.5 is punching way above its weight
wolttam@reddit
I use it exclusively for coding, very cost effective
paryska99@reddit
Especially with their coding subscription API access. The website still has some things missing or needing fixes, but they are probably working on it.
dwiedenau2@reddit
Gemini 2.5 Pro below Qwen Coder 30B does not make any sense. Can you explain why 2.5 Pro was so bad in your benchmark?
z_3454_pfk@reddit
2.5 Pro has been nerfed for ages, just check openrouter or even the gemini dev forums
dwiedenau2@reddit
Yes of course, it is much worse than earlier, but not worse than qwen 30b lmao
lumos675@reddit
I am using qwen coder 30b almost every day and I can tell you it solves 70 to 80 percent of my coding needs. It's really not that weak a model. Did you even try it?
dwiedenau2@reddit
Yes, it was the first coding model I was able to run locally that was actually usable; it's a great model. But not even CLOSE to 2.5 pro lol
Amgadoz@reddit
qwen3 coder at bf16 is better than 2.5 pro at q2 probably
balianone@reddit
quantized https://www.reddit.com/r/Bard/comments/1mwd67o/google_has_possibly_admitted_to_quantizing_gemini/
dwiedenau2@reddit
It is not worse than qwen 30b lmao, even after all the quantizing and cost reductions they have done hahah
CuriousPlatypus1881@reddit (OP)
Good question — and you’re right, at first glance it might look surprising. One possible explanation is that Gemini 2.5 Pro uses hidden reasoning traces. In our setup, models that don’t expose intermediate reasoning tend to generate fewer explicit thoughts in their trajectories, which makes them less effective at solving problems in this benchmark. That could explain why it scores below Qwen3-30B here, even though it’s a very strong model overall.
We’re also starting to explore new approaches — for example, some providers now offer APIs (like the Responses API) that let you reference previous responses by ID, so the provider can use the hidden reasoning trace on their side. But this is still early research in our setup.
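To make that concrete, here's a minimal sketch of the pattern, assuming the OpenAI Python SDK's Responses API as an example provider; the model name and prompts are placeholders, not the actual SWE-rebench harness:

```python
from openai import OpenAI

client = OpenAI()

# First turn: a reasoning model may produce hidden reasoning
# that never appears in the returned output.
first = client.responses.create(
    model="o3",  # placeholder reasoning model, not our eval setup
    input="Read the failing test and propose a fix.",
)

# Follow-up turn: passing previous_response_id lets the provider
# reuse the hidden reasoning trace from the first turn on their side.
followup = client.responses.create(
    model="o3",
    previous_response_id=first.id,
    input="Now apply the fix and summarize the diff.",
)

print(followup.output_text)
```

The upside is that hidden chains of thought can still inform later turns even though the client never sees them; the trade-off is that trajectories become provider-stateful rather than self-contained.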
jonydevidson@reddit
Real winner here seems to be GPT-5 Mini.
nuclearbananana@reddit
Grok code fast too, it's crazy cheap
jonydevidson@reddit
Don't feel like funding Nazis, thank you.
russianguy@reddit
What about devstral?
CaptBrick@reddit
Came here to ask this too
FullOf_Bad_Ideas@reddit
Thanks and I hope you'll be posting this regularly until it's all saturated.
It's interesting how GPT 5 High uses fewer tokens per task than Claude Sonnet 4.
itsmeknt@reddit
What is the reasoning effort for GPT OSS 120b?
And can you add GPT OSS 20B (high reasoning) as well? It did really well in the aider leaderboard for a 20b model once the prompt template was fixed.
Mochila-Mochila@reddit
Nice, thanks for this update 👍
Great to see open source being competitive against top closed source models.
PsecretPseudonym@reddit
I’d love to see Opus 4.1 and gpt-5-codex on this.
abskvrm@reddit
GPT OSS 120B
jonas-reddit@reddit
It’s there. Look at the right side.
Farther_father@reddit
What the hell is going on with Gemini 2.5 Pro scoring below Qwen3-Coder-30B-A3B?
CuriousPlatypus1881@reddit (OP)
Really appreciate the support! Great point on confidence intervals — we already show the Standard Error of the Mean (SEM) on the leaderboard, and since the sample size is just the number of problems in the time window, you can compute CIs directly from that. Regarding Gemini 2.5 Pro vs Qwen3-Coder-30B-A3B-Instruct, their scores are so close that the confidence intervals overlap, meaning the small ranking difference is likely just statistical noise.
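For anyone who wants to sanity-check this, here's a minimal sketch of the CI calculation, assuming the standard normal approximation for a binomial proportion and using Kimi K2 0905's 42.3% resolved rate on the 52 August tasks as example numbers:

```python
import math

n = 52      # tasks in the August window
p = 0.423   # example resolved rate (Kimi K2 0905 in this update)

sem = math.sqrt(p * (1 - p) / n)   # standard error of a binomial proportion
half_width = 1.96 * sem            # 95% confidence interval half-width

print(f"SEM = {sem:.3f}")                                        # ~0.069
print(f"95% CI = [{p - half_width:.3f}, {p + half_width:.3f}]")  # ~[0.289, 0.557]
```

With only 52 tasks the interval spans roughly ±13 points, which is why two closely ranked models can easily swap places between updates.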
Farther_father@reddit
Thanks for the reply! I was too lazy to bring out the ol’ calculator, but you’re right, it can of course be calculated from the number of items and the proportion of correct responses.
Mkengine@reddit
Could you explain what the CI and error bars respectively tell me? I don't understand it.
Farther_father@reddit
The author/OP can probably better answer this, but as I understand it: the error bars show the standard error of the mean (SEM), i.e. how much a model's score would wobble if you re-sampled the tasks. A confidence interval turns that into a range that likely contains the model's true resolved rate (about 95% of the time for a 95% CI). If two models' intervals overlap a lot, the difference in their ranking may just be noise.
AK_3D@reddit
Do you have Opus in the benchmark lists? I don't see it (and I know several people use it for coding).
j_osb@reddit
Very, very impressed by Kimi K2!
Simple_Split5074@reddit
I'd love to see a reasoning version, that ought to be spectacular
Only_Situation_4713@reddit
We need a code version of Qwen Next. It yearns for the codebase.