Kimi-K2 0905, DeepSeek V3.1, Qwen3-Next-80B-A3B, Grok 4, and others on fresh SWE-bench–style tasks collected in August 2025

Posted by CuriousPlatypus1881 | LocalLLaMA

Hi all, I'm Anton from Nebius.

We’ve updated the SWE-rebench leaderboard with model evaluations of Grok 4, Kimi K2 Instruct 0905, DeepSeek-V3.1, and Qwen3-Next-80B-A3B-Instruct on 52 fresh tasks.

Key takeaways from this update:

All 52 new tasks collected in August are available on the site — you can explore every problem in detail.