SWE-rebench Leaderboard (March, April and May 2026): GPT-5.5, Opus 4.7, Cursor (Composer 2.5), Kimi K2.6 and More

Posted by CuriousPlatypus1881@reddit | LocalLLaMA | 13 comments

Hi all,

Sorry for going missing — we’ve been collecting a larger, higher-quality set of more complex tasks. We’re excited to share a major leaderboard update covering the past three months.

We’ve updated the SWE-rebench leaderboard with 110 fresh Python tasks from GitHub PRs created in March, April, and part of May.

The setup follows the standard SWE-bench format: models read real PR issues, edit code, run tests, and must make the full test suite pass.
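To make the pass criterion concrete, here is a minimal sketch of how "resolved" could be scored for one task, assuming the harness reports one pass/fail outcome per test. `TaskResult`, `is_resolved`, and `resolved_rate` are hypothetical names for illustration, not the actual SWE-rebench code.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one model attempt on one task (hypothetical structure)."""
    task_id: str
    test_outcomes: dict[str, bool]  # test name -> True if the test passed


def is_resolved(result: TaskResult) -> bool:
    # A task counts as resolved only if every test in the suite passes.
    # An empty suite is treated as unresolved so a broken run cannot score.
    return bool(result.test_outcomes) and all(result.test_outcomes.values())


def resolved_rate(results: list[TaskResult]) -> float:
    # Fraction of tasks resolved; 0.0 when there are no results.
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)
```

Under this assumption, a single failing test is enough to mark the whole task as unresolved.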

This time, instead of our usual monthly updates with a smaller number of tasks, we collected a larger batch so we could evaluate models on a broader task set. You can still select narrower task windows on the leaderboard if you want a more focused view.
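If you want to reproduce a narrower window offline, the idea is just a date filter before averaging. The sketch below assumes each task record carries the PR creation date and a resolved flag; the `Task` fields and function names are made up for illustration and are not the leaderboard's real schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Task:
    """One evaluated task (hypothetical record layout)."""
    task_id: str
    pr_created: date  # date the source GitHub PR was created
    resolved: bool    # whether the model's attempt passed


def in_window(tasks: list[Task], start: date, end: date) -> list[Task]:
    # Keep tasks whose PR was created within [start, end], inclusive.
    return [t for t in tasks if start <= t.pr_created <= end]


def window_resolved_rate(tasks: list[Task], start: date, end: date) -> Optional[float]:
    # Resolved rate over the selected window; None if the window is empty.
    selected = in_window(tasks, start, end)
    if not selected:
        return None
    return sum(t.resolved for t in selected) / len(selected)
```

Keep in mind that a smaller window means fewer tasks, so the resulting rate is noisier than the full-batch number.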

We’ll add more models over the next week, including Gemini Flash 3.5, DeepSeek v4 Pro, and Qwen3.5-397B-A17B, along with smaller models for local development. Going forward, we’ll continue updating models frequently, but over larger task batches than before. We’re also working on adding multilingual tasks to the leaderboard, plus a few more things we’ll share soon. Please send requests for models you want us to run!

Looking forward to your thoughts and feedback.

Join the leaderboard channel in our Discord to discuss models, share ideas, ask questions, or report issues:
https://discord.gg/V8FqXQ4CgU


Eyelbee@reddit

Why are Codex and Claude Code separate entries?

Dany0@reddit

Listen, first of all, thank you

Second of all, it's upsetting how much we lean on Python. We are inadvertently steering the LLMs towards a certain "local optimum" where Python shines but some truly important tasks get neglected

That said, I don't have much else to add right now