SWE-rebench Leaderboard (March, April and May 2026): GPT-5.5, Opus 4.7, Cursor (Composer 2.5), Kimi K2.6 and More
Posted by CuriousPlatypus1881@reddit | LocalLLaMA | 13 comments
Hi all,
Sorry for going missing — we’ve been collecting a larger, higher-quality set of more complex tasks. We’re excited to share a major leaderboard update covering the past three months.
We’ve updated the SWE-rebench leaderboard with 110 fresh Python tasks from GitHub PRs created in March, April, and part of May.
The setup follows the standard SWE-bench format: models read real PR issues, edit code, run tests, and must make the full test suite pass.
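A minimal sketch of that pass criterion, assuming each run yields a per-test pass/fail map. `TaskRun`, `is_resolved`, and `resolved_rate` are hypothetical names used for illustration, not the SWE-rebench harness itself:

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One model run on one task (hypothetical record layout)."""
    task_id: str
    test_results: dict[str, bool]  # test name -> did it pass?


def is_resolved(run: TaskRun) -> bool:
    # Resolved only if at least one test ran and every test passed.
    return bool(run.test_results) and all(run.test_results.values())


def resolved_rate(runs: list[TaskRun]) -> float:
    # Fraction of tasks resolved; 0.0 for an empty batch.
    if not runs:
        return 0.0
    return sum(is_resolved(r) for r in runs) / len(runs)
```

Under this assumption, a single failing test means the task is not resolved, no matter how many other tests pass.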
This time, instead of our usual monthly updates with a smaller number of tasks, we collected a larger batch so we could evaluate models on a broader task set. You can still select narrower task windows on the leaderboard if you want a more focused view.
We’ll add more models over the next week, including Gemini Flash 3.5, DeepSeek v4 Pro, Qwen3.5-397B-A17B, along with smaller models for local development. Going forward, we’ll continue updating models frequently, but over relatively larger task batches. We’re also working on adding multilingual tasks to the leaderboard, plus a few more things we’ll share soon. Please send requests for models you want us to run!
Looking forward to your thoughts and feedback.
Join the leaderboard channel in our Discord to discuss models, share ideas, ask questions, or report issues:
https://discord.gg/V8FqXQ4CgU
Eyelbee@reddit
Why are Codex and Claude Code separate entries?
Dany0@reddit
Listen, first of all, thank you
Second of all, it's upsetting how much we lean on Python. We are inadvertently steering the LLMs towards a certain "local optimum" where Python shines but some truly important tasks get neglected
That said, I don't have much else to add right now
soyalemujica@reddit
Happy to see 27B being just ~5% below Claude
jake_that_dude@reddit
best addition would be a fixed `tool_call_budget` / wall-clock column. for local models, pass rate without cost-to-fix is kinda incomplete because a 14B model that needs 4 retries is a totally different workflow than a 70B that lands `pass@1`.
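rough sketch of what that column could compute, assuming the harness logs tool calls and wall-clock time per attempt. `Attempt`, `pass_at_1`, and `cost_to_fix` are made-up names for illustration, not anything the leaderboard exposes today:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attempt:
    """One try at a task (hypothetical log record)."""
    passed: bool
    tool_calls: int
    wall_clock_s: float


def pass_at_1(tasks: dict[str, list[Attempt]]) -> float:
    # Fraction of tasks whose first attempt passed.
    if not tasks:
        return 0.0
    return sum(1 for a in tasks.values() if a and a[0].passed) / len(tasks)


def cost_to_fix(
    attempts: list[Attempt], tool_call_budget: int
) -> Optional[tuple[int, float]]:
    # Cumulative (tool calls, seconds) up to and including the first pass.
    # Returns None if the budget is exceeded first or no attempt passes.
    calls, secs = 0, 0.0
    for a in attempts:
        calls += a.tool_calls
        secs += a.wall_clock_s
        if calls > tool_call_budget:
            return None
        if a.passed:
            return calls, secs
    return None
```

with this, a model that passes on the 4th retry and one that passes first try both count as a fix, but the cost column shows how different the two workflows are.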
__Maximum__@reddit
Interesting, but not here
randombsname1@reddit
Disagree. Even more interesting here as this is probably one of the better benchmarks instead of the usual benchmaxxed b.s. that is on here.
If we took the benchmarks the Chinese models put out at face value, then you would think they had like 3 or 4 Opus 4.7/GPT 5.5-level models now.
nomorebuttsplz@reddit
my 512 gb m3u says otherwise
seamonn@reddit
Wrong Subreddit - We only care about Gemma 4:31b, Qwen 3.6:27b or Qwen 3.5:122b.
Either test local models or GTFO /r/LocalLLaMA
__JockY__@reddit
I only care about 3 models, therefore everybody else should feel the same.
kzoltan@reddit
Or any other open weight model?
Having the closed models on the (top of the) list just helps us return to reality. So it is just fine like that.
DeProgrammer99@reddit
They tested Kimi K2.6 and GLM 4.7, so no foul!
Altruistic_Heat_9531@reddit
thanks
doesnt_matter_9128@reddit
Yooo was waiting for this