Opus 4.5 claims 1st place on fresh SWE-bench-like problems in October [SWE-rebench]

Posted by Fabulous_Pollution10@reddit | LocalLLaMA | 3 comments

Hey everyone,

We were excited about yesterday's release of Opus 4.5 and rushed to update the SWE-rebench leaderboard.

As generally expected, Opus 4.5 has claimed first place. Remarkably, it is much more cost-efficient than Opus 4, and only slightly more expensive per problem than Sonnet 4.5.

Check out the full leaderboard. Feel free to reach out if you'd like to see other models evaluated (Gemini 3 Pro is already on the way, of course).

LocalLLaMA-ModTeam@reddit

Rule 2 - Posts must be related to the topic of LLMs (preferably local).

Pristine-Woodpecker@reddit

I don't get this - that bench includes a bunch of local LLMs and compares them to the SOTA. It's extremely valuable.

voronaam@reddit

Looking at the dataset on Hugging Face:

SELECT count(*), patch LIKE '%.py%' AS is_python FROM _2025_10 GROUP BY is_python;

count_star()  is_python
109           true

100% of the test problems are written in a single programming language, and one with fairly divergent syntax at that: it is the only one of the top-5 languages that does not use C-style curly brackets, etc.

Something tells me this benchmark does not really matter... It covers a fairly small and isolated corner of Software Engineering.