Confirmed: SWE Bench is now a benchmaxxed benchmark
Posted by rm-rf-rm@reddit | LocalLLaMA | View on Reddit | 99 comments
Mashic@reddit
Goodhart's law: “When a measure becomes a target, it ceases to be a good measure.”
ScoreUnique@reddit
Why does it sound like I live in it? Ah, capitalism.
ScoreUnique@reddit
Let me explain: when GDP started determining a country's standing, we started racing for better GDP. Countries trying to benchmax GDP lmao
ThisWillPass@reddit
That's why we could all be heading toward becoming paperclips.
soshulmedia@reddit
It is the meta-pattern of our time. Widespread false beliefs that the map is the territory in pretty much all fields, institutions, and so forth.
Personally, I am wondering whether "not to eat the fruit of the tree of wisdom" is EXACTLY a warning against this.
ThisWillPass@reddit
Lol, never thought of it like that, interesting.
TheRealMasonMac@reddit
I assume you're referring to how it was originally in reference to UK monetary policy?
Borkato@reddit
I’m so stupid, I’ve never understood this. I thought they were the point. We WANT things that can score well on it. But I guess it’s saying it stops being a good measure because people cut corners to achieve it? I don’t know
Murgatroyd314@reddit
There are things that actually matter, and there are things that are easy to measure. They are rarely the same. Often, something that actually matters is connected to something that is easy to measure. But when people start trying to maximize the thing that is easy to measure, they break the connection, so their score improves, but the thing that matters doesn't.
Mashic@reddit
Let's say you want to teach a foreign language like German, and you use the same exam for everyone each time to measure mastery. Someone can go read the exam questions, memorize the answers, and excel at it without actually learning German.
When the person's target became to excel at the exam instead of learning German, the test stopped being good at measuring its original intent, German language mastery.
Hans-Wermhatt@reddit
I don't think we should go too overboard with this. Reddit communities go overboard with this concept of bench-maxxing. It's as if your German test had 128 long-answer or essay-type questions and you wanted to memorize answers. And there were also a lot of other exams you had to take in German.
Memorizing is not an effective strategy. What is effective is mastering German and then using the test answers as bonus study material to boost your scores. But you absolutely could not ace the tests without a deep knowledge of German.
Mashic@reddit
I think you're framing this as an all-or-nothing dichotomy, which is not the real issue we're discussing. The issue is that LLM makers may overtrain their models on the tests in these benchmarks, so the tests end up in the model's knowledge base, and then it scores higher on those tests. They might ignore, or never train on, other types of issues. The point being made here is that the benchmark stops being a good metric for evaluating how good the model is. The model itself might be good or not.
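To make the contamination point concrete, here is a toy sketch of the crudest possible check: what fraction of a benchmark's questions appear verbatim in the training corpus. All data here is illustrative, and real decontamination pipelines use n-gram or embedding overlap rather than exact substring matching.

```python
def contamination_rate(benchmark_questions, training_docs):
    """Crude contamination check: the fraction of benchmark questions
    that appear verbatim somewhere in the training corpus. Real
    pipelines use n-gram or fuzzy overlap instead of exact substrings."""
    corpus = "\n".join(training_docs)
    hits = sum(1 for q in benchmark_questions if q in corpus)
    return hits / len(benchmark_questions)

# Toy usage: one of the two questions leaked into the training data.
rate = contamination_rate(
    ["What does 2+2 equal?", "Implement quicksort in Python."],
    ["...blog post... What does 2+2 equal? ...", "...unrelated docs..."],
)
print(rate)  # 0.5
```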
Borkato@reddit
I see, so it’s really not as profound as it sounds! I kept trying to read more into it, but it’s literally just “if you’re willing to do it by any means necessary, the results are no longer meaningful”
KaMaFour@reddit
I don't think that's all there is to it
See one of the examples from wikipedia:
Borkato@reddit
Isn’t that more of a backfiring effect? Like the snake catcher story or whatever where they end up breeding them to make money
iamapizza@reddit
I often remind people at work about this, right as they're looking to implement some new braindead metric to help justify their existence. Of course they do it anyway.
rpkarma@reddit
Companies force us to though :(
philmarcracken@reddit
I'm always fighting the upper floors on that. The other one is saying "there's nothing more permanent than a temporary measure".
Cute_Obligation2944@reddit
Accountability sink. If it's policy on paper in black and white, no individual can be held responsible for the consequences.
wektor420@reddit
And while he was referring to human psychology, it works for models too; it is important to have separate train, validation, and test sets.
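For models, that separation is the standard train/validation/test discipline. A minimal sketch with scikit-learn, using toy stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data; in practice these would be your real examples/labels.
X = np.random.rand(1000, 8)
y = np.random.randint(0, 2, size=1000)

# Hold out a final test set that is never used for training or tuning.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Split the remainder into train and validation; tune on validation only.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0  # 0.25 * 0.8 = 20% overall
)
```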
iamapizza@reddit
Goodmaxxing
alphatrad@reddit
The whole problem with all of these, and even SWE Pro doesn't solve it, is this: "demonstrate a fail-to-pass transition for new tests".
You want to know why AI SLOP exists? Because all these benchmarks test for is whether the test went from red to green.
They don't care if it took 30 tries, if it refactored all the code, ignored the scope of the project, wrote other shit it didn't need to, or created additional bugs.
Just: did the test pass without a fail.
This is why we have these models scoring so high, and then us devs use them in the real world and get mad at them. They write fucking slop.
Hot_Turnip_3309@reddit
Confirmed, my dad works at nintendo
Velocita84@reddit
The final destination for any public benchmark, unfortunately
Deep90@reddit
Benchmarks need to be seeded.
Have a public seed so that people can independently verify that a certain LLM can score what it says it does.
Then have a private seed (or even a few private seeds) that only the benchmarking website knows. If results drop, then you know a model was overfit to score well. Multiple private seeds help you know it isn't a fluke.
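A minimal sketch of what that could look like, assuming benchmark items can be generated deterministically from a seed; the toy arithmetic tasks and the `model` callable (any function mapping a question string to an answer) are placeholders for real ones.

```python
import random

def make_benchmark(seed: str, n: int = 50):
    """Deterministically generate toy arithmetic questions from a seed.
    A real benchmark would generate harder, more varied tasks."""
    rng = random.Random(seed)
    return [(f"{a} + {b}", a + b)
            for a, b in ((rng.randint(0, 999), rng.randint(0, 999))
                         for _ in range(n))]

def score(model, items):
    """`model` is any callable mapping a question string to an answer."""
    return sum(model(q) == ans for q, ans in items) / len(items)

def check_overfit(model, public_seed, private_seeds, tolerance=0.05):
    """Flag a model whose private-seed scores fall well below its
    public-seed score: a sign it was tuned to the public instances."""
    public = score(model, make_benchmark(public_seed))
    private = [score(model, make_benchmark(s)) for s in private_seeds]
    return public, private, any(public - p > tolerance for p in private)
```

The public seed lets anyone reproduce the headline number; the private seeds, known only to the benchmark operator, are what catch the overfit.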
IrisColt@reddit
I tweak the parameters to adjust the math difficulty in my quick-and-dirty personal benchmarks. But sometimes the LLMs refuse to even start a task if they realize it's going to require a ton of computation, heh... By the way, just changing the random seed doesn't really help, because some models generalize perfectly to whatever values you throw at them.
pm_me_github_repos@reddit
Funny enough, post-training can be applied to any signal, including private scoring. So one could still hill climb on performance against a private dataset, as long as you can get a score back.
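A toy illustration of the point: any evaluation that returns a scalar, no matter how private its questions are, can be climbed as a black box. The `perturb` and `score` callables are placeholders for whatever knobs and signal you actually have.

```python
import random

def hill_climb(start, perturb, score, steps=1000):
    """Generic black-box hill climbing: `score` could be a private
    benchmark that only ever returns a number."""
    best, best_score = start, score(start)
    for _ in range(steps):
        trial = perturb(best)
        s = score(trial)
        if s > best_score:
            best, best_score = trial, s
    return best, best_score

# Toy demo: "tuning" a vector toward a hidden target using only scores.
target = [random.random() for _ in range(8)]
result, final_score = hill_climb(
    [0.0] * 8,
    perturb=lambda v: [x + random.gauss(0, 0.1) for x in v],
    score=lambda v: -sum((a - b) ** 2 for a, b in zip(v, target)),
)
```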
Calm_Bit_throwaway@reddit
Is this really going to solve the issue at all? My impression is that the problem stems from overtraining on a particular task. The labs are running the benchmarks and genuinely producing the numbers they report; they aren't just optimizing for a particular seed. The trouble is that their training process optimizes the score in a way that doesn't generalize.
ShengrenR@reddit
It's challenging to make something "seed"-based when a lot of these things are basically just Q/A exams. For something like the food truck benchmark you get all sorts of random generations at the decision points, so it makes sense, but a lot of the SWE ones are just "solve this single problem" repeated without much variation. And for some of those, introducing sufficient variation makes the "answer key" pretty hard to write without also making it easy to benchmax against. The public/private split does a lot to help, but then you need the private group running the evaluations all the time.
iperson4213@reddit
A private bench isn't sufficient; the data needs to be sourced privately as well.
For example, SWEBenchPro tasks are to implement things in code bases, but it was partially sourced from open-source code bases that already contain the implementations, so the solutions can be trained on even if the questions are private.
ThirdWaveCat@reddit
The CodeClash benchmark is a neat idea that shows how much better humans currently are at both the final product and the journey. There are many turn-based games, a few existing companies in the space, and academic competitions.
Carefully scrutinized benchmarks are the end-game for most academic contests, in fields like databases and protein structure prediction.
lendo93@reddit
Keeping details private helps, but it's not essential. You just need a smarter benchmark design that doesn't have knowable correct answers, and yet is objectively scored. Easier said than done, but we created complex multiplayer environments for exactly this scenario, and the benchmark results have really come out nicely as we've scaled up: https://gertlabs.com
Shingikai@reddit
swe-rebench.com solves contamination. It doesn't solve the more fundamental problem, which is that even a perfectly secure, constantly refreshed SWE-bench tells you how a model performs on curated, self-contained coding tasks from public repos, not on your codebase, with your conventions, your tech debt, your ambiguous requirements, and your context spread across three Jira tickets and a Slack thread from six months ago.
The benchmaxxing is a symptom. The actual gaps between test scores and real performance go deeper: task selection bias (problems that can be cleanly specified and verified), scaffolding effects (agent harnesses optimized for the benchmark format), and domain mismatch (open source public code vs. whatever you're actually building).
SWE-bench Pro probably buys another cycle before the same thing happens. The harder fix is accepting that no public benchmark survives competitive evaluation pressure for long, and building your own internal evals for your actual use case is the only thing that tells you which model is actually useful for your problem.
RoadFew6394@reddit
Is there even a reliable benchmark left anymore to measure the intelligence of these LLMs?
BriefImplement9843@reddit
simplebench and lmarena
zball_@reddit
MRCR v2 still very reliable.
Express_Quail_1493@reddit
I just built my own private benchmark, and I advise everyone to do the same. It won't work if it's sitting in a public git repo or shared on Reddit. But I would like us all to come together, build our benchmarks based on what we actually use the models for, and share the model performances. I'm suspicious some people on these benchmarking teams are getting paid to lie too. LMAO, the AI race is BRUTAL. But right now my private bench is my source of truth; it keeps me from getting hijacked by all the flashy titles and news headlines.
suicidaleggroll@reddit
While I'm all for open source, benchmarks really need to be closed in order to remain effective. As soon as a benchmark is made public, it gets trained on, and ceases to be useful.
MrMisterShin@reddit
The problem is these AI companies capture the user prompts. If you run any benchmark, they will know all the questions their model was given in the session. They can then work on benchmaxxing from there.
Obviously this mostly applies to API or chat interfaces rather than completely local or offline use.
Lechowski@reddit
That's not how private benchmarks work, and it is a trivial problem to solve.
The ARC-AGI test, for example, does not run on the servers of the testee but on those of the tester. The testee (e.g., OpenAI) provides an instance running the model on a third party (e.g., Azure Machine Learning) that is isolated from the rest of the world, with no internet access. The prompt is sent into that sandbox for inference, and the answer is kept in the sandbox too. The sandbox is destroyed after the results are calculated.
This is standard practice for tenant-isolated environments, for example in gov clouds.
keepthepace@reddit
Good luck getting new research benchmarks that treatment.
Good luck spotting someone cheating on that.
Lechowski@reddit
How could they cheat? The sandbox is owned and controlled by the tester. They have full control over the whole network. Just deny everything except the incoming prompt.
keepthepace@reddit
You think OpenAI or Anthropic will let any lab control a machine with their weights on board?
Lechowski@reddit
Any lab, no. Azure Machine Learning, yes. That's literally part of their agreement with Microsoft, and you can do it yourself: rent your own A100 cluster, run the weights there, and cut off all networking from that cluster.
They don't have to share the weights with anyone other than an already-approved third party like Microsoft.
BihariBabua@reddit
So you're saying Anthropic would trust Microsoft with their Claude weights?
poginmydog@reddit
Bruh, they're literally investors in Anthropic. Yes, they may not trust them fully, but NDAs and lawsuits should keep everyone in check. And Microsoft can literally afford to set up a physical clean room where execs from both sides come in to inspect the setup. Microsoft sets up the clean room; Anthropic comes in with an SSD and plugs it in. All kept within a few container racks that can be physically inspected.
BihariBabua@reddit
Does it also test deepseek and the likes?
poginmydog@reddit
Maybe, yeah. These tests can literally affect stock prices, so a pure testing clean room is definitely worth the investment. It probably won't even take more than a full server rack if the model is under 1T parameters, and it takes an engineer half a day's work to build the system. For transparency, they could even ask Anthropic to bring in their own hardware and have a third-party auditor view the results.
Fun fact: China inspected the Windows XP source code in a clean room as well.
Lechowski@reddit
No. I talked about OpenAI and Microsoft because Microsoft bought 49% of OpenAI in an agreement that included commercial use of their models until 2027. OpenAI is therefore obliged to give the weights of their models to Microsoft, which is why you can host them on AML. ARC-AGI testers can use this to do isolated testing on OpenAI models.
Similar agreements can be made with other providers. Anthropic would definitely give an isolated environment to ARC-AGI testers, because scoring high there is amazing marketing.
My point is that this is a trivial problem with a trivial solution. Having self-destroying sandbox environments for confidential compute tasks is as old as computing itself.
Dos-Commas@reddit
You are not supposed to, at least with corporate accounts, because otherwise no business would use a service that logs all of their corporate data.
suicidaleggroll@reddit
lol
Corporations use centralized SaaS systems that harvest all of their data constantly. Microsoft being a prime example.
MrMisterShin@reddit
They weren't supposed to use copyrighted data or illegally web-scrape against ToS either. But some did, and some paid fines for it. Some even used torrented data / pirated books.
Scared_Bedroom_8367@reddit
How would they know? There would be billions of prompts
MrMisterShin@reddit
They can log into their own GPT or Claude account and run the benchmarks. They will have all the questions, not necessarily all the answers tho.
henfiber@reddit
They cannot do that for the private part, though. They can only try to search in logs from accounts they suspect may be running the private benchmarks.
MrMisterShin@reddit
No, I mean the companies themselves (e.g., OpenAI), not you or I. OpenAI runs their WIP or new models on many benchmarks before they release them. Whether they run them against public or private benchmarks doesn't matter; it's data that can and will be captured.
A perfect example: OpenAI recently released GPT-5.5 and listed private benchmarks on the model card... Expert-SWE (Internal) and Investment Banking Modelling Tasks (Internal)... You won't find a score for Claude or Gemini on those benchmarks; they don't want that data leaking out to competitors, and it would be embarrassing if those models scored higher.
On the other hand, it also makes it impossible to validate these new OpenAI benchmarks outside their ecosystem. OpenAI basically said "Trust me bro!" with those benchmarks.
henfiber@reddit
But OP was referring to closed benchmarks that are not available to AI companies, to avoid benchmaxxing. AI companies may have access to (or even create their own) private benchmarks, but that's not what we're discussing here.
If we were talking about benchmarks they already had access to, then your comment about logging in with their own GPT or Claude account wouldn't make sense either: to run the benchmarks as you describe, they would already have the questions.
Reddit_User_Original@reddit
How about open source, but updated every quarter with different challenges
Mashic@reddit
Or have an agency with its own closed-source benchmarks; every time a new model pops up, they benchmark it on their own and publish only the results.
Former-Ad-5757@reddit
It is a billion-dollar market; paying a million for a preview test on a certain "special" API endpoint is chump change, and then the final model gets a 100%.
Thick-Protection-458@reddit
So it means we can't compare experiments from different times. Great.
akavel@reddit
Well, that's exactly what SWE-rebench is doing! And monthly rather than quarterly.
FoxiPanda@reddit
I've pondered if we could go one step further and take a play out of the gaming industry's playbook: procedural generation of benchmark prompts. I haven't thought through all the ramifications yet, but it doesn't seem entirely impossible to do.
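A toy sketch of the idea, assuming tasks whose answers can be computed programmatically, so that no fixed answer key ever exists to train against:

```python
import random

def generate_task(rng: random.Random):
    """Procedurally generate a small task with a programmatically known
    answer. A real benchmark would vary repo structure, APIs, and bugs."""
    xs = [rng.randint(-50, 50) for _ in range(rng.randint(5, 15))]
    k = rng.randint(1, 5)
    prompt = (f"Return the sum of every {k}-th element (1-indexed) "
              f"of {xs}.")
    expected = sum(xs[k - 1::k])
    return prompt, expected

rng = random.Random()                             # fresh entropy each run
tasks = [generate_task(rng) for _ in range(100)]  # a new exam every time
```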
LegitimateCopy7@reddit
But if it's closed, then it becomes a "trust me bro" benchmark. This is why I keep saying benchmarks are meaningless when it comes to LLMs.
Just organize your own set of tests for the specific tasks you require. If the model can perform the task with acceptable performance and cost, use it. Stop wasting time min-maxing a nondeterministic tool.
Foreign_Risk_2031@reddit
Just make it academic again
noctrex@reddit
That's why https://swe-rebench.com exists. It constantly refreshes the problems with every round.
Former-Ad-5757@reddit
Refreshed problems mean it's more difficult to compare, because the questions were different.
jubilantcoffin@reddit
They are not, the models get the same set.
BlipOnNobodysRadar@reddit
Not if they're all re-benched on the same questions and scores are only compared within each set.
keepthepace@reddit
Beats the "vibe test", which is the only good alternative.
Technical-Earth-3254@reddit
When LLMs start scoring over 60% on a benchmark, it needs to be updated.
Wonderful_Second5322@reddit
Lmao
Downtown-Art2865@reddit
typical benchmark lifecycle: gets popular → labs train on it → benchmark dies → new benchmark → repeat
we're basically running natural selection on benchmark resistance at this point
Independent-Date393@reddit
"Just organize your own evals for the tasks you actually care about" is always where this ends up. Every public leaderboard eventually becomes a race to train on its vibes.
Tagedieb@reddit
Easier said than done, especially if the tasks you cared about yesterday are not the tasks you care about tomorrow.
Independent-Date393@reddit
Goodhart's law eating another one. MMLU is next in line.
rm-rf-rm@reddit (OP)
I thought MMLU has been done for a while now
Independent-Date393@reddit
OpenAI retiring a benchmark they were ranked #1 on and citing contamination concerns is going to be one of the more self-aware moves they've made. the timing — right as everyone else caught up — is noted.
hsoj95@reddit
It seems like there are two options for helping stop this from happening. First, benchmarks probably need to be more... abstract? I.e., define the core idea of what's being tested abstractly, then test it with different (and unique) prompts and data that fall within that abstract idea. Make it so you can't just train on the specifics of the benchmark as a target; you have to account for shifting data and prompts within that abstract idea. Yes, it means it's not a hard-coded benchmark, and a few runs of it could fail horribly, but given enough testing a pattern should emerge that shows what the performance is actually like. (Note: I'm hardly an expert in this, and could very well be in over my head in making this suggestion. Feel free to roast me if so... x3)
Second, I think the best benchmarking indicators should actually come from pitting models against other models. I'm quite fond of Arena-style benchmarks, as they seem to be a more organic way of judging a model's true performance. Honestly, if there were a way to mass-run models against each other with an automated check of which did better (avoiding potential human bias in the results), you could get some really good data across different testing categories; see the sketch after this comment. Combine that with the first option above and you'd have the potential for a great testing pattern. (Ironically, this is basically going back to a GAN-style way of testing... a GAN of LLMs. There should probably be an axiom named for this phenomenon x3)
Like I said, I may be in way over my head with these suggestions, but it's just two that came to mind for me regarding ways to combat training models to benchmax scores.
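Arena-style rankings usually come down to pairwise matches aggregated with a rating system like Elo; a minimal sketch of the standard update rule:

```python
def elo_update(rating_a, rating_b, result, k=32.0):
    """Standard Elo update. `result` is 1.0 if model A won the
    head-to-head comparison, 0.0 if it lost, 0.5 for a tie (the win
    judgment could come from an automated checker, as suggested above)."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (result - expected_a)
    new_b = rating_b + k * ((1.0 - result) - (1.0 - expected_a))
    return new_a, new_b

print(elo_update(1500, 1500, 1.0))  # model A beats a peer: (1516.0, 1484.0)
```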
Practical_Low29@reddit
The Scale Labs leaderboard comparison is actually pretty telling. When you look at the delta between public and private scores on swe-bench-pro, some models drop 15+ points. That gap alone tells you more about benchmark gaming than any official statement does.
Pleasant-Shallot-707@reddit
What was confirmed was that SWE-bench Verified is benchmaxxed. They're recommending SWE-bench Pro.
Tight-Requirement-15@reddit
Hasn't this been clear for months?
rm-rf-rm@reddit (OP)
Yes, to you and me. But the fact that benchmark screenshots still get posted and upvoted like crazy for every new model release on this sub shows that the masses still don't understand.
Western_Objective209@reddit
Lol, like most benchmarks, they hadn't even taken the time to read their own questions until now. Absolute joke.
AvidCyclist250@reddit
So we need an independent testing body, preferably with some kind of oversight.
quarkral@reddit
After recommending everyone to use SWE-bench Pro, OpenAI's actual GPT 5.5 announcement uses Expert-SWE (Internal)
rm-rf-rm@reddit (OP)
They do give the SWE-bench Pro score in the article, but yes, it's not included in the table. Good catch.
spawncampinitiated@reddit
Pretends to be shocked
Sagyam@reddit
These benchmarks need to be run inside air-gapped virtual machines operated by a trusted vendor like AWS or Azure.
The benchmark creator should be responsible for setting up all the tooling necessary to evaluate model performance inside the machine.
The actual questions should always remain secret. Once the benchmark is done, only the file containing the results should leave the machine.
Everything else, like model weights, questions, evaluation rubric, and model responses, should be wiped before the air gap is released. Neither the benchmark creator nor the model creator should be allowed to see anything other than the final score.
Thomas-Lore@reddit
The private ones are.
_BreakingGood_@reddit
OpenAI started saying this as soon as they stopped being capable of beating Opus, it was pretty comical timing
Pyros-SD-Models@reddit
If, in a decontaminated benchmark like SWE-ReBench, my 6-month-old medium model is on par with Opus 4.6, but in SWE-bench the same Opus leads by 15%, then yes, that looks like pretty comical benchmaxxing by Anthropic. And a good opportunity to say something, imho.
https://swe-rebench.com/
_BreakingGood_@reddit
My point isn't to say that it's not benchmaxxed.
My point is that OpenAI only has an issue with this one benchmaxxed benchmark that they can't win on. They've got no problems with any of the others.
FuckSides@reddit
An important caveat here is that SWE-bench Verified is their own benchmark, which they have a responsibility to maintain and keep the industry updated on, so it would be expected that they report on it and not on other random benchmarks.
randombsname1@reddit
Anthropic in general has punched ABOVE its weight on benchmarks, at least from what I've observed since the Claude 3 models a few years back.
I DO think SWE-ReBench is probably the most benchmaxx-resistant, but that one example doesn't really show anything when, if you check the last data set, you can see that GPT 5.2 medium also beat GPT 5.4 medium.
So.....?
spencer_kw@reddit
The only benchmark that matters is your own codebase. Run the same refactoring task on 3 models and compare the diffs. It takes 20 minutes and tells you more than any leaderboard.
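A rough sketch of that workflow against an OpenAI-compatible endpoint; the base URL, model names, and prompt file are placeholders for whatever gateway or local server you actually run:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

TASK = open("refactor_task.md").read()      # your real refactoring prompt
MODELS = ["model-a", "model-b", "model-c"]  # placeholder model names

for name in MODELS:
    resp = client.chat.completions.create(
        model=name,
        messages=[{"role": "user", "content": TASK}],
    )
    # Save each model's proposed change so the outputs can be diffed.
    with open(f"out_{name}.txt", "w") as f:
        f.write(resp.choices[0].message.content)
# Then compare, e.g.: diff out_model-a.txt out_model-b.txt
```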
kiwibonga@reddit
Still, good enough to verify that a local model is adequate for professional use and that no one needs to pay OpenAI or Anthropic hundreds of dollars for anything ever.
Exciting_Garden2535@reddit
This is month-old news and has already been discussed. In the linked article, OpenAI explained why they switched to SWE-bench Pro. Some folks believed that; others did not, and said they did it to avoid being compared with Opus. Either way, other companies, including Anthropic, now use SWE-bench Pro instead of SWE-bench Verified.