ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) | TheaterFire

Posted by klieret@reddit | LocalLLaMA | 100 comments

There have been quite a few case studies recently on agents building whole programs from scratch, but most of them test a single project or just a few, with hand-tuned setups.

We've spent the last couple of months formalizing this setting and building a benchmark of 200 tasks while doubling down on testing, cheat prevention, and task diversity.

Our agent ONLY gets a target executable and some readme/usage files. The agent must choose a language, design abstraction layers, and architect the entire program. No internet access or any other way of cheating. No decompilation.

We've also spent some 50k to generate 6M lines of behavioral tests and then filtered them down to keep the best ones. Because the tests treat executables purely as a black box, we make no assumptions about the implementation, not even about the language the LM uses.
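To make the black-box idea concrete, here is a minimal Python sketch of a behavioral test. It is not ProgramBench's actual harness: the names `Observation`, `observe`, and `same_behavior` are hypothetical, and comparing only exit code and stdout is an assumption made for illustration. The point is that the check never looks at source code, so the candidate can be written in any language.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Everything this black-box test can see from one run (hypothetical type)."""
    exit_code: int
    stdout: bytes


def observe(command: list[str], stdin: bytes = b"", timeout: float = 10.0) -> Observation:
    """Run a command and record only its externally visible behavior."""
    result = subprocess.run(command, input=stdin, capture_output=True, timeout=timeout)
    return Observation(result.returncode, result.stdout)


def same_behavior(reference: list[str], candidate: list[str],
                  args: list[str], stdin: bytes = b"") -> bool:
    """True if both executables give the same exit code and stdout for one input."""
    return observe(reference + args, stdin) == observe(candidate + args, stdin)


# Two stand-in "executables" that upper-case stdin, implemented differently.
reference = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
candidate = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper(), end='')"]
print(same_behavior(reference, candidate, args=[], stdin=b"hello\n"))  # prints True
```

In a real setup the two command lists would point at the reference binary and the rebuilt one, and each generated test would supply its own arguments and input.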

All of the results are at programbench.com . There's also a big FAQ at the bottom.

We've just open-sourced our GitHub repo, Hugging Face dataset, and Docker images.

Essentially, you can just start evaluating with `pip install programbench && programbench eval <your submission>`

Github is at https://github.com/facebookresearch/programbench

Sorry that it's just closed-source models right now. We have a few open-source models in the pipeline, but so far we've had an even harder time getting them to behave well on these tasks (open-source models tend to be somewhat more overfitted to things like SWE-bench, so they often have a harder time with new benchmarks).

We're also planning to open the benchmark for submissions quite soon, similar to what we did on SWE-bench and its variants.

[-]

Foreign_Risk_2031@reddit

Do you give the models access to a kernel level debugger?

[-]

trueselfdao@reddit

Is the performance of the generated program considered? Do you think this might change the language preference?

Also, I am curious if this point can be substantiated further.

These [programming language] preferences likely reflect differences in training data composition and instruction tuning rather than task-level signals, as the same tasks elicit different language choices from different models

While it may be due to different training and instruction, the question of when and based on which task parameters the model chooses the language seems somewhat interesting to me given the skew.

[-]

klieret@reddit (OP)

Performance is only considered indirectly: If agents do something real stupid and exceed total agent runtime, they're killed.

But would be cool to have performance extensions of the benchmark!

But first, agents will have to solve instances fully 😉

[-]

RegisteredJustToSay@reddit

Why no reverse engineering? I figure it reduces the complexity somewhat (especially if you allow the solution to be in the same language the source binary was written in before compilation), but as someone who's been involved in similar work in the past, it would actually help make it more realistic. Knowing the very specific control and data flow is really important for nailing complex nuances in program behavior, especially when it comes to data formats. I didn't dive deep on how well documented those quirks are in the bench though.

[-]

klieret@reddit (OP)

tl;dr because we started from the question "Can LMs build programs from scratch", rather than "How well can LMs patch together bits of decompiled code".

[-]

RegisteredJustToSay@reddit

I guess that's fair, I'd just be concerned about how much they are penalized for not recreating subtle behavioral differences rather than the functional requirements. Like I said, I haven't looked as deep as I'd like to yet, but at a high level that all sounds reasonable to me.

[-]

tuit_and_shout@reddit

Can any human do that at 100% with an infinite amount of time? What % would they achieve? Or, how many humans are needed to complete the test at 50% in the same amount of time that any language model needed?

[-]

klieret@reddit (OP)

With an infinite amount of time? Definitely you can get to 100%. We believe that all tasks are solvable by design.

There's some simple tasks that are realistic for a single human to implement in a couple of days, but getting to the tail end of the difficulty spectrum for things like ffmpeg, we'd probably be talking years.

[-]

klieret@reddit (OP)

We have a bunch of people here to answer any questions. Oh and here's a bigger leaderboard with some more details (cost and calls are per instance). Sonnet is the most expensive one here, we spent almost 5k on that run. Important point is also that we barely killed any agent at all, they almost all declared they were done and submitted.

[-]

jazir55@reddit

Why prohibit decompilation? Are they able to actually read the binary from the compiled executable? Is this simply a case of "they actually can't read the code in the first place" and thus they're just guessing? How else are they supposed to figure it out without diagnostic tools? Seems like a flawed benchmark.

[-]

klieret@reddit (OP)

They can access and explore the executable and they do have some usage docs. Enough to explore everything. The reason for no decompilation is that we want to interpret scores as "how good are models at building stuff from scratch" rather than "how good is decompilation"

[-]

jazir55@reddit

explore the executable

Explore how? My question is centered on how are they analyzing it? Are they actually able to see the code contained within it, or infer anything about its contents? If the file is a black box, do they use vision to try to recreate the program from how its GUI functions? If the file isn't a black box, then how do the models get visibility?

[-]

klieret@reddit (OP)

They can just run it! The program is given to the agent and is executable, just not readable (that's a cool feature of the linux kernel, you can execute things without necessarily having read permissions on it). E.g., let's say you wonder how `jq` works with a specific json file. Then the agent can just create a sample json file and call `jq` on it.
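As a rough illustration of the execute-without-read setup described above, here is a minimal Python sketch for a POSIX system. The thread only says that read access to the executable is removed; how ProgramBench actually configures this is not described, so the helper names `make_execute_only` and `is_execute_only` and the mode `0o111` are assumptions.

```python
import os
import stat
import tempfile

EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
READ_BITS = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH


def make_execute_only(path: str) -> None:
    """Set mode 0o111: executable by everyone, readable and writable by no one."""
    os.chmod(path, EXEC_BITS)


def is_execute_only(path: str) -> bool:
    """True if the file has at least one execute bit and no read bits set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & EXEC_BITS) and not (mode & READ_BITS)


# Demo on a placeholder file (not a real binary), just to show the mode bits.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "tool")
    with open(target, "wb") as handle:
        handle.write(b"placeholder")
    make_execute_only(target)
    print(oct(stat.S_IMODE(os.stat(target).st_mode)))  # 0o111
    print(is_execute_only(target))                     # True
```

A few caveats apply to this sketch. Execute-only works for compiled binaries, but not for scripts, because the interpreter has to read the file. The root user bypasses these permission checks, and a file's owner can always change the mode back, so the restriction only holds if the agent runs as a different, unprivileged user.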

[-]

jazir55@reddit

Ok, so it is what I was thinking, except for CLI: they see what the function is without knowing how the program does it, and then try to replicate the feature?

[-]

klieret@reddit (OP)

correct, but the docs tell it the big picture and then they need to experiment to explore & start replicating

[-]

Adrian_Galilea@reddit

Interesting.

Would love to see a couple open models.

Glm 5.1, deepseek v4 pro… :)

[-]

CryinHeronMMerica@reddit

Maybe even a llama model, given this was made by Meta 😂

[-]

klieret@reddit (OP)

We will have results for Muse Spark once it's fully available on the API (https://ai.meta.com/blog/introducing-muse-spark-msl/)

[-]

klieret@reddit (OP)

yeah, sorry, we're definitely working on this right now, hopefully we'll have that by next week. It's just that open models tend to behave somewhat worse on new tasks (they tend to be a bit more overfit on benchmarks)

[-]

kettal@reddit

what are the scores when you turn on internet access, and allow binary analysis?

[-]

klieret@reddit (OP)

It's not too different, but there's a lot of cheating (looking up true solutions) going on with internet access, so we need to have an LM as a judge to remove those instances. We write about this in https://programbench.com/static/paper.pdf section 4.1

[-]

Accomplished_Mode170@reddit

This is awesome. TY; stoked to look at GHidraMCP prompting

[-]

klieret@reddit (OP)

For the purpose of this benchmark, we make it impossible to decompile, because we want this to be a proxy for "true architecture from scratch without help", not a decompilation benchmark. We remove read access to the executable, so you can't even call `strings` on it or anything.

[-]

klieret@reddit (OP)

(https://programbench.com/#faq-decompilation and also we talk a bit more about inference settings in https://programbench.com/blog/is-programbench-impossible/)

[-]

Perfect-Campaign9551@reddit

That's not cheating. It's exactly what a human would do

[-]

klieret@reddit (OP)

It's cheating because it defeats the purpose of the benchmark and because the prompt says "you MUST not look up the original source code". Or you could call it instruction-following failures. Either way, those instances should be zeroed out.

But even without zeroing it out, it's not like solve rates jump that much higher.

[-]

klieret@reddit (OP)

We don't really have an official ablation for binary analysis, but in our first runs we only discouraged it, and models would throw things like `strings` at the binary. It didn't seem to make scores jump all that much, tbh. We disallow it not to make this artificially hard, but because we don't want to measure decompilation capabilities. The question is more like "what happens if you ask LMs to build a large project from scratch without giving any kind of reference structure, so that it truly has to make all architectural decisions". Decompilation just gives back some of this structure, and that's why we don't want it.

[-]

kettal@reddit

i just want it to reverse engineer excel.exe

[-]

bad_gambit@reddit

Since this is made by Meta, I'm assuming Meta has curated a significant dataset on these programs? Now I'm curious how Muse Spark performs. 🤔😆

[-]

Mochila-Mochila@reddit

Wow, congrats, that's really cool and certainly needed to better measure what full vibe coding can actually achieve!

I also don't understand the pushback ITT against the "no Internet" rule. I'm not a coder but... surely, computer programs existed and were hand-coded before the Internet was even born? So why would it be unfair to expect supposedly advanced software to do just that?

But anyway, perhaps two variants could be made, ProgramBench Pure (no Internet) and ProgramBench Easy (Internet allowed), to please all.

[-]

klieret@reddit (OP)

we have the ablation in the paper! it's just that evaluating with internet means you have to disqualify a lot of solutions for cheating, using an LM as a judge, and that's usually not great for benchmarks

[-]

CalligrapherFar7833@reddit

Interesting. No per-effort breakdown of the models; also GPT 5.5 not tested.

[-]

klieret@reddit (OP)

5.5 is in the works. If we do a reasoning breakdown for one model, we'd have to do it for all, so that takes some time

[-]

nuclearbananana@reddit

Are models able to see the tests? Otherwise they're just guessing at the scope of its capabilities

[-]

klieret@reddit (OP)

models cannot see any of the tests. We do however ship usage documents, and most executables do have `--help` etc. So definitely wouldn't say the models have to guess anything

[-]

Technical-Earth-3254@reddit

This kinda fits my experience with vibe coding. But since I know how to code and just tell the AI where to put what, I can work with way less powerful and inexpensive models and still get a lot more throughput than when I'm doing it myself. I love AI assistance.

[-]

under_psychoanalyzer@reddit

What are you using for an IDE?

[-]

Technical-Earth-3254@reddit

VS Code mostly

[-]

siete82@reddit

Are you using Continue or any other extension?

[-]

under_psychoanalyzer@reddit

Cool, I just like to spot check what preferences people have. I liked my Cursor trial, but if I'm going to pay for integration I'd rather just pay Anthropic for Claude Code directly

[-]

Technical-Earth-3254@reddit

I never understood Cursor's pricing, so I never used it tbh. I like VS Code for the modularity. I have a ChatGPT Plus sub and the Codex extension, but I'm using it maybe 2 times a month when I can't find a bug or want a second opinion on something. Previously I used GPT 5.1 Codex Max, but that's gone now, so I'm going with 5.3 Codex.

[-]

draconic_tongue@reddit

That sounds nice. What I don't like so far is having an idea but not knowing how to do it, or after doing it not knowing if it was actually a good idea or not, and not being able to fully trust the llm about its assessment on it. I've spent weeks being stuck thinking and talking about one part of the codebase I'm working on because words is all I have. But I'm not complaining that much since it's better than not being able to do anything

One thing I definitely don't like so far is workflows where people let the llm commit and push changes without looking at the code beforehand. Maybe they already know what they are doing so they don't need to, idk. Since that's not the case for me, I find it pretty annoying when some workflow I try (latest was beads) adds instructions to my agents.md that tells the llm to always push the changes, and then when I look at it I find that it did something I don't like. I can only really imagine doing that for things I don't give a shit about.

With codex that has been my experience a few times already in general. I feel like we reach an agreement only to find out it did something different than what I expected and then I need to "scold" it but it feels like I'm scolding myself since I'm the one that doesn't know what the fuck is going on

Still firmly believe it's better than studying for years though when you just want to do something, but it can be pretty painful. I also can't imagine actively picking up a language to learn, and code is also technically language, so if ai stuff turns out as a gateway to getting good at coding, just like playing games or using the internet is good for learning english, that would be pretty good

[-]

rpkarma@reddit

AI can absolutely be super useful for teaching coding! But not in its default mode unfortunately. You have to write a prompt/AGENTS.md/persona whatever to get it to be in “teacher, not generator” mode, where it gives you pseudo code and pointers to APIs and documentation and language idioms.

Then it’s a force multiplier for teaching :)

[-]

IrisColt@reddit

Thanks!!!

[-]

Drenlin@reddit

I think it's at its best when you're somewhere in the middle. I can't code by myself, but I can look through what the LLM writes and figure out what each block of code does well enough to question its decisions or recognize some signs of bad code.

Could I do this well enough to support an enterprise product? Absolutely not. Does it save hundreds of hours by letting me solve a local problem without learning to code in three languages? Absolutely yes.

[-]

IrisColt@reddit

This.

[-]

Glazedoats@reddit

I want to end up learning coding, without the vibing, so I can instruct the model better. :]

[-]

IrisColt@reddit

Wow, I've got some janky tests that flirted with this idea, but not in nearly as ambitious a way. My personal experience is that, given that the LLM is allowed to do everything (except for that little list of conditions against searching the web or decompiling), it ends up getting lost in all that freedom... Also, the tendency to default to Python is maddening.

[-]

Sabin_Stargem@reddit

For my part, I am hoping to someday have AI analyze old games and reconstruct them, for newer platforms.

Operation Inner Space from the Windows 3.1 era is the sort of thing that would go well with a gamepad, plus other things. Say, for example, your media library being used as map backgrounds or as the soundtrack for the area.

OIS was interesting, because it had pickups based on your actual OS files. You could blast the Internet Explorer.exe into shards, then get pursued by donut-loving enforcers for daring to harm Microsoft property.

[-]

DuranteA@reddit

Great work, and remarkably forward-looking!

Interestingly, this is one of the rare benchmarks that has Sonnet 4.6 substantially outperforming (if you consider partial success) GPT and Gemini. That matches my experience actually trying to work with these models, but it doesn't match scores in most benchmarks. Obviously just an anecdote, but still noteworthy IMHO.

(Those benchmarks that seemingly overrate GPT / underrate Sonnet include my own on parallelization for HPC, the paper about which just got accepted into Europar -- I have to get a preprint up)

[-]

DramaLlamaDad@reddit

How can I be the only one who is frustrated by seeing a post like this? How many actual coders would be able to complete this task under the same restrictions? How does this change when you actually have a competent engineer driving and with internet access? What is this really supposed to be proving? Why have a benchmark designed with the intention of it being used the exact way we've all been saying it shouldn't be used and then show us a table full of 0% results? Just... frustrating.

[-]

klieret@reddit (OP)

To give a bit of context: (1) There were a number of case studies that essentially claimed that tasks like this are basically getting solved by LMs now. Our larger-scale benchmark puts a question mark behind this. (2) We have an ablation regarding internet access in the paper and find that it opens the door to cheating, but otherwise doesn't make the scores jump too crazily. (3) What is difficult for humans and what is difficult for LMs are different, so it's very hard to say what's impossible and what isn't. For SWE-bench, nobody wanted to work on it at the beginning because people thought it was impossible. (4) Not all of these tasks are that hard; we absolutely expect some of them to be solved quite soon.

More: https://programbench.com/blog/is-programbench-impossible/

[-]

DramaLlamaDad@reddit

Can you link to some of those studies that claimed that LMs (or ANYONE) should be able to do these things in a vacuum? Nothing in the past 30 years was created that way. To some extent, almost everything is built on existing knowledge and previous work.

[-]

klieret@reddit (OP)

https://www.anthropic.com/engineering/building-c-compiler

[-]

Former-Ad-5757@reddit

Literally first sentence : We tasked Opus 4.6 using agent teams to build a C Compiler, and then (mostly) walked away. Here's what it taught us about the future of autonomous software development.

MOSTLY walked away...
Disregarding that they don't claim to use your agent-harness, they don't claim no internet.

But your test is nice, I just think it is a little too hard yet, we are still in the age of context rot, it is a new kind of test which will not exist much in training data, so it will need to use a lot of context to find the way to handle your test, and when (/if) it has figured it out, then it has run out of good context.
In current times I would suspect this to rely heavily on the harness (/memory in the harness) to get around the exploration to context rot

[-]

klieret@reddit (OP)

https://cursor.com/blog/scaling-agents

[-]

buttplugs4life4me@reddit

I think in the future when you mention you restricted internet access, you should also mention that you didn't see any score increases with it on.

From my POV I could probably not implement a whole program without asking SO at least once. But I'm also not an LLM and some of the stuff they've done (like look into the installed Nuget package code via decompilation) in my projects is not something I would've done when documentation is available online.

So when you wrote you disabled Internet access, I was ready to kind of dismiss your results until I read this comment. Of course I could've read more before, but people are busy nowadays

[-]

klieret@reddit (OP)

I get where you're coming from, but as someone who's working on benchmarks, I get super suspicious of anything that has internet access, because models get super sneaky with cheating.

[-]

buttplugs4life4me@reddit

Yeah I can understand that. It would probably be easy for a model to find the exact problem or even query another model if it has full browser access lol.

I mostly think a disclaimer that internet access doesn't really help would get you taken more seriously than a blanket "We disabled internet access"

[-]

chigur86@reddit

In addition to the replies from OP (I wholeheartedly agree with those) there’s another implication for those of us trying to build startups on top of these models: if your workflow hasn’t been RL’d into the base model, then there’s a chance of your survival. On the flip side, if the big lab post training eye of Sauron gazes in your direction then you are toast. Finally, benchmarks are how progress in ML happens. I am pretty sure soon this one will outlive its utility.

[-]

crantob@reddit

Sir, this is a LocalLLaMA.

[-]

Distinct_Fox_6358@reddit

I’m curious about the score of GPT-5.5 xhigh.

[-]

klieret@reddit (OP)

def working on evaluating GPT 5.5 and a couple of others next!

[-]

tomobobo@reddit

Cool idea, I really like the approach of keeping the model in the dark and letting them struggle it out without the ability to "cheat".

I've always had thoughts about if all of the same efforts put into image models were done with program executables, we could skip the whole vibe coding process all together and just have a model spit out working binary.

I think the concept is so alien though to the models as this isn't what their training consists of, so I'm not sure how any model will ever get better at this benchmark without being aware of even one successful example.

And at that point, to instill the concept into the model, do you think it would be easier to just say, train a diffusion model on binary applications?

[-]

anotherthrowaway469@reddit

This looks great, much more useful than the various toy benchmarks that are so popular. It would be super useful for comparing agent/harness performance too.

[-]

klieret@reddit (OP)

agreed. So far we deliberately evaluated with mini-swe-agent, because it's much less overfit to any specific task (see also here https://programbench.com/#faq-agent-scaffold), but we're thinking about opening up submissions for any agentic system, much like SWE-bench initially

[-]

anotherthrowaway469@reddit

Cool! IMO it's worth looking into running your own dedicated agent benchmark too, IME it can matter even more than the model sometimes.

[-]

chigur86@reddit

Great work! One suggestion for the leaderboard: a separate board for meta agents that evolve agent harnesses. Since the tasks are verifiable, it’s straightforward to dump test time scaling strategies at it. Thus, one can ask: what’s the cost to reach a certain accuracy?

[-]

klieret@reddit (OP)

We're gonna open up for submission sometime soon! We're still thinking about the exact rules of submissions, but would probably do it similar to what we did with SWE-bench, so any harness that doesn't outright cheat would be allowed.

[-]

IllustriousLength991@reddit

That's a strong direction for agent evaluation because it tests end-to-end program building instead of narrow coding tasks. The no-internet, no-decompilation, executable-only setup also helps reduce shortcutting and benchmark gaming. A 200-task suite with behavioral tests sounds more meaningful than one-off demos. The main limitation right now is that results are mostly for closed-source models, so it’ll be interesting to see how open models perform as they catch up.

[-]

klieret@reddit (OP)

we're working on it! Open models are a little bit more work right now because they typically are a little bit more overfitted on established benchmarks, so they behave weirdly on something new like this

[-]

Able_Zombie_7859@reddit

Is it treating these builds as staged and architected or just trying to do it? Building new apps doesn't work that way either. Is it building a plan and phases of production with internal reviews and tests during and after phases, like bmad for example? I don't think anyone would expect any sort of result without more complex agentic planning and execution; no one should expect this to work with just "here is the binary and some docs, go!"

[-]

klieret@reddit (OP)

why should this not be possible? When SWE-bench was launched, the common criticism was "this is impossible, no one will evaluate on this". ProgramBench already has a few instances that we consider "almost" solved. And yes, multi-agents might push this further, but that would be great, because then this would be one of the first benchmarks that really show the limits of single agents.

[-]

Able_Zombie_7859@reddit

Because the planning and tokens for planning an entire app need to be broken into structured planning phases, this is true for building a new app or cloning one, in fact there is no differences, both are building an app from scratch with direction but no actual code. No one is making apps from scratch this way, the expectation apps could be cloned this way will (and has) end in similar results

[-]

klieret@reddit (OP)

We wrote some more about this here https://programbench.com/#faq-agent-scaffold and here: https://programbench.com/blog/is-programbench-impossible/ .

We'll also be opening submissions for other agentic harnesses soon. We'd be quite excited if this benchmark clearly shows the limits of single-agent systems.

[-]

metaden@reddit

I want to see the implementations from these models. Is that possible? Or are they private to avoid contamination?

[-]

klieret@reddit (OP)

we currently don't have them public, but might release them in the future. What's open are the docker images, the eval harness, and the tests (Hugging Face dataset). Planning to release the baseline agent system at https://github.com/SWE-agent/mini-swe-agent soon.

[-]

2Norn@reddit

my small brain kinda thinks showing 0% resolved (at least right now when none of them can fully resolve a single task) is a bit silly

maybe showing the average score would be better? or maybe some internal scoring algorithm based on task difficulty or language idk...

good benchmark tho, and we deffo need good benchmarks... hopefully this fleshes out and becomes the new norm for both models and harnesses.

[-]

klieret@reddit (OP)

There's some more results here: https://programbench.com/extended/

We thought a long time about how we should be scoring. The big problem with average score is that not all tests are created equal, and even a single failed test could reveal a big flaw in a program. So going for % unit tests passed on avg is just a very misleading metric. Same reason we didn't do this with SWE-bench (you need to pass all tests to mark a task as resolved).

ProgramBench tasks are super diverse, there's some that we 100% expect to be solved in the coming months and some that we expect to remain too challenging for a long time.
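The gap between the two scoring options discussed in this comment can be shown with a short Python sketch. The function names and the sample numbers are invented for illustration and are not taken from the benchmark: `resolved_rate` counts a task only if every one of its tests passes, while `average_pass_rate` averages the per-task fraction of passing tests.

```python
def resolved_rate(results: dict[str, list[bool]]) -> float:
    """Fraction of tasks where every test passed (a task with no tests does not count)."""
    if not results:
        return 0.0
    resolved = sum(1 for tests in results.values() if tests and all(tests))
    return resolved / len(results)


def average_pass_rate(results: dict[str, list[bool]]) -> float:
    """Mean over tasks of the fraction of tests passed."""
    rates = [sum(tests) / len(tests) for tests in results.values() if tests]
    return sum(rates) / len(rates) if rates else 0.0


# Invented example: one task fails a single test out of 100, another fails half.
results = {
    "task_a": [True] * 99 + [False],
    "task_b": [True] * 50 + [False] * 50,
}
print(resolved_rate(results))                 # 0.0
print(round(average_pass_rate(results), 3))   # 0.745
```

Here the average looks respectable even though neither program is a working replacement, which is the kind of misleading signal the all-tests-must-pass rule avoids.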

[-]

Foreign_Risk_2031@reddit

Facebook posting where llama was leaked is so funny

[-]

knoodrake@reddit

I like this benchmark ! ( if it doesn't end up being trained on )

[-]

klieret@reddit (OP)

nobody should train on the benchmark. But people will create training data that's similar. But that's also what leads to better LMs. Benchmarks will always be a little bit overfitted, but for example training on SWE-bench-like tasks is what gave us coding agents

[-]

Opening-Broccoli9190@reddit

Thank you, a very interesting benchmark. What do you think the real limitation is? Is it because a model currently cannot navigate a binary like a human does, therefore it cannot build a mental map of the product?

[-]

klieret@reddit (OP)

great question. We don't fully know. With new frontier benchmarks, part of it is always that they fall outside of the training data distribution, so models can do surprisingly stupid stuff. This was the main reason why e.g., initial SWE-bench scores were super low and scaffolds like our SWE-agent needed to be quite complicated to make up for all of the model quirks. Then model providers started to train on similar tasks, and suddenly you can get away with minimal stuff like our https://github.com/SWE-agent/mini-swe-agent/

So I'm sure this is part of the answer here. Because some of the tasks really should be solvable. Like yes, there's also some TUIs etc., but there's also something like `jq` which is in some sense quite simple in that it takes input and gives output.

[-]

Perfect-Campaign9551@reddit

No Internet access is kind of dumb imo. Models aren't going to have everything built in. They needed context to work, and web searches are part of that

I don't know what you're trying to prove with "no Internet access".

[-]

klieret@reddit (OP)

we have an ablation that tests running the agents with internet access. the scores weren't much higher: https://programbench.com/static/paper.pdf section 4.1. The main issue with internet access is that models cheat like crazy and it's very hard to know where to draw the line. When Anthropic blogposted about Claude building a C compiler from scratch, this was also without internet. The task definitely gets harder, and near impossible for a human. But probably not for the next generation of LMs.

[-]

divide0verfl0w@reddit

Looks like the agent was able to vibecode the website though.

On mobile programbench.com website expands the dropdown for the Github menu item while scrolling as if it’s hovered. Expand a few of the FAQ bullet points and try scrolling up and down.

Probably ids/classes shared across collapsible components. Kudos for somehow attaching it to a scroll offset or scroll event though!

[-]

klieret@reddit (OP)

oops. let me fix that. Appreciate the report!

[-]

Baphaddon@reddit

Hmmm, well, to be fair, does a person need 100% of a program's capabilities all the time?

[-]

zqkb@reddit

Nice! Were there some examples of simpler applications which were fully solved with the current models?

If we were to use the benchmark to identify the 'frontier complexity' of the application which is currently possible to recreate, what would that be? We can expect some 'hello world app' to work, and your current dataset is beyond model ability.

[-]

klieret@reddit (OP)

we didn't exclude any applications, we just selected repos that had reasonable numbers of stars, contributors, etc. It's just that a "hello world" application never gets many stars, so it also never found its way into the benchmark. But I think this is actually good like this, because that way the % resolved has a cleaner interpretation (rather than artificially introducing easy instances)

[-]

DataGOGO@reddit

yes, but you can't one shot it like you are attempting to do.

[-]

klieret@reddit (OP)

it's not a one shot, we allow up to 1k agentic steps and I think no model but a single one ever even got there. they all were like "we're done, here's your binary" and the executable they built is just incomplete. So I think it's absolutely fair
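For readers unfamiliar with step budgets, here is a hedged Python sketch of how an episode with a step limit and a wall-clock limit might end. It is not the benchmark's harness or mini-swe-agent's code: `run_episode`, the `"submit"` action string, and the one-hour default are all hypothetical. Only the 1k step limit comes from the comment above.

```python
import time
from typing import Callable


def run_episode(next_action: Callable[[int], str],
                max_steps: int = 1000,
                max_seconds: float = 3600.0) -> tuple[str, int]:
    """Run a (hypothetical) agent until it submits or exhausts a budget.

    Returns (outcome, steps_taken), where outcome is one of
    "submitted", "killed_step_limit", or "killed_timeout".
    """
    start = time.monotonic()
    for step in range(1, max_steps + 1):
        if time.monotonic() - start > max_seconds:
            return "killed_timeout", step - 1
        if next_action(step) == "submit":
            return "submitted", step
    return "killed_step_limit", max_steps


# An agent that declares itself done after 40 steps, long before the limit.
print(run_episode(lambda step: "submit" if step == 40 else "work"))
# ('submitted', 40)
```

The behavior described in the comment corresponds to the first outcome: nearly every run ends with a voluntary submission well inside the budget, so low scores are not explained by agents being cut off.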

[-]

SangersSequence@reddit

I'd be curious if they can get there with a "supervisor" model that attempts to run the produced binary and feeds back the result.

[-]

DataGOGO@reddit

Oh it is absolutely fair, sorry if I was not clear. I agree the models are just not there yet, they need a lot more task division and supervision or they will just get lazy and tap out.

[-]

klieret@reddit (OP)

Oh yeah, that's totally fair. We definitely also expect that multiagent setups will push the frontier on this benchmark. But to my knowledge, this is one of the first benchmarks where single agents really seem to reach their limits (because e.g., on SWE-bench multiagent systems never consistently outperformed the best single agents). And that alone makes us excited! Because we really hope that this reinvigorates agent scaffold design studies

[-]

Hoak-em@reddit

Can you use custom harnesses in this? I’d want to test it against opencode with various plugins and forgecode

[-]

klieret@reddit (OP)

yes, it should be very simple. We've published all inference containers (see here https://github.com/facebookresearch/ProgramBench/tree/main/docs).
There's a couple of things to take into account (no internet etc.), but other than that it should be straightforward.

[-]

klieret@reddit (OP)

We're hoping to open for submissions soon (rules yet to be determined, we want to make sure that all submissions are relevant and don't cheat).
We'll also publish our baseline inference setting in https://github.com/SWE-agent/mini-swe-agent/ this week

[-]

vnjxk@reddit

This is a good benchmark, I was about to write that I hope it will be taken seriously, but then I saw that it's by facebookresearch, so that's some good news